Open idegtiarenko opened 1 month ago
Pinging @elastic/es-distributed (Team:Distributed)
Do we need client caching too? Server caching should help when there are few nodes with large number of shards. But if there many nodes that frequently poll stats, networking overhead will cripple up. May be distribute load, client picks random node from cluster and if that node does not have it in cache, forward to master and populate cache.
TransportGetAllocationStatsAction$Response
is pretty small, ~100B/node or thereabouts, I think caching it on the master is sufficient.
Elasticsearch Version
8.14
Installed Plugins
No response
Java Version
bundled
OS Version
any
Problem Description
TransportGetAllocationStatsAction runs on elected master so it is theoretically possible to overload it by executing node stats requests around various nodes in cluster, especially in a clusters with many shards as complexity is proportional to the shard count.
The aggregated result of the computation is small (5 numbers per node), we should consider caching it for small period of time (1 minute?) and reuse it between different calls during.
Steps to Reproduce
n/a
Logs (if relevant)
No response
Please let me know if this issue is open. If it is can you please assign it to me.
TransportGetAllocationStatsAction$Response
is pretty small, ~100B/node or thereabouts, I think caching it on the master is sufficient.
Please let me know if this issue is open. If it is can you please assign it to me.
Still available @shreedaddy thanks for the offer. We can't assign issues to folks outside the @elastic org but if you want to contribute a PR then please feel free.
Still available @shreedaddy thanks for the offer. We can't assign issues to folks outside the @elastic org but if you want to contribute a PR then please feel free.
Will get started.
Still available @shreedaddy thanks for the offer. We can't assign issues to folks outside the @elastic org but if you want to contribute a PR then please feel free.
Dave I have already completed analysis but to get started on the development I do not have access to create a new branch for this issue. Can you please provide this particular privilege.
TransportGetAllocationStatsAction$Response
is pretty small, ~100B/node or thereabouts, I think caching it on the master is sufficient.
Dave, Per the issue description TransportGetAllocationStatsAction runs in the elected master. After looking into the code TransportGetAllocationStatsAction is used to provide some additional paramters like TYPE, identifier for the node.So TransportGetAllocationStatsAction does not contribute to the response payload.TransportNodesStatsAction is the class which responds with node statsitics and is the one that generates response payload. Below is the line that where it does that.
"newResponse(request, merge(responses, r.getNodeAllocationStats(), r.getDiskThresholdSettings()), failures)"
Based on the above it is my understanding we need cache the responses parameter instead of TransportGetAllocationStatsAction. Please let me know if my understanding is correct or if I am missing something.
Hi @shreedaddy and thanks for your interest here. A PR would be welcome, but I don't really have the capacity to offer you as much help as you are asking for here. If you cannot work this out yourself from the code and surrounding documentation, please pick a different issue to address.
Dave I have already completed analysis but to get started on the development I do not have access to create a new branch for this issue. Can you please provide this particular privilege.
Dave I have already completed analysis but to get started on the development I do not have access to create a new branch for this issue. Can you please provide this particular privilege.
Dave I have already completed analysis but to get started on the development I do not have access to create a new branch for this issue. Can you please provide this particular privilege.
No, you already have all the privileges you need to open a PR based on a branch in your own personal fork.
It would be better to use TransportClusterAllocationExplainAction instead of TransportGetAllocationStatsAction. Of course, the result must be different from what you want, but if the issue occurs, TransportClusterAllocationExplainAction is more suitable than TransportGetAllocationStatsAction. I’m still gaining experience, so I wonder if this idea could be beneficial. I hope you don't mind me suggesting this, as I'm still learning. I appreciate any feedback you have. Thanks.
Elasticsearch Version
8.14
Installed Plugins
No response
Java Version
bundled
OS Version
any
Problem Description
TransportGetAllocationStatsAction runs on elected master so it is theoretically possible to overload it by executing node stats requests around various nodes in cluster, especially in a clusters with many shards as complexity is proportional to the shard count.
The aggregated result of the computation is small (5 numbers per node), we should consider caching it for small period of time (1 minute?) and reuse it between different calls during.
https://github.com/elastic/elasticsearch/blob/5e5205994747618edf5d81cb6a813adef9f2dbce/server/src/main/java/org/elasticsearch/action/admin/cluster/node/stats/TransportNodesStatsAction.java#L88-L96
Steps to Reproduce
n/a
Logs (if relevant)
No response