elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.42k stars 24.57k forks source link

TransportGetAllocationStatsAction may cause significant load on elected master #110716

Open idegtiarenko opened 1 month ago

idegtiarenko commented 1 month ago

Elasticsearch Version

8.14

Installed Plugins

No response

Java Version

bundled

OS Version

any

Problem Description

TransportGetAllocationStatsAction runs on elected master so it is theoretically possible to overload it by executing node stats requests around various nodes in cluster, especially in a clusters with many shards as complexity is proportional to the shard count.

The aggregated result of the computation is small (5 numbers per node), we should consider caching it for small period of time (1 minute?) and reuse it between different calls during.

https://github.com/elastic/elasticsearch/blob/5e5205994747618edf5d81cb6a813adef9f2dbce/server/src/main/java/org/elasticsearch/action/admin/cluster/node/stats/TransportNodesStatsAction.java#L88-L96

Steps to Reproduce

n/a

Logs (if relevant)

No response

elasticsearchmachine commented 1 month ago

Pinging @elastic/es-distributed (Team:Distributed)

mhl-b commented 1 month ago

Do we need client caching too? Server caching should help when there are few nodes with large number of shards. But if there many nodes that frequently poll stats, networking overhead will cripple up. May be distribute load, client picks random node from cluster and if that node does not have it in cache, forward to master and populate cache.

DaveCTurner commented 1 month ago

TransportGetAllocationStatsAction$Response is pretty small, ~100B/node or thereabouts, I think caching it on the master is sufficient.

shreedaddy commented 1 month ago

Elasticsearch Version

8.14

Installed Plugins

No response

Java Version

bundled

OS Version

any

Problem Description

TransportGetAllocationStatsAction runs on elected master so it is theoretically possible to overload it by executing node stats requests around various nodes in cluster, especially in a clusters with many shards as complexity is proportional to the shard count.

The aggregated result of the computation is small (5 numbers per node), we should consider caching it for small period of time (1 minute?) and reuse it between different calls during.

https://github.com/elastic/elasticsearch/blob/5e5205994747618edf5d81cb6a813adef9f2dbce/server/src/main/java/org/elasticsearch/action/admin/cluster/node/stats/TransportNodesStatsAction.java#L88-L96

Steps to Reproduce

n/a

Logs (if relevant)

No response

Please let me know if this issue is open. If it is can you please assign it to me.

shreedaddy commented 1 month ago

TransportGetAllocationStatsAction$Response is pretty small, ~100B/node or thereabouts, I think caching it on the master is sufficient.

Please let me know if this issue is open. If it is can you please assign it to me.

DaveCTurner commented 1 month ago

Still available @shreedaddy thanks for the offer. We can't assign issues to folks outside the @elastic org but if you want to contribute a PR then please feel free.

shreedaddy commented 1 month ago

Still available @shreedaddy thanks for the offer. We can't assign issues to folks outside the @elastic org but if you want to contribute a PR then please feel free.

Will get started.

shreedaddy commented 1 month ago

Still available @shreedaddy thanks for the offer. We can't assign issues to folks outside the @elastic org but if you want to contribute a PR then please feel free.

Dave I have already completed analysis but to get started on the development I do not have access to create a new branch for this issue. Can you please provide this particular privilege.

shreedaddy commented 1 month ago

TransportGetAllocationStatsAction$Response is pretty small, ~100B/node or thereabouts, I think caching it on the master is sufficient.

Dave, Per the issue description TransportGetAllocationStatsAction runs in the elected master. After looking into the code TransportGetAllocationStatsAction is used to provide some additional paramters like TYPE, identifier for the node.So TransportGetAllocationStatsAction does not contribute to the response payload.TransportNodesStatsAction is the class which responds with node statsitics and is the one that generates response payload. Below is the line that where it does that.

"newResponse(request, merge(responses, r.getNodeAllocationStats(), r.getDiskThresholdSettings()), failures)"

Based on the above it is my understanding we need cache the responses parameter instead of TransportGetAllocationStatsAction. Please let me know if my understanding is correct or if I am missing something.

DaveCTurner commented 1 month ago

Hi @shreedaddy and thanks for your interest here. A PR would be welcome, but I don't really have the capacity to offer you as much help as you are asking for here. If you cannot work this out yourself from the code and surrounding documentation, please pick a different issue to address.

shreedaddy commented 1 month ago

Dave I have already completed analysis but to get started on the development I do not have access to create a new branch for this issue. Can you please provide this particular privilege.

Dave I have already completed analysis but to get started on the development I do not have access to create a new branch for this issue. Can you please provide this particular privilege.

DaveCTurner commented 1 month ago

Dave I have already completed analysis but to get started on the development I do not have access to create a new branch for this issue. Can you please provide this particular privilege.

No, you already have all the privileges you need to open a PR based on a branch in your own personal fork.

dragan509 commented 1 month ago

It would be better to use TransportClusterAllocationExplainAction instead of TransportGetAllocationStatsAction. Of course, the result must be different from what you want, but if the issue occurs, TransportClusterAllocationExplainAction is more suitable than TransportGetAllocationStatsAction. I’m still gaining experience, so I wonder if this idea could be beneficial. I hope you don't mind me suggesting this, as I'm still learning. I appreciate any feedback you have. Thanks.