elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Allow for RoutingAllocation DebugMode to skip string formatting in results #91167

Open jbaiera opened 2 years ago

jbaiera commented 2 years ago

The allocation decider framework allows for a debug mode to be specified which preserves results from each decider and includes additional information about the decision. This debug mode is primarily used to generate the response for the Allocation Explain API. Recently the new Health API has made use of this debug mode in order to find and report on situations where allocation can be blocked because of interactions across multiple deciders.

We found that debug mode is (not surprisingly) less performant than running without it. However, upon studying some benchmark results, we found that one of the most expensive parts of the operation is formatting the human-readable debugging information (benchmark results). The Health API does not actually use this human-readable debugging information; it only cares about which results were returned from each decider.

elasticsearchmachine commented 2 years ago

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine commented 2 years ago

Pinging @elastic/es-data-management (Team:Data Management)

jbaiera commented 2 years ago

Some optimization thoughts I've had that would be great to hear opinions about:

A debug mode (DebugMode.QUIET perhaps?) that simply accumulates the YES/NO answers from deciders without the overhead of formatting human-readable explanations. It would simply elide the explanation field from the Decision object.

Since the Health API cares primarily about a specific set of deciders that it knows how to troubleshoot, perhaps we could tell the allocation framework to target only those deciders when doing the explain operation? This has the added benefit of short-circuiting the health logic: if the explain response for the deciders it knows how to troubleshoot says the shard can be allocated, then we know that the shard is unallocated for a reason outside of the Health API's understanding. The diagnosis steps can be largely skipped.

One option would be to instantiate just the deciders in the Health API and use them to diagnose shards in a bespoke manner, but I worry that this repetition of logic can lead to discrepancies between the Health API's read of what an allocation should be and what the actual Allocation Service has accurately decided on. This seems most likely to evolve into a code smell.
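
To make the first two ideas concrete, here is a minimal, self-contained sketch. It is not the real Elasticsearch code: every type and name in it (`QuietDebugSketch`, the toy `Decider` and `Decision` classes, the `targets` parameter, and the decider names) is a hypothetical stand-in chosen for illustration, and `QUIET` is only the mode name floated above. The point it tries to show is that if explanations are built lazily, a quiet mode can record each decider's verdict without ever running the formatting code, and a target set can restrict evaluation to the deciders the caller knows how to troubleshoot.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch only; these are NOT the real Elasticsearch classes.
final class QuietDebugSketch {

    // ON keeps explanations; QUIET (the proposed mode) keeps verdicts only.
    enum DebugMode { ON, QUIET }

    enum Type { YES, NO }

    // Result of one decider. explanation is null when it was elided.
    static final class Decision {
        final String decider;
        final Type type;
        final String explanation;

        Decision(String decider, Type type, String explanation) {
            this.decider = decider;
            this.type = type;
            this.explanation = explanation;
        }
    }

    // Toy decider: a name, a fixed verdict, and a lazily built explanation.
    static final class Decider {
        final String name;
        final Type verdict;
        final Supplier<String> explanation;

        Decider(String name, Type verdict, Supplier<String> explanation) {
            this.name = name;
            this.verdict = verdict;
            this.explanation = explanation;
        }
    }

    // Collects one Decision per decider. If targets is non-null, only the
    // named deciders are consulted. The explanation supplier is invoked
    // only in ON mode, so QUIET never pays for string formatting.
    static List<Decision> decide(List<Decider> deciders, DebugMode mode, Set<String> targets) {
        List<Decision> results = new ArrayList<>();
        for (Decider d : deciders) {
            if (targets != null && !targets.contains(d.name)) {
                continue;
            }
            String text = (mode == DebugMode.ON) ? d.explanation.get() : null;
            results.add(new Decision(d.name, d.verdict, text));
        }
        return results;
    }

    public static void main(String[] args) {
        List<Decider> deciders = Arrays.asList(
            new Decider("disk_threshold", Type.NO,
                () -> String.format("disk usage %d%% exceeds the watermark", 92)),
            new Decider("filter", Type.YES,
                () -> "node matches the index allocation filters"));

        // QUIET: verdicts are collected, explanations are never formatted.
        for (Decision d : decide(deciders, DebugMode.QUIET, null)) {
            System.out.println(d.decider + " -> " + d.type);
        }
    }
}
```

A caller in the Health API's position would only inspect `type` on each result. Whether the real deciders can defer their formatting this cheaply is exactly the open design question here, so treat this as an illustration of the shape of the change rather than a plan.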

HamzaTatheer commented 2 years ago

Hi, I wanted to know if this is a good first issue. If so, can I contribute to it?

jbaiera commented 1 year ago

@HamzaTatheer Thanks for your interest in this.

I think there is some more room to discuss potential solutions here before anyone starts development on this. Without a more concrete plan for how we'd like to address this performance problem, I don't know yet whether this is a good first issue. Keep an eye out here though; perhaps it becomes a good first issue once that plan comes together.

HamzaTatheer commented 1 year ago

Thanks @jbaiera. I would really like to work on this issue. Once the plan solidifies, I can try to fix this on my side. I will also start looking into possible solutions now. But I still need to know whether this enhancement is actually wanted, or whether it would be rejected because the maintainers prefer not to make it part of Elasticsearch.