Open quux00 opened 3 weeks ago
Pinging @elastic/es-search-foundations (Team:Search Foundations)
I've looked into this and I believe it's caused by a race condition. The NPE is thrown in AsyncSearchTask#onClusterResponseMinimizeRoundtrips() because searchResponse isn't initialised at that point (initialisation is done by AsyncSearchTask#onListShards()), and there is no code path in which onClusterResponseMinimizeRoundtrips() is called before onListShards(). The race condition that leads to this bug can be reproduced by placing a breakpoint somewhere before control reaches onListShards() (e.g., in SearchQueryThenFetchAsyncAction#SearchQueryThenFetchAsyncAction()) and resuming execution once it is hit. I believe this bug isn't easy to hit in production and should be a rare occurrence. However, other factors could make it more likely, so it's unclear how often the customer is hitting it.
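To make the ordering problem concrete, here is a toy model in plain Java. It is not Elasticsearch code: ToyTask, RaceDemo and runWithDelayedInit are hypothetical names, and a CountDownLatch stands in for the breakpoint by holding back the initialising thread until the callback has already run.

```java
import java.util.concurrent.CountDownLatch;

class RaceDemo {
    // Hypothetical stand-in for AsyncSearchTask; only the ordering matters here.
    static final class ToyTask {
        private volatile StringBuilder searchResponse; // null until onListShards() runs

        void onListShards() {
            searchResponse = new StringBuilder();
        }

        void onClusterResponse(String clusterResult) {
            searchResponse.append(clusterResult); // NPE if onListShards() has not run yet
        }
    }

    // Returns true if the callback hit a NullPointerException.
    static boolean runWithDelayedInit() {
        ToyTask task = new ToyTask();
        CountDownLatch callbackDone = new CountDownLatch(1);
        boolean[] sawNpe = new boolean[1];

        Thread init = new Thread(() -> {
            try {
                callbackDone.await(); // plays the role of the breakpoint
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            task.onListShards();
        });
        Thread callback = new Thread(() -> {
            try {
                task.onClusterResponse("remote1");
            } catch (NullPointerException e) {
                sawNpe[0] = true;
            } finally {
                callbackDone.countDown(); // release the delayed initialiser
            }
        });

        init.start();
        callback.start();
        try {
            callback.join();
            init.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sawNpe[0];
    }
}
```

Because the latch forces the callback to run first, the toy fails on every run. In the real system the same ordering would only occur when timing happens to delay initialisation, which fits the expectation that it is rare.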
Elasticsearch Version
Reported from 8.13.4
Installed Plugins
No response
Java Version
bundled
OS Version
Any
Problem Description
Stack trace seen from a customer environment:
We still need to reproduce it, but this will happen whenever AsyncSearchTask.onListShards (which initializes the MutableSearchResponse in that class) is not called before AsyncSearchTask.onClusterResponseMinimizeRoundtrips. At a minimum, we need to add some defensive programming in the latter method. Ideally, we will find out why onListShards is not getting called (or perhaps not succeeding) in this case and fix the underlying issue.
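As a rough idea of what the defensive check could look like, here is a minimal sketch in plain Java. It is not the actual AsyncSearchTask code: GuardedAsyncTask, its pending buffer and the results() accessor are hypothetical, and a List<String> stands in for MutableSearchResponse. Buffering early cluster results is only one possible policy; dropping them or failing the task with a clear error are alternatives the issue does not settle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for AsyncSearchTask; names and types are illustrative only.
class GuardedAsyncTask {
    // Stand-in for the MutableSearchResponse field; null until onListShards() runs.
    private List<String> searchResponse;
    // Cluster results that arrived before initialisation (assumed policy: buffer them).
    private final List<String> pending = new ArrayList<>();

    synchronized void onListShards() {
        searchResponse = new ArrayList<>(pending); // replay early arrivals
        pending.clear();
    }

    synchronized void onClusterResponse(String clusterResult) {
        if (searchResponse == null) {
            // Defensive guard: the response is not initialised yet, so do not dereference it.
            pending.add(clusterResult);
            return;
        }
        searchResponse.add(clusterResult);
    }

    // Snapshot of the results recorded so far; empty before initialisation.
    synchronized List<String> results() {
        return searchResponse == null ? List.of() : List.copyOf(searchResponse);
    }
}
```

A guard like this only removes the NPE. It does not explain why initialisation was late or missing, so the root-cause investigation is still needed.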
Steps to Reproduce
Not yet known.
Logs (if relevant)
No response