elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Async search can throw NPE from onClusterResponseMinimizeRoundtrips #116341

Open quux00 opened 3 weeks ago

quux00 commented 3 weeks ago

Elasticsearch Version

Reported from 8.13.4

Installed Plugins

No response

Java Version

bundled

OS Version

Any

Problem Description

Stack trace seen from a customer environment:

[2024-10-23T13:34:32,289][WARN ][o.e.a.s.SearchProgressListener] [data-2] [ibprod_azure] Failed to execute progress listener onResponseMinimizeRoundtrips
java.lang.NullPointerException: Cannot invoke "org.elasticsearch.xpack.search.MutableSearchResponse.updateResponseMinimizeRoundtrips(String, org.elasticsearch.action.search.SearchResponse)" because the return value of "org.apache.lucene.util.SetOnce.get()" is null
at org.elasticsearch.xpack.search.AsyncSearchTask$Listener.onClusterResponseMinimizeRoundtrips(AsyncSearchTask.java:512) ~[?:?]
at org.elasticsearch.action.search.SearchProgressListener.notifyClusterResponseMinimizeRoundtrips(SearchProgressListener.java:181) ~[elasticsearch-8.13.4.jar:?]
at org.elasticsearch.action.search.TransportSearchAction$3.innerOnResponse(TransportSearchAction.java:781) ~[elasticsearch-8.13.4.jar:?]
at org.elasticsearch.action.search.TransportSearchAction$3.innerOnResponse(TransportSearchAction.java:776) ~[elasticsearch-8.13.4.jar:?]
at org.elasticsearch.action.search.TransportSearchAction$CCSActionListener.onResponse(TransportSearchAction.java:1487) ~[elasticsearch-8.13.4.jar:?]
at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:48) ~[elasticsearch-8.13.4.jar:?]
at org.elasticsearch.transport.TransportService$UnregisterChildTransportResponseHandler.handleResponse(TransportService.java:1742) ~[elasticsearch-8.13.4.jar:?]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1465) ~[elasticsearch-8.13.4.jar:?]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1465) ~[elasticsearch-8.13.4.jar:?]
at org.elasticsearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:433) ~[elasticsearch-8.13.4.jar:?]
at org.elasticsearch.transport.InboundHandler$2.doRun(InboundHandler.java:390) ~[elasticsearch-8.13.4.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.13.4.jar:?]
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33) ~[elasticsearch-8.13.4.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.13.4.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.13.4.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]

We still need to reproduce this, but it will happen whenever AsyncSearchTask.onListShards (which initializes the MutableSearchResponse in that class) is not called before AsyncSearchTask.onClusterResponseMinimizeRoundtrips. At a minimum, we need to add some defensive programming to the latter method. Ideally, we will also find out why onListShards is not being called (or perhaps not succeeding) in this case and fix the underlying issue.
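As a rough illustration of the defensive check mentioned above, here is a minimal, self-contained sketch. It is not the actual Elasticsearch code: `GuardedListenerSketch`, `onClusterResponse`, and `recorded` are hypothetical names, an `AtomicReference<StringBuilder>` stands in for the `SetOnce<MutableSearchResponse>` field, and silently skipping the update is only one possible policy.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the async search listener; not the real Elasticsearch classes.
public class GuardedListenerSketch {
    // Plays the role of the SetOnce<MutableSearchResponse> field: null until onListShards runs.
    private final AtomicReference<StringBuilder> searchResponse = new AtomicReference<>();

    // Mirrors onListShards: initializes the response holder at most once.
    public void onListShards() {
        searchResponse.compareAndSet(null, new StringBuilder());
    }

    // Mirrors onClusterResponseMinimizeRoundtrips, with a null guard added.
    // Returns true if the update was applied, false if it was skipped.
    public boolean onClusterResponse(String clusterAlias) {
        StringBuilder response = searchResponse.get();
        if (response == null) {
            // Holder not initialized yet: skip the update instead of throwing an NPE.
            return false;
        }
        synchronized (response) {
            response.append(clusterAlias).append(';');
        }
        return true;
    }

    // Returns what has been recorded so far, or an empty string before initialization.
    public String recorded() {
        StringBuilder response = searchResponse.get();
        if (response == null) {
            return "";
        }
        synchronized (response) {
            return response.toString();
        }
    }
}
```

A guard like this only hides the symptom: in this sketch the skipped cluster response is lost, so the underlying ordering problem would still need to be found and fixed.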

Steps to Reproduce

Not yet known.

Logs (if relevant)

No response

elasticsearchmachine commented 3 weeks ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)

pawankartik-elastic commented 1 week ago

I've looked into this and I believe it's caused by a race condition. The NPE is thrown in AsyncSearchTask#onClusterResponseMinimizeRoundtrips() because searchResponse isn't initialised at that point (initialisation is done by AsyncSearchTask#onListShards()), and there exists no codepath where onClusterResponseMinimizeRoundtrips() is called before onListShards(). The race condition that leads to this bug can be reproduced by placing a breakpoint somewhere before control reaches onListShards() (for example, in SearchQueryThenFetchAsyncAction#SearchQueryThenFetchAsyncAction()) and resuming execution once it is hit. I'd like to believe that this bug isn't easy to hit in production and should be a rare occurrence. However, other factors can exacerbate it, so it is unclear how easily the customer is hitting it.
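The ordering problem described in this comment can be sketched in isolation. The example below is an assumption-laden toy, not the Elasticsearch code path: `InitRaceSketch` and its methods are hypothetical, an `AtomicReference<StringBuilder>` stands in for the uninitialised searchResponse, and a `CountDownLatch` plays the role of the breakpoint that delays the initialising call.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of the race; not the real Elasticsearch classes.
public class InitRaceSketch {
    // Null until onListShards runs, like the uninitialised searchResponse.
    private final AtomicReference<StringBuilder> searchResponse = new AtomicReference<>();

    void onListShards() {
        searchResponse.set(new StringBuilder());
    }

    void onClusterResponse(String clusterAlias) {
        // Unguarded dereference, analogous to calling a method on the result of SetOnce.get().
        searchResponse.get().append(clusterAlias);
    }

    // Returns true if the unguarded callback threw a NullPointerException.
    public static boolean reproduce() {
        InitRaceSketch task = new InitRaceSketch();
        CountDownLatch breakpoint = new CountDownLatch(1);

        // The "local" thread is held before initialisation, as if paused on a breakpoint.
        Thread local = new Thread(() -> {
            try {
                breakpoint.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            task.onListShards();
        });
        local.start();

        boolean sawNpe = false;
        try {
            // The remote cluster's response arrives while initialisation is still pending.
            task.onClusterResponse("remote1");
        } catch (NullPointerException e) {
            sawNpe = true;
        } finally {
            // Release the "breakpoint" so the delayed initialisation can proceed.
            breakpoint.countDown();
        }

        try {
            local.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sawNpe;
    }
}
```

Because the latch is released only after the callback has run, the failure is deterministic here; in production the same ordering would depend on timing.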