elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
1.15k stars 24.84k forks source link

Inference request task timeout should not be 503 #112828

Open ChrisHegarty opened 2 months ago

ChrisHegarty commented 2 months ago

Timeouts on the inference service should not result in a 503. The rest suppressed logger reported this.

Interestingly the timeout is 10s, while the default is not 30s ?

"error.stack_trace": "org.elasticsearch.ElasticsearchTimeoutException: Request timed out waiting to be sent after [10s]
  at org.elasticsearch.inference@8.16.0/org.elasticsearch.xpack.inference.external.http.sender.RequestTask.lambda$getListener$2(RequestTask.java:67)
 at org.elasticsearch.server@8.16.0/org.elasticsearch.action.support.ListenerTimeouts$TimeoutableListener.run(ListenerTimeouts.java:102)
 at org.elasticsearch.server@8.16.0/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:924)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
 at java.base/java.lang.Thread.run(Thread.java:1570)\n",
elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)

javanna commented 2 months ago

I marked this Search relevance to check the question about the timeout and also because the exception is thrown in the inference related code. Given a cause is not provided, the status code will be 503, which is trappy. We have recently made a similar change in some of the search code and moved to 504 instead, see SearchTimeoutException. We had a bit of back and forth and decided not to use 4xx for search timeout errors, not sure what this specific timeout exception means though in this context.

pmpailis commented 1 month ago

IIUC one place where we set the timeout to 10s is the semantic_text rewriting (SemanticQueryBuilder#doRewriteGetInferenceResults

InferenceAction.Request inferenceRequest = new InferenceAction.Request(
                TaskType.ANY,
                inferenceId,
                null,
                List.of(query),
                Map.of(),
                InputType.SEARCH,
                InferModelAction.Request.DEFAULT_TIMEOUT_FOR_API,
                false
            );

but I can't trace the call from just the logged message. Maybe the ML/Inference team might be in a better position to take a look at this (and change the exception if needed)

maxhniebergall commented 1 month ago

Looking into this a bit, we don't have a cause for this exception because it's directly at the source of the timeout, there isn't any other underlying exception that we've caught at that point. I'm happy to change the exception, but I am not sure what would be better than this. Seems like we should consider just increasing the timeout above 10s?

see https://github.com/elastic/elasticsearch/issues/114419

elasticsearchmachine commented 1 month ago

Pinging @elastic/ml-core (Team:ML)