Open DaveCTurner opened 3 years ago
Pinging @elastic/es-distributed (Team:Distributed)
cc @elastic/clients-team :wave:
We (@elastic/es-distributed) discussed this in our team sync yesterday. HTTP lacks a usefully descriptive response code for this case (see also #35582). 408 is definitely wrong, but 503 is also inappropriate since the service was available and able to handle the request. We decided to move to 200 since the request was handled as desired, and the response body contains valid information even if the requested state was not reached within the time limit. This change will bring this endpoint in line with bulk indexing and other stats and monitoring endpoints that return 200 if the coordination of the request was successful even if it there were some subsidiary failures during processing.
Future clients will check the timed_out
field of the response to determine whether to retry or not. Recognising that today's clients may be relying on the HTTP response code we'll treat this bugfix as a breaking change with the usual deprecation warnings etc. We will also introduce a system property that lets users opt into the future behaviour before it becomes mandatory.
I'm recently facing this error again
error {
"status": 408,
"displayName": "RequestTimeout",
"message": "Request Timeout after 30000ms"
}
Looks like the fix was reverted, and as of 8.2.2, 408 is still returned:
> curl -D - -k -u elastic:changeme "https://localhost:9200/_cluster/health?wait_for_nodes=2&timeout=1ms&pretty"
HTTP/1.1 408 Request Timeout
X-elastic-product: Elasticsearch
content-type: application/json
content-length: 465
{
"cluster_name" : "elasticsearch",
"status" : "green",
"timed_out" : true,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 2,
"active_shards" : 2,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
Elasticsearch version:
> curl -k -u elastic:changeme "https://localhost:9200"
{
"name" : "eli.local",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "iYa9R7SbRmqbW1bwWGs1aA",
"version" : {
"number" : "8.2.2",
"build_flavor" : "default",
"build_type" : "tar",
"build_hash" : "9876968ef3c745186b94fdabd4483e01499224ef",
"build_date" : "2022-05-25T15:47:06.259735307Z",
"build_snapshot" : false,
"lucene_version" : "9.1.0",
"minimum_wire_compatibility_version" : "7.17.0",
"minimum_index_compatibility_version" : "7.0.0"
},
"tagline" : "You Know, for Search"
}
What happened?
Ah we forgot to reopen this, sorry.
Today
GET _cluster/health
returns408 Request timeout
if it times out before the desired condition is reached. This response code is to indicate that the server timed out waiting for a request from the client, so it isn't appropriate here at all. There's no great fit for a server-side timeout response code, I suggest503
.This API already returns
503 Service unavailable
if there is no master, after waiting a while, but immediately returns200 OK
if there is a master butSTATE_NOT_RECOVERED_BLOCK
is in place. I think we should wait and return503
in the presence of theSTATE_NOT_RECOVERED_BLOCK
too.