Fix HTTP response code for cluster health API

DaveCTurner commented 3 years ago

Today GET _cluster/health returns 408 Request timeout if it times out before the desired condition is reached. This response code is to indicate that the server timed out waiting for a request from the client, so it isn't appropriate here at all. There's no great fit for a server-side timeout response code, I suggest 503.

This API already returns 503 Service unavailable if there is no master, after waiting a while, but immediately returns 200 OK if there is a master but STATE_NOT_RECOVERED_BLOCK is in place. I think we should wait and return 503 in the presence of the STATE_NOT_RECOVERED_BLOCK too.

elasticmachine commented 3 years ago

Pinging @elastic/es-distributed (Team:Distributed)

Mpdreamz commented 3 years ago

cc @elastic/clients-team :wave:

DaveCTurner commented 3 years ago

We (@elastic/es-distributed) discussed this in our team sync yesterday. HTTP lacks a usefully descriptive response code for this case (see also #35582). 408 is definitely wrong, but 503 is also inappropriate since the service was available and able to handle the request. We decided to move to 200 since the request was handled as desired, and the response body contains valid information even if the requested state was not reached within the time limit. This change will bring this endpoint in line with bulk indexing and other stats and monitoring endpoints that return 200 if the coordination of the request was successful even if it there were some subsidiary failures during processing.

Future clients will check the timed_out field of the response to determine whether to retry or not. Recognising that today's clients may be relying on the HTTP response code we'll treat this bugfix as a breaking change with the usual deprecation warnings etc. We will also introduce a system property that lets users opt into the future behaviour before it becomes mandatory.

loretoparisi commented 2 years ago

I'm recently facing this error again

error {
  "status": 408,
  "displayName": "RequestTimeout",
  "message": "Request Timeout after 30000ms"
}

j-bennet commented 2 years ago

Looks like the fix was reverted, and as of 8.2.2, 408 is still returned:

> curl -D - -k -u elastic:changeme "https://localhost:9200/_cluster/health?wait_for_nodes=2&timeout=1ms&pretty"
HTTP/1.1 408 Request Timeout
X-elastic-product: Elasticsearch
content-type: application/json
content-length: 465

{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : true,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 2,
  "active_shards" : 2,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Elasticsearch version:

> curl -k -u elastic:changeme "https://localhost:9200"
{
  "name" : "eli.local",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "iYa9R7SbRmqbW1bwWGs1aA",
  "version" : {
    "number" : "8.2.2",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "9876968ef3c745186b94fdabd4483e01499224ef",
    "build_date" : "2022-05-25T15:47:06.259735307Z",
    "build_snapshot" : false,
    "lucene_version" : "9.1.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}

What happened?

DaveCTurner commented 2 years ago

Ah we forgot to reopen this, sorry.

elastic / elasticsearch

Fix HTTP response code for cluster health API #70849