elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Add node setting to disable all fsync #96302

Open original-brownbear opened 1 year ago

original-brownbear commented 1 year ago

We had this discussion before and it came up again today, so opening this issue to track it.

There are common use cases where Elasticsearch runs on top of ephemeral disk storage that lives only as long as the Elasticsearch process itself. Most notably, this is the case for Cloud deployments. For these deployments fsync calls are meaningless: they provide no additional safety while costing considerably in disk IO, and at times CPU as well, increasing tail latencies for indexing in particular.

I think we should offer the option to disable fsync at the node level. This wouldn't be too involved to implement, since most fsync usage is already routed through Lucene directory methods or our IOUtils, and it would improve stability and performance in disk-bound use cases that don't require fsync.
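As a rough illustration of what such a node-level switch could look like (the names `FsyncGate`, `fsyncEnabled`, and `writeAndMaybeSync` are hypothetical, not Elasticsearch code), a single flag gating the `FileChannel.force` call at each fsync site:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: one flag (standing in for a node-level setting)
// gating every fsync call site. When disabled, writes still reach the OS
// page cache; only the durability barrier is skipped.
public class FsyncGate {
    // In Elasticsearch this would come from a node setting; hardcoded here.
    static volatile boolean fsyncEnabled = true;

    static void writeAndMaybeSync(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(data));
            if (fsyncEnabled) {
                ch.force(true);  // the only fsync in this sketch; skipped when disabled
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("translog", ".tmp");
        fsyncEnabled = false;  // e.g. an ephemeral-disk deployment
        writeAndMaybeSync(f, "op".getBytes());
        System.out.println("wrote " + Files.size(f) + " bytes, fsync=" + fsyncEnabled);
    }
}
```

On ephemeral storage the skipped `force(true)` is exactly the cost being saved: the data would not survive a host loss either way.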

I don't see any reason not to offer this besides the potential for misuse and the added complexity, but maybe there are reasons we didn't cover in past discussions?

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-distributed (Team:Distributed)

Tim-Brooks commented 1 year ago

There are a few things that @henningandersen and I discussed here. One central thing is the relationship between the "translog file" and the "checkpoint file". Currently we fsync the translog file first, followed by the checkpoint file, and the checkpoint file determines the byte range in the translog file that we will use on recovery.
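That ordering can be sketched with plain NIO (the class and method names are illustrative, not the real `Translog` implementation; the checkpoint is reduced to a single length field): the translog bytes are made durable before the checkpoint that references them, so a durable checkpoint never points past durable translog data.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

// Illustrative sketch of the write ordering described above.
public class CheckpointOrdering {
    static void appendOp(Path translog, Path checkpoint, byte[] op) throws IOException {
        long newLength;
        try (FileChannel ch = FileChannel.open(translog, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ch.write(ByteBuffer.wrap(op));
            ch.force(true);              // 1. fsync the translog first
            newLength = ch.size();
        }
        // Checkpoint here is just the durable translog length, big-endian.
        ByteBuffer ckpt = ByteBuffer.allocate(Long.BYTES).putLong(0, newLength);
        try (FileChannel ch = FileChannel.open(checkpoint,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ckpt);
            ch.force(true);              // 2. then fsync the checkpoint
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("tlog");
        Path translog = dir.resolve("translog"), checkpoint = dir.resolve("checkpoint");
        appendOp(translog, checkpoint, "op1".getBytes());
        long offset = ByteBuffer.wrap(Files.readAllBytes(checkpoint)).getLong();
        System.out.println("checkpoint offset=" + offset
                + " translog size=" + Files.size(translog));
    }
}
```

Remove both `force(true)` calls and the OS is free to persist the two files in either order, which is precisely the hazard discussed next.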

If we don't fsync either file, we are concerned about the scenario where the checkpoint file is durably persisted but the translog file is not. After a restart we would then be presented with a checkpoint file that points to a partial translog file. Obviously we are specifically discussing this for environments where restarts do not occur, but our preference would still be to implement this in a way that cannot produce corruption. Our current durability options are ASYNC (in-memory) and REQUEST (full fsync durability). If we implement this in-between option of essentially "disk flush", it might be a reasonable alternative to ASYNC with more durability.

There are multiple options here (recover as many operations as possible, do not recover the translog file if the checkpoint does not align, etc.). Just wanted to put this specific concern in text.
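A minimal sketch of those two recovery options, assuming a checkpoint that records a byte offset (all names here are hypothetical, and clamping to the surviving bytes stands in for "recover as many operations as possible"; a real implementation would recover whole operations, not raw bytes):

```java
// Illustrative recovery-time check for the misalignment discussed above:
// if the durable checkpoint points past the end of the translog file,
// either reject the translog or clamp to the bytes actually present.
public class RecoveryCheck {
    // Returns the byte range to recover, or -1 to reject the translog.
    static long recoverableBytes(long checkpointOffset, long translogLength, boolean bestEffort) {
        if (checkpointOffset <= translogLength) {
            return checkpointOffset;              // normal case: checkpoint aligns
        }
        return bestEffort ? translogLength : -1;  // partial translog: clamp or reject
    }

    public static void main(String[] args) {
        // Checkpoint says 100 bytes, but only 60 translog bytes survived the restart.
        System.out.println(recoverableBytes(100, 60, true));   // best effort: clamp
        System.out.println(recoverableBytes(100, 60, false));  // strict: reject
    }
}
```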