elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.6k stars 24.63k forks

Doc: Clarify split index disk usage requirements #88190

Open ppf2 opened 2 years ago

ppf2 commented 2 years ago

https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-split-index.html

On the disk usage requirement, our current documentation only says:

The node handling the split process must have sufficient free disk space to accommodate a second copy of the existing index.

But the target index can use more disk space than the original index. Splitting essentially clones the shards and then deletes documents from them, so it leaves behind deleted documents for the Lucene merge process to clean up. Until those segments are merged, either organically over time (assuming the index continues to receive indexing activity) or via a force merge, the target index can take up more space than the original.

Also, splitting into a large number of shards adds disk space overhead, which grows with the number of shards, in order to accommodate shared structures such as the term dictionary across the shards.

While it may be difficult to provide a formula for exactly how much more disk space is required, it would be helpful to document the above caveats. And if there's a ballpark estimate we can give to those who just want to be on the safe side, that would be helpful as well. For example, if we recommend having 3 times the space of the original index, will that be enough to accommodate all the unmerged segments and the additional overhead?
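To make the "safe side" idea concrete, here is a minimal sketch of the kind of pre-split check I have in mind. It is only an illustration: the function names are hypothetical, and the default multiplier of 3 is the figure floated above as a question, not a confirmed recommendation. The sizes would come from wherever you already read index and disk stats.

```python
import math


def required_split_bytes(source_index_bytes: int, safety_factor: float = 3.0) -> int:
    """Estimate the free space to reserve before splitting an index.

    safety_factor is an ASSUMPTION (the unconfirmed "3 times" guess from
    this issue), not a documented Elasticsearch value.
    """
    if source_index_bytes < 0:
        raise ValueError("source_index_bytes must be non-negative")
    if safety_factor < 1.0:
        # The docs already ask for at least one full second copy.
        raise ValueError("safety_factor must be at least 1.0")
    return math.ceil(source_index_bytes * safety_factor)


def has_split_headroom(source_index_bytes: int, free_bytes: int,
                       safety_factor: float = 3.0) -> bool:
    """Return True if free_bytes covers the estimated requirement."""
    if free_bytes < 0:
        raise ValueError("free_bytes must be non-negative")
    return free_bytes >= required_split_bytes(source_index_bytes, safety_factor)


if __name__ == "__main__":
    gib = 1024 ** 3
    # Hypothetical numbers: a 50 GiB index on a node with 120 GiB free.
    print(has_split_headroom(50 * gib, 120 * gib))
```

With the default multiplier, the example node would fail the check (150 GiB needed, 120 GiB free); passing `safety_factor=2.0` would let it through, which is exactly the trade-off the docs would need to call out.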

Thanks!

elasticmachine commented 2 years ago

Pinging @elastic/es-docs (Team:Docs)

elasticmachine commented 2 years ago

Pinging @elastic/es-distributed (Team:Distributed)

ppf2 commented 2 years ago

This is related to the undocumented quota-aware file system limitation of the Split Index API in Elasticsearch: https://github.com/elastic/elasticsearch/pull/88822

DaveCTurner commented 2 years ago

I've moved this to the allocation area because I think it'd be better to make these checks automatically rather than to simply document some complex formula that users may or may not heed.