elastic/curator

Curator: Tending your Elasticsearch indices

Break existing snapshot repo in S3 into multiple folders with one index per folder #1625

Closed: balusarakesh closed this issue 3 years ago

balusarakesh commented 3 years ago

Problem statement:

We have an existing snapshot repository in S3 that has grown very large (hundreds of terabytes), with all indices under a single path.

What we would like to do:

Break the existing repository into multiple folders, with one index per folder, so individual indices can be archived independently.

untergeek commented 3 years ago

I am so sorry to be the bearer of bad news, but this cannot work the way you describe, and not because of Curator, but because of how Elasticsearch works with snapshot repositories.

Inside the repository's root path you will find an indices folder, along with files named in patterns like meta-* and snap-*. Within the indices folder will be multiple paths named for the canonical index names known in the cluster metadata, and beneath these will be segment data and more metadata.
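
Roughly, the layout looks something like this (a sketch only; the exact file names vary by Elasticsearch version):

```
<bucket>/<base_path>/
├── index-N                 # repository generation metadata
├── index.latest
├── meta-<uuid>.dat         # per-snapshot cluster metadata
├── snap-<uuid>.dat         # per-snapshot metadata
└── indices/
    ├── <index-id>/         # one folder per index in the repository
    │   ├── meta-<uuid>.dat
    │   └── 0/              # one folder per shard
    │       ├── __<segment blobs>
    │       ├── index-<N>
    │       └── snap-<uuid>.dat
    └── <index-id>/
        └── ...
```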

All of this means that you cannot break a snapshot apart at the file level, or pull individual indices out of it file by file. It cannot be done. At least, not without restoring the indices and then re-snapshotting them into a different repository. Your massive, multi-hundred terabyte repository will need to be restored and re-snapshotted by degrees to arrive at the layout you want.

My recommendation is to start creating a repository per day, or week, or month, and then snapshot into that until the chosen time period is met or filled. After that, feel free to archive that entire path into Glacier or whatever else you might want to use. If you want to do the uncomfortable process of restore, reindex, and re-snapshot into smaller buckets, hopefully Curator can help.
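
To illustrate the per-period idea, here is a minimal sketch using the elasticsearch-py client (7.x-style `body=` calls; the bucket, prefix, and repository names are placeholders, and the repository-s3 plugin is assumed to already be installed):

```python
from datetime import date

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust for your cluster

# One repository per month (e.g. "archive-2021-07"), each with its own
# base_path so the whole S3 prefix can later be moved to Glacier as a unit.
period = date.today().strftime("%Y-%m")
repo_name = f"archive-{period}"

es.snapshot.create_repository(
    repository=repo_name,
    body={
        "type": "s3",
        "settings": {
            "bucket": "my-snapshot-bucket",     # placeholder
            "base_path": f"archives/{period}",  # placeholder prefix
        },
    },
)

# Snapshot into this period's repository until the period rolls over.
es.snapshot.create(
    repository=repo_name,
    snapshot=f"snap-{date.today().isoformat()}",
    body={"indices": "*", "include_global_state": False},
    wait_for_completion=False,
)
```

Once the period is over, the whole `archives/<period>` prefix can be archived (or transitioned to Glacier) as one unit.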

A point to consider as you work towards very long-term cold/archival storage of snapshot data:

Elasticsearch updates the minimum acceptable read and write Lucene versions with each major release. We guarantee that each major version is capable of at least reading indices created by the previous major version.

This major version -1 pattern will continue for the foreseeable future. To put this into clearer focus: if you have Elasticsearch 6.x indices in Glacier, rehydrate them to S3, and then try to restore them into Elasticsearch 8.x, it will fail, because major version -1 means you can only restore an ES 6 index into Elasticsearch 6 or 7. We understand the difficulty this creates for long-term archivers.

At this point you can restore a 6.x index into ES 7 and then reindex it so it can be read by ES 8, and/or keep "updating" it by reindexing by degrees until the Lucene version is high enough to be read by the current major release. This will be a manual process for now, but I hope to build a way to do this into Curator, extending its usefulness as ILM and SLM become the de facto standard for index and snapshot management.
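
For anyone attempting that manual process, here is a rough sketch of a single hop against a 7.x cluster, again with elasticsearch-py and placeholder repository/index names: restore the 6.x-created index, then reindex it so the copy is written with 7.x Lucene segments and can later be read by 8.x:

```python
from elasticsearch import Elasticsearch

es7 = Elasticsearch("http://localhost:9200")  # a 7.x cluster (placeholder URL)

# 1) Restore the old (6.x-created) index from the rehydrated repository,
#    renaming it so it does not collide with anything in the cluster.
es7.snapshot.restore(
    repository="rehydrated-repo",   # placeholder
    snapshot="snap-2019-01-01",     # placeholder
    body={
        "indices": "old-index-000001",
        "rename_pattern": "(.+)",
        "rename_replacement": "restored-$1",
    },
    wait_for_completion=True,
)

# 2) Reindex into a new index; the copy is written with 7.x Lucene segments
#    and can therefore be read (and snapshotted) by the next major version.
es7.reindex(
    body={
        "source": {"index": "restored-old-index-000001"},
        "dest": {"index": "old-index-000001-7x"},
    },
    wait_for_completion=True,
)

# 3) Snapshot the reindexed copy into a new repository if you want it back
#    in cold storage in a format the next major version can restore.
```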

balusarakesh commented 3 years ago

@untergeek Thank you for the reply, this makes sense.

balusarakesh commented 2 years ago

FYI, in case anyone else wants to know: we used a very simple Python script (which I'm not allowed to share here for proprietary reasons) that creates a new snapshot repo for every index, each under its own path.
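
A rough, unofficial sketch of that idea (not the actual script) using elasticsearch-py, with one repository, and therefore one S3 folder, per index; bucket and prefix names are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust for your cluster

BUCKET = "my-snapshot-bucket"       # placeholder
BASE_PREFIX = "per-index-archives"  # placeholder S3 prefix

# Give every non-system index its own repository with its own base_path,
# so each index ends up in its own folder in S3.
for index in es.indices.get(index="*"):
    if index.startswith("."):
        continue  # skip system/hidden indices

    repo_name = f"archive-{index}"
    es.snapshot.create_repository(
        repository=repo_name,
        body={
            "type": "s3",
            "settings": {
                "bucket": BUCKET,
                "base_path": f"{BASE_PREFIX}/{index}",
            },
        },
    )

    es.snapshot.create(
        repository=repo_name,
        snapshot=f"{index}-snapshot",
        body={"indices": index, "include_global_state": False},
        wait_for_completion=True,
    )
```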