Closed: balusarakesh closed this issue 3 years ago
I am so sorry to be the bearer of bad news, but this cannot work in the way you desire and describe. That is not because of Curator, but because of how Elasticsearch works with snapshot repositories.
Inside the repository's root path you will find an `indices` folder, along with files named in patterns like `meta-*` and `snap-*`. Within the `indices` folder will be multiple paths named for the canonical index names known in the cluster metadata, and beneath these will be segment data and more metadata.
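If it helps to see this layout for yourself, one quick way is to list the repository's base path in S3, for example with boto3. This is only a sketch; the bucket name and base path below are hypothetical placeholders.

```python
# Sketch: list the top-level entries of a snapshot repository in S3 to see the
# layout described above (an indices/ prefix plus meta-* / snap-* files).
# Bucket name and base_path are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="my-snapshot-bucket",   # hypothetical bucket
    Prefix="backups/",             # hypothetical repository base_path
    Delimiter="/",
)

# "Sub-folders" such as backups/indices/ show up as common prefixes.
for prefix in resp.get("CommonPrefixes", []):
    print(prefix["Prefix"])

# Top-level files such as meta-*.dat and snap-*.dat show up as objects.
for obj in resp.get("Contents", []):
    print(obj["Key"])
```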
All of this means that you cannot break a given snapshot down by file, or index by index into separate files. It cannot be done, at least not without restoring the indices and then re-snapshotting them into a different repository. Your massive, multi-hundred-terabyte repository would need to be restored and re-created by degrees to create what you desire.
My recommendation is to start creating a repository per day, or week, or month, and then snapshot into that until the chosen time period is met or filled. After that, feel free to archive that entire path into Glacier or whatever else you might want to use. If you want to do the uncomfortable process of restore, reindex, and re-snapshot into smaller buckets, hopefully Curator can help.
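As a rough illustration of that recommendation, here is a minimal sketch using the raw Elasticsearch snapshot API via the elasticsearch-py 7.x client, not a Curator feature: register a dated repository for the current month and snapshot into it until the month rolls over. The bucket, base path, and naming scheme are assumptions for illustration.

```python
# Sketch of the "repository per month" idea: register a dated S3 repository and
# snapshot into it until the period rolls over. Written against the
# elasticsearch-py 7.x client; bucket, paths, and naming are hypothetical.
from datetime import date
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

period = date.today().strftime("%Y-%m")      # e.g. "2021-10"
repo_name = f"s3_backups-{period}"

# Register (or re-register) the repository for the current month.
es.snapshot.create_repository(
    repository=repo_name,
    body={
        "type": "s3",
        "settings": {
            "bucket": "my-snapshot-bucket",   # hypothetical
            "base_path": f"backups/{period}", # one S3 "folder" per month
        },
    },
)

# Take today's snapshot into the current month's repository.
es.snapshot.create(
    repository=repo_name,
    snapshot=f"snapshot-{date.today().isoformat()}",
    body={"indices": "*", "include_global_state": False},
    wait_for_completion=False,
)
```

Once a month's path is no longer being written to, the whole prefix can be archived to Glacier in one piece.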
A point to consider as you work towards very long-term cold/archival storage of snapshot data:
Elasticsearch updates the minimum acceptable read and write Lucene versions with each major release. We guarantee that each major version is capable of at least reading indices created by the previous major version.
This major version -1 pattern will continue for the foreseeable future. To put this into clearer focus, if you have Elasticsearch 6.x indices in Glacier, rehydrate them to S3, and then try to restore them into Elasticsearch 8.x, it will fail, because major version -1 means a 6.x index can only be restored into Elasticsearch 6 or 7. We understand the difficulty this creates for long-term archivers. At this point you can restore a 6.x index into ES 7 and then reindex it so it can be read by ES 8, and/or "upgrade" it by reindexing by degrees until the Lucene version is high enough to be read by the current major release. This will be a manual process for now, but I hope to build a way to do this into Curator, extending its usefulness as ILM and SLM become the de facto standard for index and snapshot management.
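To make that manual process concrete, here is a hedged sketch of the restore, reindex, and re-snapshot loop using the elasticsearch-py 7.x client. The repository, snapshot, and index names are placeholders, and it assumes you are running it against a cluster (ES 7 in this example) that can still read the old snapshots.

```python
# Sketch of the manual "restore, reindex, re-snapshot" upgrade path described
# above, against a hypothetical ES 7 cluster that can still read 6.x snapshots.
# Repository, snapshot, and index names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Restore the old index from the 6.x-era snapshot.
es.snapshot.restore(
    repository="s3_backups",
    snapshot="snapshot-2021-10-14",
    body={"indices": "nginx-2021-10-14", "include_global_state": False},
    wait_for_completion=True,
)

# 2. Reindex it so the new copy is written with the current Lucene version.
es.reindex(
    body={
        "source": {"index": "nginx-2021-10-14"},
        "dest": {"index": "nginx-2021-10-14-reindexed"},
    },
    wait_for_completion=True,
)

# 3. Snapshot the reindexed copy into a new repository so the archived data
#    stays readable by the next major release.
es.snapshot.create(
    repository="s3_backups_v7",
    snapshot="nginx-2021-10-14-reindexed",
    body={"indices": "nginx-2021-10-14-reindexed", "include_global_state": False},
    wait_for_completion=True,
)
```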
@unexpectedack Thank you for the reply; this makes sense.
FYI, in case anyone else wants to know: we used a very simple Python script (for proprietary reasons I'm not allowed to share the code here) which creates a new snapshot repo for every index under a specific path (a `daily` folder and a `monthly` folder), and those folders are transitioned to GLACIER after 30 days.
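For anyone looking for a starting point, here is a minimal sketch of that general idea (not the proprietary script): one snapshot repository per index, each with its own `base_path`, written against the elasticsearch-py 7.x client. The bucket name, path layout, and naming scheme are assumptions for illustration.

```python
# Minimal sketch of the general idea (not the proprietary script): give every
# index its own snapshot repository, each pointing at its own S3 "folder", so a
# single index can later be restored without pulling the whole repository back
# from Glacier. Bucket and path names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

for index in es.cat.indices(format="json"):
    name = index["index"]
    if name.startswith("."):   # skip system indices
        continue

    # One repository per index, e.g. s3://my-snapshot-bucket/per-index/nginx-2021-10-14/
    es.snapshot.create_repository(
        repository=name,
        body={
            "type": "s3",
            "settings": {
                "bucket": "my-snapshot-bucket",
                "base_path": f"per-index/{name}",
            },
        },
    )

    es.snapshot.create(
        repository=name,
        snapshot=name,
        body={"indices": name, "include_global_state": False},
        wait_for_completion=True,
    )
```

The transition to GLACIER (e.g. after 30 days) would then be handled by an S3 lifecycle rule on each prefix rather than from the script itself.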
Problem statement:
- We snapshot all of our indices into a single S3 folder called `backups`. This folder is very huge now, almost a few hundred TeraBytes.
- The snapshot repository (`s3_backups`) associated with this `backups` folder is also so huge that it takes a few minutes to load.
- The S3 folder is currently on `infrequent access`; we want to move this to glacier or deep archive, BUT if we ever want to retrieve a backup we have to retrieve the FULL `backups` folder, which is not very cost-effective.

What we would like to do:
- Move the existing backups from the `s3_backups` repository to their individual folders in S3. Example: backups for index `nginx-2021-10-14` will be put in a new folder named `nginx-2021-10-14` in a snapshot repository called `nginx-2021-10-14`.
- Move these individual folders to `glacier` or `deep archive`, and if we ever want to retrieve them we can do so by retrieving a specific folder; we don't have to retrieve the whole `s3_backups` folder.

I'm sure we ourselves can come up with a script that can create individual snapshot repos for each index, but what we would like to know is how we can fix the existing `s3_backups` repository, which holds the indices for the last one year.