Closed: balusarakesh closed this issue 3 years ago
I am so sorry to be the bearer of bad news, but this cannot work in the way you desire and describe. That is not because of Curator, but because of how Elasticsearch works with snapshot repositories.
Inside the repository's root path you will find an `indices` folder, along with files named in patterns like `meta-*` and `snap-*`. Within the `indices` folder will be multiple paths named for the canonical index names known in the cluster metadata, and beneath these will be segment data and more metadata.
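If it helps to see this layout for yourself, one quick way is to list the repository's base path in S3, for example with boto3. This is only a sketch; the bucket name and base path below are hypothetical placeholders.

```python
# Sketch: list the top-level entries of a snapshot repository in S3 to see the
# layout described above (an indices/ prefix plus meta-* / snap-* files).
# Bucket name and base_path are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="my-snapshot-bucket",   # hypothetical bucket
    Prefix="backups/",             # hypothetical repository base_path
    Delimiter="/",
)

# "Sub-folders" such as backups/indices/ show up as common prefixes.
for prefix in resp.get("CommonPrefixes", []):
    print(prefix["Prefix"])

# Top-level files such as meta-*.dat and snap-*.dat show up as objects.
for obj in resp.get("Contents", []):
    print(obj["Key"])
```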
All of this means that you cannot break a given snapshot down by file, or index by index into separate files. It cannot be done, at least not without restoring the indices and then re-snapshotting them into a different repository. Your massive, multi-hundred-terabyte repository would need to be restored and re-created by degrees to create what you desire.
My recommendation is to start creating a repository per day, or week, or month, and then snapshot into that until the chosen time period is met or filled. After that, feel free to archive that entire path into Glacier or whatever else you might want to use. If you want to do the uncomfortable process of restore, reindex, and re-snapshot into smaller buckets, hopefully Curator can help.
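As a rough illustration of that recommendation, here is a minimal sketch using the raw Elasticsearch snapshot API via the elasticsearch-py 7.x client, not a Curator feature: register a dated repository for the current month and snapshot into it until the month rolls over. The bucket, base path, and naming scheme are assumptions for illustration.

```python
# Sketch of the "repository per month" idea: register a dated S3 repository and
# snapshot into it until the period rolls over. Written against the
# elasticsearch-py 7.x client; bucket, paths, and naming are hypothetical.
from datetime import date
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

period = date.today().strftime("%Y-%m")      # e.g. "2021-10"
repo_name = f"s3_backups-{period}"

# Register (or re-register) the repository for the current month.
es.snapshot.create_repository(
    repository=repo_name,
    body={
        "type": "s3",
        "settings": {
            "bucket": "my-snapshot-bucket",   # hypothetical
            "base_path": f"backups/{period}", # one S3 "folder" per month
        },
    },
)

# Take today's snapshot into the current month's repository.
es.snapshot.create(
    repository=repo_name,
    snapshot=f"snapshot-{date.today().isoformat()}",
    body={"indices": "*", "include_global_state": False},
    wait_for_completion=False,
)
```

Once a month's path is no longer being written to, the whole prefix can be archived to Glacier in one piece.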
A point to consider as you work towards very long-term cold/archival storage of snapshot data:
Elasticsearch updates the minimum acceptable read and write Lucene versions with each major release. We guarantee that each major version is capable of at least reading indices created by the previous major version.
This major version -1 pattern will continue for the foreseeable future. To put this into clearer focus, if you have Elasticsearch 6.x indices in Glacier, rehydrate them to S3, and then try to restore them into Elasticsearch 8.x, it will fail, because major version -1 means a 6.x index can only be restored into Elasticsearch 6 or 7. We understand the difficulty this creates for long-term archivers. At this point you can restore a 6.x index into ES 7 and then reindex it so it can be read by ES 8, and/or "upgrade" it by reindexing by degrees until the Lucene version is high enough to be read by the current major release. This will be a manual process for now, but I hope to build a way to do this into Curator, extending its usefulness as ILM and SLM become the de facto standard for index and snapshot management.
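To make that manual process concrete, here is a hedged sketch of the restore, reindex, and re-snapshot loop using the elasticsearch-py 7.x client. The repository, snapshot, and index names are placeholders, and it assumes you are running it against a cluster (ES 7 in this example) that can still read the old snapshots.

```python
# Sketch of the manual "restore, reindex, re-snapshot" upgrade path described
# above, against a hypothetical ES 7 cluster that can still read 6.x snapshots.
# Repository, snapshot, and index names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Restore the old index from the 6.x-era snapshot.
es.snapshot.restore(
    repository="s3_backups",
    snapshot="snapshot-2021-10-14",
    body={"indices": "nginx-2021-10-14", "include_global_state": False},
    wait_for_completion=True,
)

# 2. Reindex it so the new copy is written with the current Lucene version.
es.reindex(
    body={
        "source": {"index": "nginx-2021-10-14"},
        "dest": {"index": "nginx-2021-10-14-reindexed"},
    },
    wait_for_completion=True,
)

# 3. Snapshot the reindexed copy into a new repository so the archived data
#    stays readable by the next major release.
es.snapshot.create(
    repository="s3_backups_v7",
    snapshot="nginx-2021-10-14-reindexed",
    body={"indices": "nginx-2021-10-14-reindexed", "include_global_state": False},
    wait_for_completion=True,
)
```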
@unexpectedack Thank you for the reply; this makes sense.
FYI, in case anyone else wants to know: we used a very simple Python script (for proprietary reasons I'm not allowed to share the code here) which creates a new snapshot repo for every index under a specific path (a `daily` folder and a `monthly` folder), and those folders are transitioned to GLACIER after 30 days.
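For anyone looking for a starting point, here is a minimal sketch of that general idea (not the proprietary script): one snapshot repository per index, each with its own `base_path`, written against the elasticsearch-py 7.x client. The bucket name, path layout, and naming scheme are assumptions for illustration.

```python
# Minimal sketch of the general idea (not the proprietary script): give every
# index its own snapshot repository, each pointing at its own S3 "folder", so a
# single index can later be restored without pulling the whole repository back
# from Glacier. Bucket and path names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

for index in es.cat.indices(format="json"):
    name = index["index"]
    if name.startswith("."):   # skip system indices
        continue

    # One repository per index, e.g. s3://my-snapshot-bucket/per-index/nginx-2021-10-14/
    es.snapshot.create_repository(
        repository=name,
        body={
            "type": "s3",
            "settings": {
                "bucket": "my-snapshot-bucket",
                "base_path": f"per-index/{name}",
            },
        },
    )

    es.snapshot.create(
        repository=name,
        snapshot=name,
        body={"indices": name, "include_global_state": False},
        wait_for_completion=True,
    )
```

The transition to GLACIER (e.g. after 30 days) would then be handled by an S3 lifecycle rule on each prefix rather than from the script itself.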
Problem statement:
- We snapshot all of our indices into a single S3 folder called `backups`. This folder is very huge now, almost a few hundred TeraBytes.
- The snapshot repository (`s3_backups`) associated with this `backups` folder is also so huge that it takes a few minutes to load.
- The S3 folder is currently on `infrequent access`; we want to move this to glacier or deep archive, BUT if we ever want to retrieve a backup we have to retrieve the FULL `backups` folder, which is not very cost-effective.

What we would like to do:
- Move the existing backups from the `s3_backups` repository to their individual folders in S3. Example: backups for index `nginx-2021-10-14` will be put in a new folder named `nginx-2021-10-14` in a snapshot repository called `nginx-2021-10-14`.
- Move these individual folders to `glacier` or `deep archive`, and if we ever want to retrieve them we can do so by retrieving a specific folder; we don't have to retrieve the whole `s3_backups` folder.

I'm sure we ourselves can come up with a script that can create individual snapshot repos for each index, but what we would like to know is how we can fix the existing `s3_backups` repository, which holds the indices for the last one year.