dandi / dandi-infrastructure

A repository to collect docs/issues on DANDI project infrastructure
Apache License 2.0

GC old inventory listings #198

Open yarikoptic opened 1 week ago

yarikoptic commented 1 week ago

As "discovered" in

we might not really need historical records of inventory to achieve a "full backup" of S3. The inventory dumps themselves are quite large! I am still fetching them (to facilitate analysis etc., but I might stop doing that) and have fetched 14TB so far, which is a notable amount of storage. Here is how the size of a single day's dump grew through the years (sampled on January 1st of each year):

(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandiarchive-inventory$ code/print-manifest-summary dump/202*-01-01T*/manifest.json
dump/2020-01-01T00-00Z/manifest.json : 1 entries,   197K total size
dump/2021-01-01T00-00Z/manifest.json : 1 entries,    3.8M total size
dump/2022-01-01T00-00Z/manifest.json : 1 entries,      17M total size
dump/2023-01-01T01-00Z/manifest.json : 384 entries,         36G total size
dump/2024-01-01T01-00Z/manifest.json : 406 entries,         38G total size
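
For anyone wanting to reproduce such a summary without the `code/print-manifest-summary` helper, here is a minimal sketch. It is not that script (its implementation is not shown here); it only assumes the standard S3 Inventory `manifest.json` layout, where a top-level `files` list holds one object per listing file with a numeric `size` in bytes. The function names `summarize_manifest` and `human_size` are made up for illustration, and the output format only approximates the one above.

```python
import json
import sys
from pathlib import Path


def summarize_manifest(path):
    """Return (number of listing files, total bytes) for one manifest.json.

    Assumes the S3 Inventory manifest layout: a top-level "files" list
    whose items each carry a "size" in bytes.
    """
    manifest = json.loads(Path(path).read_text())
    files = manifest.get("files", [])
    return len(files), sum(int(f["size"]) for f in files)


def human_size(nbytes):
    """Format a byte count with binary units (1K = 1024 bytes)."""
    size = float(nbytes)
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024


if __name__ == "__main__":
    # Usage: python summarize.py dump/202*-01-01T*/manifest.json
    for p in sys.argv[1:]:
        n, total = summarize_manifest(p)
        print(f"{p} : {n} entries, {human_size(total)} total size")
```

Whether the sizes reported above are binary (GiB) or decimal (GB) is not stated, so the units in this sketch may differ slightly from the original output.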

and this year it grew to 39G per day(!), which would amount to about 14TB per year just for the dumps (so I expect to end up fetching about 40TB... maybe I should interrupt and fetch only specific days and their data).
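
A quick back-of-the-envelope check of those numbers, as a sketch only. It assumes decimal units (1TB = 1000GB), 365 dumps per year, and that every day in a year is as large as the January 1st sample, none of which is guaranteed. The earlier years are ignored since their dumps are only kilobytes to megabytes.

```python
def yearly_tb(gb_per_day, days=365):
    """Storage for one year of daily dumps, in decimal TB."""
    return gb_per_day * days / 1000


# Daily dump sizes in GB, taken from the figures quoted above.
daily_gb = {"2023": 36, "2024": 38, "this year": 39}

for label, gb in daily_gb.items():
    print(f"{label}: {yearly_tb(gb):.1f} TB per year")

# Rough total if all three years were fetched in full.
total_tb = sum(yearly_tb(gb) for gb in daily_gb.values())
print(f"total: {total_tb:.1f} TB")
```

Under these assumptions the current rate gives about 14.2TB per year and the three large years together about 41TB, which is in line with the 14TB and 40TB estimates.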

This is mostly due to all the zarrs. Either way, we might want to prune some old inventory listings soonish. (attn @satra, with whom we briefly discussed doing some bucket GC)
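
To make the pruning idea concrete, here is a hedged sketch of how candidates could be selected. The retention policy is purely hypothetical and has not been decided anywhere in this issue: keep every listing from the last `keep_recent_days` days, and for anything older keep only the first listing of each month. It assumes listing names follow the `YYYY-MM-DDTHH-MMZ` pattern seen in the dump paths above, and it only returns names; it deletes nothing.

```python
from datetime import datetime, timedelta

# Timestamp pattern of the dump directories, e.g. "2023-01-01T01-00Z".
STAMP_FORMAT = "%Y-%m-%dT%H-%MZ"


def prune_candidates(names, now, keep_recent_days=90):
    """Return listing names that the hypothetical policy would prune.

    Keeps everything newer than `keep_recent_days`, and for older
    listings keeps only the earliest one in each calendar month.
    """
    cutoff = now - timedelta(days=keep_recent_days)
    kept_months = set()
    prune = []
    # Lexicographic order equals chronological order for this format.
    for name in sorted(names):
        stamp = datetime.strptime(name, STAMP_FORMAT)
        if stamp >= cutoff:
            continue  # recent listing: always kept
        month = (stamp.year, stamp.month)
        if month in kept_months:
            prune.append(name)  # a listing for this month is already kept
        else:
            kept_months.add(month)
    return prune


example = [
    "2024-01-01T01-00Z",
    "2024-01-02T01-00Z",
    "2024-02-01T01-00Z",
    "2024-12-31T01-00Z",
]
print(prune_candidates(example, now=datetime(2025, 1, 15)))
```

In this example only `2024-01-02T01-00Z` is reported: the two first-of-month listings are kept, and the December one falls inside the 90-day window.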