eyra / mono

Monorepo (used by Next Platform)
https://eyra.co
GNU Affero General Public License v3.0

Data cleanup for S3 storage #631

Open TjerkNan opened 4 months ago

TjerkNan commented 4 months ago

Is your feature request related to a problem? Please describe. We store data in public and private S3 storage. Currently we only add data and never delete data that may no longer be required / relevant. Frankly speaking, I don't even know by what heuristic we could determine which data would be eligible for cleanup.

Describe the solution you'd like Data that is eligible for cleanup is removed at some point (e.g. after 30 days or another agreed timeframe).

Describe alternatives you've considered Just don't delete any data.

Additional context Impact: storage cost is low ($0.0245 per GB), but a large volume of data can cause clutter when troubleshooting.
We have not yet decided if it's required to create backups of S3 data, which could also impact cost.
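For the time-based part of the request (removal after 30 days or another window), one mechanism AWS itself offers is an S3 lifecycle rule. The sketch below is only an illustration of that option, not a decision from this thread: the bucket name and the `cleanup=eligible` tag are hypothetical, and something would still have to tag objects as eligible in the first place.

```python
# Sketch: expire tagged objects automatically after 30 days via an S3
# lifecycle rule. Bucket name and tag are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="eyra-private-storage",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-eligible-objects",
                # only objects tagged cleanup=eligible are affected
                "Filter": {"Tag": {"Key": "cleanup", "Value": "eligible"}},
                "Status": "Enabled",
                "Expiration": {"Days": 30},  # the 30-day window from the issue
            }
        ]
    },
)
```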

emielvdveen commented 4 months ago

@rowdyvl we need a definition of data that is no longer required / relevant. I have made an effort to at least create an extra database table (content_files) that registers the file references stored in S3. Any file reference in that table that is not used in any of the other tables can be considered irrelevant. I don't think we should even consider removing data from S3 that is still referenced in a database record, even if that object is archived or marked as deleted.
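A minimal sketch of that heuristic, assuming content_files has id and s3_key columns and that a referencing table such as submissions holds a content_file_id foreign key (the column and table names besides content_files are hypothetical; the real query would list every table that can reference a file):

```python
# Sketch: find content_files rows not referenced by any other table.
# "submissions" and the column names are hypothetical placeholders.
import psycopg2

ORPHAN_QUERY = """
SELECT cf.id, cf.s3_key
FROM content_files cf
WHERE NOT EXISTS (
    SELECT 1 FROM submissions s WHERE s.content_file_id = cf.id
)
-- ...repeat a NOT EXISTS clause for every table that can reference a file
"""

def find_orphaned_files(dsn: str) -> list[tuple[int, str]]:
    """Return (id, s3_key) pairs for file references no longer used anywhere."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(ORPHAN_QUERY)
        return cur.fetchall()
```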

We could then add a cron job that searches for irrelevant file references and deletes them from S3.
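Something along these lines could be the deletion step of that job; it takes the (id, s3_key) list produced by the query sketched above, and the bucket name is again a hypothetical placeholder:

```python
# Sketch: delete orphaned objects from S3 and drop their content_files rows.
import boto3
import psycopg2

BUCKET = "eyra-private-storage"  # hypothetical bucket name

def delete_orphaned_objects(dsn: str, orphans: list[tuple[int, str]]) -> None:
    """`orphans` is the (id, s3_key) list from the orphan query above."""
    s3 = boto3.client("s3")
    # delete_objects accepts at most 1000 keys per request, so batch the list
    for start in range(0, len(orphans), 1000):
        batch = orphans[start:start + 1000]
        s3.delete_objects(
            Bucket=BUCKET,
            Delete={"Objects": [{"Key": key} for _, key in batch]},
        )
        # remove the now-dangling file references from the database as well
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(
                "DELETE FROM content_files WHERE id = ANY(%s)",
                ([file_id for file_id, _ in batch],),
            )
```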