data-skeptic / home-data-gallery

A place for people to send pull requests for interesting examples they'd like to share

Research S3 auto-expire #35

Open kylepolich opened 8 years ago

kylepolich commented 8 years ago

Our crawling system uses S3 for storage.

S3 offers some functionality to have objects expire. Our code sets expiration times for all objects, but it enforces these manually.

We are using the Python boto library to interact with S3.

Please research how to use S3's native expiration functionality. Specifically: how can we use boto to set an expiration and have AWS do the deletions for us?

This may help get you started:

https://aws.amazon.com/blogs/aws/amazon-s3-object-expiration/

tomhoag commented 7 years ago

I just went through the AWS docs on this recently for another project of mine.

Can you tell me more about how the objects are being created and which buckets they are being stored in?

With AWS S3, you can set up expiration and transition rules on a bucket. The rules use object prefixes or tags to determine which objects should be acted upon, and they are time-based, keyed off the object creation time. (It gets a bit more involved if the objects are versioned.)

If the boto code is creating buckets and putting objects into them, it would probably be best to take a closer look at the AWS S3 Python library to see whether it supports S3 rule creation. If there is a small number of static buckets that are used day in and day out for storing web scrapings, it might be easier to use the AWS console to create the rules.

In the latter case, the only change needed would be standardizing the object prefixes/tags that boto is using so that the rules don't have to be overly complicated.
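On the first option: if I remember right, the legacy boto library does expose lifecycle configuration through `boto.s3.lifecycle`. A rough sketch (the bucket name and prefix below are placeholders, not anything from this repo) would look something like this:

```python
import boto
from boto.s3.lifecycle import Lifecycle, Expiration

# Placeholder names -- adjust to however the crawler actually
# organizes its buckets and key prefixes.
conn = boto.connect_s3()
bucket = conn.get_bucket('my-crawl-bucket')

lifecycle = Lifecycle()
# Delete anything under crawls/ 30 days after it was created.
lifecycle.add_rule(id='expire-crawls', prefix='crawls/',
                   status='Enabled', expiration=Expiration(days=30))
bucket.configure_lifecycle(lifecycle)

# Read the rules back as a sanity check.
for rule in bucket.get_lifecycle_config():
    print(rule.id, rule.prefix, rule.expiration)
```

Once a rule like that is in place, AWS handles the deletions itself; no cron job or manual sweep needed.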

One other thought: rules can also be used to transition S3 objects into low-cost, slow-access AWS Glacier storage. If there's any chance the scraped data might be useful someday, it may be worthwhile to transition it to Glacier before deleting it from S3.
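If that route looks attractive, the same lifecycle API should cover it; a rough sketch along the lines of the one above (again with placeholder names) might be:

```python
import boto
from boto.s3.lifecycle import Lifecycle, Expiration, Transition

conn = boto.connect_s3()
bucket = conn.get_bucket('my-crawl-bucket')  # placeholder name

lifecycle = Lifecycle()
# Move objects under crawls/ to Glacier after 30 days,
# then delete them outright after a year.
lifecycle.add_rule(id='archive-then-expire', prefix='crawls/',
                   status='Enabled',
                   expiration=Expiration(days=365),
                   transition=Transition(days=30, storage_class='GLACIER'))
bucket.configure_lifecycle(lifecycle)
```

One caveat worth noting: as far as I know, setting the lifecycle replaces the bucket's entire existing configuration, so all the rules for a bucket need to go in a single call.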