freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Restore financial disclosures from S3 deep glacier #1658

Closed mlissner closed 3 years ago

mlissner commented 3 years ago

The AWS console makes it very easy, cheap, and fast to move things to deep glacier storage, but makes it very hard, expensive, and slow to restore them. I made the mistake on Wednesday of moving a directory that I thought only had old data into deep glacier.

The process for restoring it has been annoying. First, you have to make a restore request for every object. This is weirdly slow going, and you have to say how long you want the objects to be restored for. The command I used for that was:

s3cmd restore --recursive s3://$my_bucket/us/federal

This took several hours to run. There are guides online about parallelizing it, but I didn't bother. I also forgot to set the --restore-days parameter, which sets how long you want the restored data to stick around for. The default number of days isn't documented.
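Next time I'd set that explicitly, something like this (the 30 days below is just a guess at a reasonable value):

s3cmd restore --recursive --restore-days=30 s3://$my_bucket/us/federal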

Then you wait up to 12 hours as AWS does the restore. Fine.
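While waiting, a handy way to check whether a given object has been restored yet is something like this (the key below is just a placeholder):

aws s3api head-object --bucket $my_bucket --key us/federal/some-object.pdf

The response includes a Restore field showing ongoing-request="true" while AWS is still working, and an expiry-date once the temporary copy is available.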

Finally, the items are accessible again, BUT it's only a temporary copy. To make it permanent, you have to copy it in place with something like:

aws s3 cp s3://$my_bucket/us/federal s3://$my_bucket/us/federal --force-glacier-transfer --storage-class STANDARD --recursive --profile storage
mlissner commented 3 years ago

Oh, I also saw some hints that s5cmd is faster (it's written in Go). Might be worth checking out in a pinch next time.

mlissner commented 3 years ago

I heard from colleagues today that the files were not publicly accessible. It looks like this is because the cp command resets their permissions. Annoying, but I was able to easily fix this via the console. I haven't double checked this at all, but I believe in the future I should use --acl public-read when doing the cp to prevent this.
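So next time the in-place copy should probably look something like this (untested):

aws s3 cp s3://$my_bucket/us/federal s3://$my_bucket/us/federal --force-glacier-transfer --storage-class STANDARD --recursive --acl public-read --profile storage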

I also found that a useful command for testing things is something like:

aws s3api list-objects \
    --bucket 'xxx' \
    --prefix us/federal/judicial/financial-disclosures/ \
    --query 'Contents[?StorageClass!=`STANDARD`]'

That'll show you the files under a bucket/prefix that aren't in the STANDARD storage class. There are some additional tips here too.