ccodwg / Covid19CanadaArchive

Canadian COVID-19 Data Archive
https://opencovid.ca

Consider not uploading files with identical checksums #257

Closed jeanpaulrsoucy closed 1 year ago

jeanpaulrsoucy commented 2 years ago

In the future, we should consider not storing duplicate copies of a file when its checksum is identical to that of the previously archived copy. We could log each skipped upload, so that files skipped because of an identical checksum can be distinguished from files that were not saved due to a program or website issue.
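A minimal sketch of the idea above, assuming SHA-256 as the checksum and assuming the checksum of the previously archived copy is already available. The names `sha256_hex` and `should_upload` and the logger name `archiver` are hypothetical and not taken from the project's code.

```python
import hashlib
import logging
from typing import Optional

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("archiver")


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()


def should_upload(name: str, data: bytes, last_checksum: Optional[str]) -> bool:
    """Decide whether a newly downloaded file needs to be stored.

    `last_checksum` is the checksum of the most recently archived copy,
    or None if no copy has been archived yet.
    """
    checksum = sha256_hex(data)
    if last_checksum is not None and checksum == last_checksum:
        # Log the skip explicitly so it is not mistaken for a failed download.
        logger.info("%s: skipped, checksum identical to previous copy (%s)", name, checksum)
        return False
    return True
```

A failed download would never reach `should_upload`, so the "skipped" log line appears only for genuine duplicates.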

ETags are unreliable for larger files, since the ETag of a multipart upload will not match a locally calculated MD5 hash. S3 now has additional options for calculating and requesting checksums: https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/
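To illustrate the mismatch, here is a sketch of the commonly observed multipart ETag format: the MD5 of the concatenated per-part MD5 digests, followed by a hyphen and the part count. This format is an assumption based on observed S3 behaviour rather than a documented guarantee, it does not apply to every encryption mode, and the function names are hypothetical.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Plain MD5 of the whole file, as a hex string."""
    return hashlib.md5(data).hexdigest()


def multipart_etag(data: bytes, part_size: int) -> str:
    """Emulate the commonly observed ETag of a multipart upload.

    Assumption: the ETag is the MD5 of the concatenated binary MD5
    digests of each part, plus "-<number of parts>".
    """
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    combined = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(combined).hexdigest()}-{len(parts)}"
```

Reproducing the ETag locally this way requires knowing the part size used at upload time, which is one reason the ETag is a fragile basis for comparison.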

Of course, the alternative is to use the API to request the checksum of the most recent file, which would be simpler but less generalizable.

jeanpaulrsoucy commented 1 year ago

Implemented with #267.