bloomreach / s4cmd

Super S3 command line tool
Apache License 2.0

Check md5 after downloading from S3 #90

Open Schmed opened 7 years ago

Schmed commented 7 years ago

We've recently found a few corrupt local files that had been downloaded from S3 using s4cmd get --sync-check. Retrying the same download in a separate s4cmd invocation resolved the problem. We have seen this on two completely separate, but similarly configured, EC2 instances. We were using version 2.0.1.

Since this command already leverages the MD5 hash saved in the S3 metadata (even, apparently, for multi-part S3 objects), it's surprising that the MD5 is not automatically validated against the local copy after the download completes. Computing the MD5 of even a large local file is fairly quick on a reasonably powerful system, but you could always provide an option to skip the check in the interest of performance. Ideally, a failed check would be logged and the download retried (at least --retry times).
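
To make the request concrete, here is a rough sketch of the verify-then-retry behaviour I have in mind. It is not s4cmd's code: `download` is a hypothetical callable standing in for whatever actually fetches the object, `expected_md5` is assumed to be the hex MD5 read from the S3 metadata, and the `retries` parameter only mimics what --retry would control.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a local file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_with_verify(download, path, expected_md5, retries=3):
    """Call download(path) until the local MD5 matches expected_md5.

    `download` is a hypothetical callable that writes the object to `path`.
    `expected_md5` is assumed to come from the object's S3 metadata.
    Returns the number of attempts used; raises IOError if every attempt fails.
    """
    for attempt in range(1, retries + 1):
        download(path)
        actual = file_md5(path)
        if actual == expected_md5.lower():
            return attempt
        # Log the failed check, then discard the corrupt copy before retrying.
        print("MD5 mismatch on attempt %d: expected %s, got %s"
              % (attempt, expected_md5, actual))
        os.remove(path)
    raise IOError("MD5 check failed after %d attempts: %s" % (retries, path))
```

Reading in chunks keeps memory use flat for large files, and removing the bad copy on a mismatch means a failed run never leaves a corrupt file behind that looks like a finished download.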

Schmed commented 7 years ago

Obviously, a similar check could be performed after an s4cmd put to ensure the integrity of the resulting S3 object.