BD2KGenomics / s3am

A fast, parallel, streaming multipart uploader for S3

--exists-overwrite does not overwrite? #57

Closed Jeltje closed 8 years ago

Jeltje commented 8 years ago

Using s3am (2.0a1.dev105) for download and s3am (2.0a1.dev99) for upload.

After moving BAM files from a collaborator's inbox to our encrypted directory and running analysis, I noticed several truncated files. I asked the collaborator to re-upload, then ran s3am upload --exists overwrite to move the files. The sizes appeared to be the same between the two locations until I looked very closely with the aws client:

aws --region us-west-2 s3 ls --recursive --summarize cgl-driver-projects-encrypted | grep DTB-157-BL-T-DNA-HSmerge_S1.bam
2016-07-20 16:08:14 8769003598 ucsf-pnoc/ucsf_issue52_input/DTB-157-BL-T-DNA-HSmerge_S1.bam
aws --region us-west-2 s3 ls --recursive --summarize cgl-inbox-su2c-ucsf | grep DTB-157-BL-T-DNA-HSmerge_S1.bam
2016-07-20 17:51:47 8769087566 DTB-157-BL-T-DNA-HSmerge_S1.bam

And indeed, s3am download from the encrypted location results in the same truncated file as before (VERY truncated, actually, only chrs M and 1 have reads). Downloading directly from the inbox gives a complete file. The downloaded file sizes are exactly as listed on S3.
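The two listings above differ only in their last few digits, which is easy to miss by eye. A minimal sketch of scripting that comparison, assuming the usual `aws s3 ls --recursive` line layout of date, time, size in bytes, and key (the helper names `parse_ls_line` and `size_difference` are hypothetical, not part of s3am or the aws client):

```python
from typing import NamedTuple


class Listing(NamedTuple):
    timestamp: str  # "YYYY-MM-DD HH:MM:SS" as printed by the aws client
    size: int       # object size in bytes
    key: str        # object key within the bucket


def parse_ls_line(line: str) -> Listing:
    """Parse one object line of `aws s3 ls --recursive` output."""
    # Split into at most four fields so that keys containing spaces survive.
    date, time, size, key = line.strip().split(None, 3)
    return Listing(f"{date} {time}", int(size), key)


def size_difference(src_line: str, dst_line: str) -> int:
    """Return source size minus destination size, in bytes."""
    return parse_ls_line(src_line).size - parse_ls_line(dst_line).size


# The two lines reported in this issue.
inbox = "2016-07-20 17:51:47 8769087566 DTB-157-BL-T-DNA-HSmerge_S1.bam"
encrypted = (
    "2016-07-20 16:08:14 8769003598 "
    "ucsf-pnoc/ucsf_issue52_input/DTB-157-BL-T-DNA-HSmerge_S1.bam"
)

print(size_difference(inbox, encrypted))  # 83968
```

The listings also show the destination object with an earlier modification time (16:08:14) than the re-uploaded source (17:51:47). Assuming both timestamps are in the same time zone, that may be worth checking alongside the size.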

The exact copy command:

s3am upload --exists overwrite --src-sse-key-file ./20160707.key --sse-key-file ./master.key --sse-key-is-master --download-slots 40 --upload-slots 40 --part-size 50M s3://cgl-inbox-su2c-ucsf/DTB-157-BL-T-DNA-HSmerge_S1.bam s3://cgl-driver-projects-encrypted/ucsf-pnoc/ucsf_issue52_input/DTB-157-BL-T-DNA-HSmerge_S1.bam

Let me know if you want the collaborator key.

hannes-ucsc commented 8 years ago

I doubt this has to do with --overwrite. It's more likely a consistent bug in upload.

Could you delete the destination file from cgl-driver-projects-encrypted and retry the upload with --debug?

Jeltje commented 8 years ago

That corrects the size (and so probably the issue):

s3am upload --debug --src-sse-key-file ./20160707.key --sse-key-file ./master.key --sse-key-is-master --download-slots 40 --upload-slots 40 --part-size 50M s3://cgl-inbox-su2c-ucsf/DTB-157-BL-T-DNA-HSmerge_S1.bam s3://cgl-driver-projects-encrypted/ucsf-pnoc/ucsf_issue52_input/DTB-157-BL-T-DNA-HSmerge_S1.bam 2>upload.err

http://hgwdev.cse.ucsc.edu/~jeltje/toil/upload.err

aws --region us-west-2 s3 ls --recursive --summarize cgl-driver-projects-encrypted | grep DTB-157-BL-T-DNA-HSmerge_S1.bam
2016-07-21 13:38:01 8769087566 ucsf-pnoc/ucsf_issue52_input/DTB-157-BL-T-DNA-HSmerge_S1.bam

Not sure what this tells you about the attempt to overwrite?

hannes-ucsc commented 8 years ago

> Not sure what this tells you about the attempt to overwrite?

Nothing. Just trying to rule out the possibility of a truncation issue with upload.

Is it possible that the 8769003598 version of that file in the cgl-driver-projects-encrypted bucket was produced using either a different key or without --sse-key-is-master?

Jeltje commented 8 years ago

Nope. Also, that would not produce anything I could download later, I think. BAM file goes in, truncated BAM file comes out. I can do another --exists overwrite with --debug on a file I did not delete, if that helps?

hannes-ucsc commented 8 years ago

Yes, that'd be great. Can you run

# upload local file
s3am upload --exists overwrite --debug --sse-key-file ./master.key --sse-key-is-master --download-slots 40 --upload-slots 40 --part-size 50M file:///some/local/file s3://cgl-driver-projects-encrypted/ucsf-pnoc/ucsf_issue52_input/DTB-157-BL-T-DNA-HSmerge_S1.bam

aws --region us-west-2 s3 ls --recursive --summarize cgl-driver-projects-encrypted | grep DTB-157-BL-T-DNA-HSmerge_S1.bam

# copy file from inbox again
s3am upload --exists overwrite --debug --src-sse-key-file ./20160707.key --sse-key-file ./master.key --sse-key-is-master --download-slots 40 --upload-slots 40 --part-size 50M s3://cgl-inbox-su2c-ucsf/DTB-157-BL-T-DNA-HSmerge_S1.bam s3://cgl-driver-projects-encrypted/ucsf-pnoc/ucsf_issue52_input/DTB-157-BL-T-DNA-HSmerge_S1.bam

aws --region us-west-2 s3 ls --recursive --summarize cgl-driver-projects-encrypted | grep DTB-157-BL-T-DNA-HSmerge_S1.bam

Just add output redirection as necessary.

Jeltje commented 8 years ago

That overwrites the file correctly.

first upload: 2016-07-21 19:42:27 11724 ucsf-pnoc/ucsf_issue52_input/DTB-157-BL-T-DNA-HSmerge_S1.bam
second upload: 2016-07-21 19:46:18 8769087566 ucsf-pnoc/ucsf_issue52_input/DTB-157-BL-T-DNA-HSmerge_S1.bam

I tried to overwrite one of the other trouble files using the exact command I used before (same bash script), and it appears to work fine.

In other words, can't reproduce the issue.
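Since the problem could not be reproduced, a cheap local check on downloaded BAM files may help catch a recurrence early. The sketch below is not part of s3am. It tests whether a file ends with the 28-byte empty BGZF block that the SAM/BAM specification defines as the end-of-file marker (the function name `has_bgzf_eof` is hypothetical). A missing marker suggests a file cut off mid-stream. A present marker does not prove the file is the current version, so it would not necessarily have flagged the stale copy described in this issue.

```python
import os

# Empty BGZF block used as the BAM end-of-file marker (28 bytes).
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff0600" "42430200"
    "1b000300" "00000000" "00000000"
)


def has_bgzf_eof(path: str) -> bool:
    """Return True if the file at `path` ends with the BGZF EOF marker."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False  # too short to contain the marker at all
    with open(path, "rb") as f:
        # Read exactly the last 28 bytes of the file.
        f.seek(-len(BGZF_EOF), os.SEEK_END)
        return f.read() == BGZF_EOF
```

Usage would be along the lines of `has_bgzf_eof("DTB-157-BL-T-DNA-HSmerge_S1.bam")` on the local copy after an s3am download.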