DataBiosphere / ssds

Simple data storage system for AWS and GCP
MIT License

Speed Up Transfer For Backups #148

Closed: juklucas closed this issue 3 years ago

juklucas commented 3 years ago

I tried to use:

ssds staging upload \
    --deployment default \
    --submission-id 809dd888-fe56-4535-8cc8-1121f379129c \
    --name WUSTL_OTHER_HiFi_w_SUBREADS \
    s3://human-pangenomics/submissions/8fa7bde9-be6f-4160-97a9-b639a8962c66--WUSTL_OTHER_HiFi/ 

But it took more than 12 hours to upload a single *.subreads.bam file (~750 GB). When I use the AWS CLI to copy 3 TB of similar data, it takes about 2 hours.
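For comparison, the AWS CLI transfer was an S3-to-S3 copy roughly like the following (bucket and prefix names here are illustrative, not the exact paths I used). Because both endpoints are S3, the CLI issues server-side copy requests, so the object data never passes through the client and nothing is re-checksummed locally:

aws s3 cp --recursive \
    s3://src-bucket/subreads/ \
    s3://backup-bucket/subreads/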

This use case was specific, but it will likely come up again: we have ~30 samples, each with ~3 TB of subreads files, that we want to preserve but that no one actively uses. We would like to back them up outside of the submission area, so we don't have to pay to store this data in Terra/GCP but still retain it.

It would be nice to issue a command like:

ssds staging general-cp \
    s3://dest/path \
    s3://src_path 

Where the copy command does not recalculate checksums.

I have listed the command as cp rather than mv because we will want to back up the entire directory and then delete the huge files in place. This is simpler than moving specific files -- and there will likely be cases where copying is actually needed (as may be the case for creating "releases").
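As a sketch of the intended workflow (the general-cp subcommand does not exist yet, the backup destination below is made up, and the argument order follows the proposal above), the backup-then-delete step might look like:

# back up the whole submission prefix without recomputing checksums
# (destination first, then source, per the proposed signature;
#  the backups/ prefix is hypothetical)
ssds staging general-cp \
    s3://human-pangenomics/backups/WUSTL_OTHER_HiFi/ \
    s3://human-pangenomics/submissions/8fa7bde9-be6f-4160-97a9-b639a8962c66--WUSTL_OTHER_HiFi/

# then delete only the huge subreads files in place
aws s3 rm --recursive \
    --exclude "*" --include "*.subreads.bam" \
    s3://human-pangenomics/submissions/8fa7bde9-be6f-4160-97a9-b639a8962c66--WUSTL_OTHER_HiFi/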

xbrianh commented 3 years ago

closed via #170