broadinstitute / lincs-cell-painting

Processed Cell Painting Data for the LINCS Drug Repurposing Project
BSD 3-Clause "New" or "Revised" License

Update DVC location #86

Closed shntnu closed 2 years ago

shntnu commented 2 years ago

I ran this command to copy the DVC files to a new location:

aws s3 sync   \
  --profile jump-cp-role  \
  --acl bucket-owner-full-control  \
  --metadata-directive REPLACE \
  s3://cellpainting-datasets/lincs-cell-painting/.dvc/cache/ \
  s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/software/lincs-cell-painting_DVC/
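One way to sanity-check a sync like the one above is to compare the object count and total size that `aws s3 ls --recursive --summarize` prints for each prefix. The sketch below only parses that summary; `s3_summary` is a hypothetical helper name (not part of the AWS CLI), and the commented usage assumes the same profile can list both buckets.

```shell
# Sketch: pull "Total Objects" and "Total Size" out of the summary that
# `aws s3 ls --recursive --summarize` appends to its listing.
# s3_summary is a hypothetical helper, not an AWS CLI command.
s3_summary() {
  awk '$1 == "Total" && $2 == "Objects:" { objects = $3 }
       $1 == "Total" && $2 == "Size:"    { bytes = $3 }
       END { print objects, bytes }'
}

# Usage (requires credentials; assumes the profile can list both buckets):
# aws s3 ls --recursive --summarize --profile jump-cp-role \
#   s3://cellpainting-datasets/lincs-cell-painting/.dvc/cache/ | s3_summary
# aws s3 ls --recursive --summarize --profile jump-cp-role \
#   s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/software/lincs-cell-painting_DVC/ | s3_summary
```

If both invocations print the same pair of numbers, the two prefixes hold the same number of objects and bytes; this does not compare object contents.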

The rest of the files (other than .dvc/cache) in s3://cellpainting-datasets/lincs-cell-painting/ seem to be a git clone of this repo (see below)

image

Note: At present, I still don't have a good way to provide permissions to push to this new location (i.e. unlike with s3://cellpainting-datasets/ where I could give @gwaybio AWS credentials to push to the S3 bucket, I can't do that yet with s3://cellpainting-gallery/.)

If that's fine with you @gwaybio, please merge this PR (and then I will delete s3://cellpainting-datasets/lincs-cell-painting/)

shntnu commented 2 years ago
  • Have you tested dvc pull locally using the new aws location?

Yep; I get the same counts/sizes for both locations

$ find profiles/2016_04_01_a549_48hr_batch1|wc -l
     953

$ find profiles/2017_12_05_Batch2|wc -l
     939

$ du -depth=0  profiles/2016_04_01_a549_48hr_batch1
2514952 profiles/2016_04_01_a549_48hr_batch1

$ du -depth=0  profiles/2017_12_05_Batch2
3041720 profiles/2017_12_05_Batch2
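The comparison above can also be scripted. The following is a minimal sketch, assuming two separate checkouts that were each populated with `dvc pull` from one of the two locations; `tree_stats` and the checkout paths are hypothetical. Unlike `find ... | wc -l`, it counts regular files only, and it sums byte sizes rather than disk blocks.

```shell
# Sketch: print "<file count> <total bytes>" for a directory tree.
# tree_stats is a hypothetical helper name.
tree_stats() {
  # Number of regular files under the directory ($1).
  count=$(find "$1" -type f | wc -l | tr -d ' ')
  # Sum per-file byte counts, skipping the "N total" lines wc adds per batch.
  bytes=$(find "$1" -type f -exec wc -c {} + |
    awk '!(NF == 2 && $2 == "total") { s += $1 } END { print s + 0 }')
  echo "$count $bytes"
}

# Usage (checkout paths are placeholders):
# [ "$(tree_stats old-checkout/profiles)" = "$(tree_stats new-checkout/profiles)" ] \
#   && echo "counts and sizes match"
```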
  • Ensuring the full repo in the AWS bucket is in sync with GitHub updates requires an extra manual step. (This is fine; I just think the GitHub repo should remain the source of truth.)

One does not need to do this, right? That is, there is no need to clone the full repo in the bucket. To clarify, the only folder that needs to exist is s3://cellpainting-datasets/lincs-cell-painting/.dvc/cache, which I have now synced to a new location.
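For an existing clone, repointing DVC at the new cache location might look like the sketch below. It is an illustration only: `switch_dvc_remote` is a hypothetical helper, and the remote name is an assumption that has to match whatever `dvc remote list` reports for this repo.

```shell
# Sketch: point an existing DVC remote at a new URL, then list the
# configured remotes to confirm the change.
# switch_dvc_remote is a hypothetical helper name.
switch_dvc_remote() {
  # $1 = remote name as shown by `dvc remote list`, $2 = new remote URL
  dvc remote modify "$1" url "$2" && dvc remote list
}

# Usage (REMOTE_NAME is a placeholder for the remote configured in this repo):
# switch_dvc_remote REMOTE_NAME \
#   s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/software/lincs-cell-painting_DVC
# dvc pull
```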