broadinstitute / lincs-cell-painting

Processed Cell Painting Data for the LINCS Drug Repurposing Project
BSD 3-Clause "New" or "Revised" License

Adding profiles to dvc #66

Closed: gwaybio closed this issue 3 years ago

gwaybio commented 3 years ago

I am working on this now.

Asks

@shntnu

Cross references

A couple cross-references to track history of DVC discussions:

shntnu commented 3 years ago
  • [ ] Can you provide me with access (and a pointer) to which AWS bucket to use for permanent dvc storage and access?

The bucket reference here is the one I had in mind

https://registry.opendata.aws/cell-painting-image-collection/

I will go pull up stuff now

shntnu commented 3 years ago

Hm – the only trouble is that the bucket is called cytodata, which will be a bit odd. Crud. It's on my plate to create a new AWS Open Data Resource, but I don't have an ETA.

To unblock you, I'd suggest we go ahead and deposit it at s3://cellpainting-datasets instead. You already have credentials for that bucket (they are the same as for our primary AWS account).

There's some chance we may need to change that, but at least it will keep you moving.

IIUC the change is not too hard

https://github.com/gwaygenomics/grit-benchmark/blob/a04d010b2f579d5dd0cfdc2c9222c2d7f02b9a84/.dvc/config#L4

and you'd only need to modify the URL.

The file pointer will remain the same as long as we keep the relative paths the same and don't modify the file (otherwise its md5 will change).

https://github.com/cytomining/profiling-template/issues/13#issue-837044642
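To illustrate why only the URL would need to change, here is a minimal Python sketch. It assumes a DVC 2.x-style remote layout, in which an object is stored under a key derived from its md5 (first two hex characters as a directory, the rest as the file name). The prefixes, the second bucket name, and the file content are all made up for illustration.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hex md5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def remote_object_url(remote_url: str, md5: str) -> str:
    """Build the object URL under an ASSUMED DVC 2.x-style layout:
    <remote>/<first 2 hex chars>/<remaining 30 hex chars>."""
    return f"{remote_url.rstrip('/')}/{md5[:2]}/{md5[2:]}"


# Made-up file content standing in for a profile file.
profile = b"well\tfeature\nA01\t0.5\n"
checksum = md5_hex(profile)

# Hypothetical remote URLs: the prefix and the second bucket name are assumptions.
old = remote_object_url("s3://cellpainting-datasets/hypothetical-prefix", checksum)
new = remote_object_url("s3://hypothetical-new-bucket/hypothetical-prefix", checksum)

# The hash-derived part of the key is identical under both remotes.
print(old.split("/")[-2:] == new.split("/")[-2:])

# Changing the file content changes the md5, and with it the pointer.
print(md5_hex(profile + b"A02\t0.7\n") == checksum)
```

Under these assumptions, moving to a different bucket changes only the remote URL prefix, while any edit to the file itself produces a new md5 and therefore a new pointer.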

gwaybio commented 3 years ago

I am planning on adding all level 3-4 data to dvc, but keeping the level 5 and spherized profiles as git lfs files. We use the level 3-4 data less frequently, and we often read the level 5 and spherized profiles directly from their github urls.

While dvc also has a nifty way of accessing dvc-tracked files from github urls in python, it is not a drop-in replacement for reading a file directly from its url.

We get the best of both worlds having the lower level profiles on s3 and the more interactive files versioned through git lfs.
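For reference, a minimal sketch of what reading a DVC-tracked file through the Python API might look like. It assumes the `dvc` package (with S3 support) is installed and that the caller can reach the remote. The function names are hypothetical, and the file path in the usage comment is a placeholder, not a real file in the repo.

```python
def read_dvc_tracked_bytes(path, repo, rev=None):
    """Return the raw bytes of a DVC-tracked file in a Git repository.

    `path` is relative to the repo root, `repo` is a Git URL or local path,
    and `rev` is an optional Git revision (branch, tag, or commit).
    """
    # Imported lazily so this module still loads where dvc is not installed.
    import dvc.api

    return dvc.api.read(path, repo=repo, rev=rev, mode="rb")


def dvc_storage_url(path, repo, rev=None):
    """Return the remote-storage URL (e.g. s3://...) that DVC resolves for `path`."""
    import dvc.api

    return dvc.api.get_url(path, repo=repo, rev=rev)


# Usage (not executed here; the profile path is a hypothetical placeholder):
# data = read_dvc_tracked_bytes(
#     "profiles/hypothetical_plate/hypothetical_plate_normalized.csv.gz",
#     repo="https://github.com/broadinstitute/lincs-cell-painting",
# )
```

Compared with passing a github url straight to a file reader, this route needs the `dvc` package and access to the remote, which is the sense in which it is not a drop-in replacement.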

shntnu commented 3 years ago

> We get the best of both worlds having the lower level profiles on s3 and the more interactive files versioned through git lfs.

Nice plan!

For my notes:

You can't directly access a DVC-versioned file via URL, because what lives in the repo is only a pointer, which looks like this: https://github.com/gwaygenomics/grit-benchmark/blob/6b826a03456b5e0d6437aff99e17a407653c2568/1.calculate-metrics/cell-health/results/cell_health_grit_compartments.tsv.gz.dvc

With Git LFS, by contrast, you can access the file directly via URL: https://github.com/gwaygenomics/grit-benchmark/blob/main/0.download-data/data/ceres.csv (click on "View raw")
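The difference can be sketched in Python. The helper names are hypothetical, and the `/blob/` to `/raw/` rewrite is an assumption about how the "View raw" link is usually formed on GitHub, not something confirmed in this thread.

```python
def github_raw_url(blob_url: str) -> str:
    """Turn a GitHub 'blob' page URL into the matching 'raw' URL (assumed pattern)."""
    return blob_url.replace("/blob/", "/raw/", 1)


def is_dvc_pointer(url: str) -> bool:
    """A path ending in .dvc is a DVC pointer file, not the data itself."""
    return url.endswith(".dvc")


# The two URLs quoted in the comment above.
lfs_file = (
    "https://github.com/gwaygenomics/grit-benchmark/blob/main/"
    "0.download-data/data/ceres.csv"
)
dvc_file = (
    "https://github.com/gwaygenomics/grit-benchmark/blob/"
    "6b826a03456b5e0d6437aff99e17a407653c2568/1.calculate-metrics/"
    "cell-health/results/cell_health_grit_compartments.tsv.gz.dvc"
)

for url in (lfs_file, dvc_file):
    raw = github_raw_url(url)
    kind = "DVC pointer (metadata only)" if is_dvc_pointer(raw) else "file content"
    print(f"{kind}: {raw}")
```

Fetching the raw URL of the `.dvc` file would return only the small pointer (hash and path), so the data still has to come from the DVC remote.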

shntnu commented 3 years ago

BTW for level 5 + spherized – I am guessing it isn't practical to have them live in both DVC and Git LFS? I ask because it would be convenient to be able to get all the data from the bucket alone, if one would like to do so.

We needn't do that for this dataset, but I was just wondering if there's any path that will allow us to do so in the future.

gwaybio commented 3 years ago

My plan is to add this whole repo to the S3 bucket - we'll be able to access dvc files from where they live naturally, and we'll be able to access git lfs files via bucket or media url

shntnu commented 3 years ago

> My plan is to add this whole repo to the S3 bucket - we'll be able to access dvc files from where they live naturally, and we'll be able to access git lfs files via bucket or media url

Oh, interesting – I'm curious to see what you mean by "adding the repo to the S3 bucket". I can wait for the PR; no need to explain right now.