Closed gwaybio closed 3 years ago
- [ ] Can you provide me with access (and a pointer) to which AWS bucket to use for permanent dvc storage and access?
The bucket reference here is the one I had in mind
https://registry.opendata.aws/cell-painting-image-collection/
I will go pull up stuff now
Hm – the only trouble is that the bucket is called cytodata, which will be a bit odd. Crud. It's on my plate to create a new AWS Open Data Resource, but I don't have an ETA.
To unblock you, I'd suggest we go ahead with depositing it at s3://cellpainting-datasets instead. You already have credentials for that (same as our primary AWS account).
There's some chance we may need to change that, but at least it will keep you moving.
IIUC the change is not too hard, and you'd only need to modify the URL.
The file pointer will remain the same as long as we keep the relative paths the same, and don't modify the file (otherwise md5 will change)
https://github.com/cytomining/profiling-template/issues/13#issue-837044642
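A minimal sketch of why the pointer survives a bucket move: the hash depends only on the file's bytes, not on where the file is stored. This uses plain `hashlib` over raw bytes; `file_md5` is a hypothetical helper, and DVC's own hashing may differ in details, so treat it as an illustration rather than DVC's implementation.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex md5 of a file's bytes, reading in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large profile files are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name, used only for this demonstration.
        profile = Path(tmp) / "profiles.csv"
        profile.write_bytes(b"well,feature\nA01,0.5\n")
        before = file_md5(profile)

        # Same bytes at a different location give the same hash.
        moved = Path(tmp) / "moved.csv"
        moved.write_bytes(profile.read_bytes())
        print(before == file_md5(moved))

        # Editing the content changes the hash, so the pointer would change too.
        profile.write_bytes(b"well,feature\nA01,0.6\n")
        print(before == file_md5(profile))
```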
I am planning on adding all level 3-4 data to dvc, but keeping level 5 and spherized profiles as git lfs files. We use the level 3-4 data less frequently, and we often read the level 5 and spherized profiles directly from their github urls.
While dvc also has a nifty way of directly interacting with dvc files from github urls in python, it is not a direct drop-in solution for reading directly from url.
We get the best of both worlds having the lower level profiles on s3 and the more interactive files versioned through git lfs.
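For reference, a rough sketch of the Python route via `dvc.api.open`. It assumes the `dvc` package (with S3 support) is installed and that the caller has read access to the configured remote. The repo, revision, and path come from the grit-benchmark pointer linked in this thread; `read_dvc_tracked_rows` is a hypothetical helper, and I have not confirmed that this particular file resolves.

```python
import csv
import gzip

# Values taken from the grit-benchmark pointer linked in this thread.
REPO = "https://github.com/gwaygenomics/grit-benchmark"
REV = "6b826a03456b5e0d6437aff99e17a407653c2568"
# Path of the tracked file itself, without the trailing ".dvc".
PATH = "1.calculate-metrics/cell-health/results/cell_health_grit_compartments.tsv.gz"


def read_dvc_tracked_rows(path=PATH, repo=REPO, rev=REV, limit=5):
    """Return the first `limit` rows of a gzipped TSV tracked by DVC."""
    # Imported lazily so this module still loads where dvc is not installed.
    import dvc.api

    # dvc.api.open resolves the pointer and streams the file from the remote.
    with dvc.api.open(path, repo=repo, rev=rev, mode="rb") as remote_file:
        with gzip.open(remote_file, mode="rt", newline="") as text:
            reader = csv.reader(text, delimiter="\t")
            return [row for _, row in zip(range(limit), reader)]
```

Calling `read_dvc_tracked_rows()` needs network access and credentials, which is part of why this is not a drop-in replacement for reading a plain URL.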
Nice plan!
For my notes:
You can't directly access a DVC-versioned file via URL because the pointer looks like this: https://github.com/gwaygenomics/grit-benchmark/blob/6b826a03456b5e0d6437aff99e17a407653c2568/1.calculate-metrics/cell-health/results/cell_health_grit_compartments.tsv.gz.dvc
Whereas you can directly access the files via URL for GitLFS https://github.com/gwaygenomics/grit-benchmark/blob/main/0.download-data/data/ceres.csv (click on "View raw")
BTW for level 5 + spherized – I am guessing it isn't practical to have them live in both, DVC and GitLFS? I ask because it will be convenient to be able to get all the data from the bucket alone if one would like to do so.
We needn't do that for this dataset, but I was just wondering if there's any path that will allow us to do so in the future.
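On the URL point in the notes above: the "View raw" link can be derived from the blob URL by swapping the `blob` path segment for `raw`. A small sketch, where `blob_to_raw_url` is a hypothetical helper and the assumed layout is `owner/repo/blob/ref/path`:

```python
from urllib.parse import urlsplit, urlunsplit


def blob_to_raw_url(blob_url):
    """Convert a GitHub blob URL into the matching raw URL."""
    parts = urlsplit(blob_url)
    segments = parts.path.lstrip("/").split("/")
    # Expected layout: owner / repo / "blob" / ref / path...
    if len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    segments[2] = "raw"
    return urlunsplit((parts.scheme, parts.netloc, "/" + "/".join(segments), "", ""))


if __name__ == "__main__":
    print(blob_to_raw_url(
        "https://github.com/gwaygenomics/grit-benchmark/blob/main/0.download-data/data/ceres.csv"
    ))
```

Applied to a `.dvc` pointer, the same conversion only returns the small pointer file, not the data.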
My plan is to add this whole repo to the S3 bucket - we'll be able to access dvc files from where they live naturally, and we'll be able to access git lfs files via bucket or media url
Oh, interesting – curious to see what you mean by "adding the repo to the S3 bucket"; I can wait for the PR, no need to explain right now
I am working on this now.
Asks
@shntnu
Cross references
A couple cross-references to track history of DVC discussions: