ErinWeisbart opened 1 year ago
I piloted this idea with https://cellpainting-gallery.s3.amazonaws.com/index.html#cpg0003-rosetta/broad/workspace/preprocessed_data/ (see https://www.nature.com/articles/s41592-022-01667-0#data-availability) but in retrospect, figshare would have been a better option for this use case.
My main concern is that publications and datasets may not align nicely. A single publication might have more than one dataset and vice versa and things can get confusing.
What do you think of using figshare instead? We did that with https://nih.figshare.com/articles/dataset/Cell_Health_-_Cell_Painting_Single_Cell_Profiles/9995672/1 (https://www.molbiolcell.org/doi/full/10.1091/mbc.E20-12-0784)
I am not terribly opposed to your proposal though and it's perfectly fine to pilot the idea and see where it goes.
Side note: we are piloting using cpg as a DVC backend, but that should generally align well with this proposal.
I was thinking it was preferable to put everything we can in cpg instead of having elements distributed across tools (like figshare). Can you explain why you think figshare is a better option in general? I was only thinking of one-to-many mappings of publications to one dataset - I agree the above doesn't fit nicely for many datasets to one publication. Are you saying that Rosetta in general doesn't fit in cpg and that the whole project should have gone to figshare?
The current use case I'm thinking of is PERISCOPE, and it would certainly work to use `workspace/software/` as a DVC backend for the PERISCOPE paper repo and push large files that way. The nice thing about DVC is obviously that it versions the files. The downside I see is that, no matter how well we document our use of DVC in a repo, accessing the files from DVC and understanding that the `.dvc` files in the repo are just markers seems to be a barrier to new users that we have to explain repeatedly. So that is why I was thinking of not using DVC for the publication repo but instead storing the files "directly" in cpg.
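A minimal sketch (hypothetical filename and hash) of why the `.dvc` stubs trip up newcomers: the file checked into the repo is only a small pointer, while the payload lives content-addressed in the DVC remote. The exact cache layout varies by DVC version; the split shown here is one common scheme:

```python
# Hypothetical .dvc pointer file contents; the hash and path are made up.
# The repo holds only this stub -- the real data sits in the S3 remote.
import re

pointer = """\
outs:
- md5: 3f2b9c0d1e4a5b6c7d8e9f0a1b2c3d4e
  size: 104857600
  path: figures/figure2_source_data.csv
"""

md5 = re.search(r"md5: (\w+)", pointer).group(1)
path = re.search(r"path: (\S+)", pointer).group(1)

# DVC stores the payload under a content-addressed key in the remote,
# e.g. first two hex chars as a directory (layout varies by DVC version):
remote_key = f"{md5[:2]}/{md5[2:]}"
print(path, "->", remote_key)
```

In other words, cloning the repo yields only these stubs; users must install DVC and run `dvc pull` before the actual files appear, which is exactly the extra step that has to be explained repeatedly.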
I certainly don't think we should require publication data to go to cpg. But I guess that leaves my question for you at: given the above, would you be okay with PERISCOPE publication data going into a new `workspace` folder? Or would it be your strong preference that we follow the format of the LINCS and JUMP-adipocyte pilots and add it as DVC files to `workspace/software`?
> I was thinking it was preferable to put everything we can in cpg instead of having elements distributed across tools (like figshare). Can you explain why you think figshare is a better option in general? I was only thinking of one-to-many mappings of publications to one dataset - I agree the above doesn't fit nicely for many datasets to one publication. Are you saying that Rosetta in general doesn't fit in cpg and that the whole project should have gone to figshare?
Thanks for bringing this up.
My thinking so far was that CPG should be as standardized as possible, and especially when it comes to data components that belong to publications, they should live elsewhere. Figshare was built precisely for this ("... where researchers can preserve and share their research outputs, including figures, datasets, images, and videos"). In fact, for our upcoming SCZ/Mito paper, we planned to deposit the files (including images) in figshare, not in CPG. This is because the dataset is too different from a typical CPG dataset (it is not Cell Painting, not in an arrayed format, etc.) and is small enough that it can be deposited in figshare. In retrospect, I'd have done the same with Rosetta, unless it was the case that there was a substantial benefit in saving the contents of the zip file individually.
Now having said all this, my perspective is shifting after previewing Quilt.
For example, with Rosetta, it's awesome that we can easily browse it because it lives on CPG! https://open.quiltdata.com/b/cellpainting-gallery/tree/cpg0003-rosetta/broad/workspace/preprocessed_data/CDRP-BBBC047-Bray/CellPainting/replicate_level_cp_augmented.csv.gz
Further, if we think in terms of Quilt packages in the future, we can afford to be more flexible about standardizing CPG's contents.
> The current use case I'm thinking of is PERISCOPE, and it would certainly work to use `workspace/software/` as a DVC backend for the PERISCOPE paper repo and push large files that way. The nice thing about DVC is obviously that it versions the files. The downside I see is that, no matter how well we document our use of DVC in a repo, accessing the files from DVC and understanding that the `.dvc` files in the repo are just markers seems to be a barrier to new users that we have to explain repeatedly. So that is why I was thinking of not using DVC for the publication repo but instead storing the files "directly" in cpg.
Agree about DVC. We use DVC for two reasons:
Quilt can do #2, but it sounds like it might allow us to do #1 as well, and in that case I am ok with piloting the idea of abandoning DVC in favor of directly storing data in S3 and leaving the versioning to Quilt.
> I certainly don't think we should require publication data to go to cpg. But I guess that leaves my question for you at: given the above, would you be okay with PERISCOPE publication data going into a new `workspace` folder? Or would it be your strong preference that we follow the format of the LINCS and JUMP-adipocyte pilots and add it as DVC files to `workspace/software`?
I am good with PERISCOPE publication data going into a new `workspace` folder because it will be a great way to pilot a new way of versioning data together with Quilt.
Thanks Shantanu. I'll leave this issue open as it seems that we have several datasets in different pilot modes and we can track the progress/good/bad here, but I'll move forward with piloting PERISCOPE publication data in a new workspace folder for use with Quilt.
No action is needed from you @ErinWeisbart; this is just FYI
> I'll move forward with piloting PERISCOPE publication data in a new workspace folder for use with Quilt.
For our notes: you've done this here
We plan to do the same thing for upcoming papers
I will discuss this in separate threads and link back here.
Can cpg act as LFS for files generated in the publication of a dataset that are too large to be held in the publication repository itself? If so, what structure would we want?
I propose that yes, we allow (but do not require) cpg to host publication-associated files that are too large to fit directly into a publication repo. Compared to the size of the rest of any dataset, publication-associated files are unlikely to cause any noticeable change in total size. For any publication, I would expect we would include download instructions for accessing the raw data in cpg anyway, so hosting publication-associated files has the major benefit of simplifying access to those files as well.
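If publication files are stored directly in cpg, the download instructions can reduce to a plain HTTPS URL (or an anonymous `aws s3 cp`), assuming the bucket permits anonymous reads, as its public index page suggests. A sketch, using the Rosetta object path quoted earlier in this thread as the example key:

```python
# Sketch: building a direct download URL for a public cpg object.
# The key below is the Rosetta file referenced earlier in this thread.
BUCKET = "cellpainting-gallery"
key = ("cpg0003-rosetta/broad/workspace/preprocessed_data/"
       "CDRP-BBBC047-Bray/CellPainting/replicate_level_cp_augmented.csv.gz")

url = f"https://{BUCKET}.s3.amazonaws.com/{key}"
print(url)
# The same object can also be fetched with the AWS CLI without credentials:
#   aws s3 cp --no-sign-request s3://cellpainting-gallery/<key> .
```

This is the kind of one-line instruction a publication README could carry, versus the multi-step DVC setup described above.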
I suggest a folder structure of:
@shntnu thoughts?