broadinstitute / cellpainting-gallery

Cell Painting Gallery
https://broadinstitute.github.io/cellpainting-gallery/
MIT License

Structure for publication-associated data? #31

Open ErinWeisbart opened 1 year ago

ErinWeisbart commented 1 year ago

Can cpg act as LFS for files generated in the publication of a dataset that are too large to be held in the publication repository itself? If so, what structure would we want?

I propose that yes, we allow (but not require) cpg to host publication-associated files that are too large to fit directly into a publication repo. Compared with the size of the rest of any dataset, publication-associated files are unlikely to change the total size noticeably. For any publication, I would expect us to include instructions for downloading the raw data from cpg anyway, so hosting publication-associated files there also has the major benefit of simplifying access to them.

I suggest a folder structure of:

└── workspace
    └── publication_data
        └── YEAR_FIRSTAUTHOR
            ├── large_file_example1.csv.gz
            ├── large_file_example2.csv.gz
            └── large_file_example3.csv.gz
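
As a rough illustration of the proposed layout, the sketch below builds the object key for one publication-associated file. The function name, the dataset name `cpg9999-example`, and the author `Doe` are hypothetical. The leading `<dataset>/<source>/` prefix is an assumption based on the gallery URLs quoted elsewhere in this thread, not part of the proposal itself.

```python
from posixpath import join


def publication_data_key(dataset: str, source: str, year: int,
                         first_author: str, filename: str) -> str:
    """Build an object key following the proposed folder structure.

    Assumed layout (hypothetical helper, not an official cpg tool):
    <dataset>/<source>/workspace/publication_data/<YEAR>_<FIRSTAUTHOR>/<filename>
    """
    folder = f"{year}_{first_author}"
    return join(dataset, source, "workspace", "publication_data", folder, filename)


if __name__ == "__main__":
    # Hypothetical dataset, author, and file name, for illustration only.
    print(publication_data_key("cpg9999-example", "broad", 2024, "Doe",
                               "large_file_example1.csv.gz"))
```

Keeping the year and first author in a single folder name means each publication gets exactly one folder under `publication_data`, which matches the one-folder-per-publication idea in the tree above.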

@shntnu thoughts?

shntnu commented 1 year ago

I piloted this idea with https://cellpainting-gallery.s3.amazonaws.com/index.html#cpg0003-rosetta/broad/workspace/preprocessed_data/ (see https://www.nature.com/articles/s41592-022-01667-0#data-availability) but in retrospect, figshare would have been a better option for this use case.

My main concern is that publications and datasets may not align nicely. A single publication might have more than one dataset and vice versa and things can get confusing.

What do you think of using figshare instead? We did that with https://nih.figshare.com/articles/dataset/Cell_Health_-_Cell_Painting_Single_Cell_Profiles/9995672/1 (https://www.molbiolcell.org/doi/full/10.1091/mbc.E20-12-0784)

I am not terribly opposed to your proposal though and it's perfectly fine to pilot the idea and see where it goes.


Side note: we are piloting the use of cpg as a DVC backend, but your proposal should generally align well with that.

ErinWeisbart commented 1 year ago

I was thinking it was preferable to put everything we can in cpg instead of having elements distributed across tools (like figshare). Can you explain why you think figshare is a better option in general? I was only thinking of mappings of one or more publications to one dataset - I agree the above doesn't fit nicely for many datasets to one publication. Are you saying that Rosetta in general doesn't fit in cpg and that the whole project should have gone to figshare?

The current use case I'm thinking of is PERISCOPE and it would certainly work to use workspace/software/ as a DVC backend for the PERISCOPE paper repo and push large files that way. The nice thing about DVC is obviously that it versions the files. The downside I see is that, no matter how well we document our use of DVC in a repo, accessing the files through DVC and understanding that the .dvc files in the repo are just markers seems to be a barrier for new users, one that we have to explain repeatedly. So that is why I was thinking of not using DVC for the publication repo but instead storing the files "directly" in cpg.

I certainly don't think we should require publication data to go to cpg. But I guess that leaves my question for you: given the above, would you be okay with PERISCOPE publication data going into a new workspace folder? Or would it be your strong preference that we follow the format of the LINCS and JUMP-adipocyte pilots and add it as DVC files to workspace/software?

shntnu commented 1 year ago

> I was thinking it was preferable to put everything we can in cpg instead of having elements distributed across tools (like figshare). Can you explain why you think figshare is a better option in general? I was only thinking of mappings of one or more publications to one dataset - I agree the above doesn't fit nicely for many datasets to one publication. Are you saying that Rosetta in general doesn't fit in cpg and that the whole project should have gone to figshare?

Thanks for bringing this up.

My thinking so far was that CPG should be as standardized as possible, and that data components belonging to publications, in particular, should live elsewhere. Figshare was built precisely for this ("... where researchers can preserve and share their research outputs, including figures, datasets, images, and videos"). In fact, for our upcoming SCZ/Mito paper, we planned to deposit the files (including images) in figshare, not in CPG. This is because the dataset is too different from a typical CPG dataset (it is not Cell Painting, not in an arrayed format, etc.) and is small enough that it can be deposited in figshare. In retrospect, I'd have done the same with Rosetta, unless there was a substantial benefit in saving the contents of the zip file individually.

Now having said all this, my perspective is shifting after previewing Quilt.

For example, with Rosetta, it's awesome that we can easily browse it because it lives on CPG! https://open.quiltdata.com/b/cellpainting-gallery/tree/cpg0003-rosetta/broad/workspace/preprocessed_data/CDRP-BBBC047-Bray/CellPainting/replicate_level_cp_augmented.csv.gz

Further, if we think in terms of Quilt packages in the future, we can afford to be more flexible about standardizing CPG's contents.

> The current use case I'm thinking of is PERISCOPE and it would certainly work to use workspace/software/ as a DVC backend for the PERISCOPE paper repo and push large files that way. The nice thing about DVC is obviously that it versions the files. The downside I see is that, no matter how well we document our use of DVC in a repo, accessing the files through DVC and understanding that the .dvc files in the repo are just markers seems to be a barrier for new users, one that we have to explain repeatedly. So that is why I was thinking of not using DVC for the publication repo but instead storing the files "directly" in cpg.

Agree about DVC. We use DVC for two reasons:

  1. we can tie data to code
  2. we can version data

Quilt can do reason 2, and it sounds like it might also allow us to do reason 1. In that case, I am ok with piloting the idea of abandoning DVC in favor of storing data directly in S3 and leaving the versioning to Quilt.
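
The two reasons for using DVC listed above can be illustrated without either tool. The sketch below is a generic content-hash manifest written with the Python standard library only. It is not DVC's or Quilt's actual file format or API, and every name in it (`build_manifest`, `changed_files`, the `code_commit` field) is hypothetical. It records a hash per data file (versioning data) next to the identifier of the code commit that produced the files (tying data to code).

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path, code_commit: str) -> dict:
    """Record one hash per file, plus the code commit the data belongs to."""
    files = {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }
    return {"code_commit": code_commit, "files": files}


def changed_files(old: dict, new: dict) -> list:
    """List paths that were added, removed, or modified between two manifests."""
    paths = set(old["files"]) | set(new["files"])
    return sorted(p for p in paths if old["files"].get(p) != new["files"].get(p))


if __name__ == "__main__":
    # Demo on a throwaway directory; the file name and commit ids are made up.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "profiles.csv").write_text("a,b\n1,2\n")
        v1 = build_manifest(root, code_commit="commit-1")
        (root / "profiles.csv").write_text("a,b\n1,3\n")
        v2 = build_manifest(root, code_commit="commit-2")
        print(json.dumps(v2, indent=2))
        print(changed_files(v1, v2))
```

In a setup like this the manifest is small enough to commit alongside the code, while the data files themselves stay in object storage.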

> I certainly don't think we should require publication data to go to cpg. But I guess that leaves my question for you: given the above, would you be okay with PERISCOPE publication data going into a new workspace folder? Or would it be your strong preference that we follow the format of the LINCS and JUMP-adipocyte pilots and add it as DVC files to workspace/software?

I am good with PERISCOPE publication data going into a new workspace folder because it will be a great way to pilot a new way of versioning data together with Quilt.

ErinWeisbart commented 1 year ago

Thanks Shantanu. I'll leave this issue open as it seems that we have several datasets in different pilot modes and we can track the progress/good/bad here, but I'll move forward with piloting PERISCOPE publication data in a new workspace folder for use with Quilt.

shntnu commented 2 months ago

No action is needed from you @ErinWeisbart; this is just FYI


> I'll move forward with piloting PERISCOPE publication data in a new workspace folder for use with Quilt.

For our notes: you've done this here

https://cellpainting-gallery.s3.amazonaws.com/index.html#cpg0021-periscope/broad/workspace/publication_data/2022_PERISCOPE/
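
For anyone scripting downloads, here is a minimal sketch that turns a gallery browsing link like the one above into a direct object URL. It assumes the part after `#` is the S3 key prefix and that objects are served from the same host under that key. The function name and the file name `example.csv.gz` are hypothetical; the real file names need to be looked up in the folder listing.

```python
from urllib.parse import quote, urlsplit


def index_url_to_object_url(index_url: str, filename: str) -> str:
    """Convert an index.html#<prefix>/ browsing link into a direct object URL.

    Assumption: the URL fragment is the key prefix inside the bucket, and the
    bucket serves objects at https://<host>/<key>.
    """
    parts = urlsplit(index_url)
    prefix = parts.fragment.strip("/")
    key = f"{prefix}/{filename}" if prefix else filename
    # quote() keeps "/" intact and percent-encodes characters unsafe in a URL path.
    return f"{parts.scheme}://{parts.netloc}/{quote(key)}"


if __name__ == "__main__":
    listing = ("https://cellpainting-gallery.s3.amazonaws.com/index.html"
               "#cpg0021-periscope/broad/workspace/publication_data/2022_PERISCOPE/")
    # "example.csv.gz" is a placeholder, not a file known to exist in the folder.
    print(index_url_to_object_url(listing, "example.csv.gz"))
```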

We plan to do the same thing for upcoming papers

I will discuss this in separate threads and link back here.