broadinstitute / cellpainting-gallery

Cell Painting Gallery
https://broadinstitute.github.io/cellpainting-gallery/
MIT License

2022_09_DD_DeepProfiler (cpg0019) #20

Closed shntnu closed 12 months ago

shntnu commented 1 year ago

Segmentation / feature extraction is being performed by (Cimini lab / Carpenter-Singh lab)
Profile creation is being performed by (Cimini lab / Carpenter-Singh lab)
Data can be public in RODA immediately

Update as generated:
[Link to profile repo]
https://doi.org/10.1101/2022.08.12.503783 cpg0019-moshkov-deepprofiler

Transfer to CellPainting Gallery:

If data is being published, prepare for publication: These are only training images (crops)

Once published:

shntnu commented 1 year ago
cellpainting-gallery
└── cpg0019-moshkov-deepprofiler
    └── broad
        ├── training_images
        │   ├── TAORF
        │   │   └── images
        │   │       └── <plate-id>
        │   │           └── <well>
        │   │               └── <site>
        │   │                   ├── <cell1>
        │   │                   ├── <cell2>
        │   │                   ├── ...
        │   │                   └── <celln>
        │   ├── BBBC022
        │   ├── LUAD
        │   ├── CDRP
        │   └── LINCS
        └── workspace
Arkkienkeli commented 1 year ago

Hi Shantanu, I have this folder structure now, any concerns or suggestions?

cpg0019-moshkov-deepprofiler
└── broad
    ├── training_images
    │   ├── BBBC022
    │   │   └── A01
    │   │       └── 1
    │   │           └── *.png
    │   ├── BBBC036
    │   ├── BBBC037
    │   ├── BBBC043
    │   └── LINCS
    └── workspace_dl
        ├── collated
        │   └── 105281_zenodo7114558
        │       ├── BBBC022
        │       │   ├── notspherized.csv
        │       │   └── spherized.csv
        │       ├── BBBC036
        │       │   ├── notspherized.csv
        │       │   └── spherized.csv
        │       └── BBBC037
        │           ├── notspherized.csv
        │           └── spherized.csv
        ├── consensus
        │   └── 105281_zenodo7114558
        │       ├── BBBC022
        │       │   ├── notspherized.csv
        │       │   └── spherized.csv
        │       ├── BBBC036
        │       │   ├── notspherized.csv
        │       │   └── spherized.csv
        │       └── BBBC037
        │           ├── notspherized.csv
        │           └── spherized.csv
        ├── embeddings
        │   └── 105281_zenodo7114558
        │       ├── BBBC022
        │       │   └── 20585
        │       │       └── A01
        │       │           └── 1
        │       │               └── embedding.npz
        │       ├── BBBC036
        │       └── BBBC037
        └── metadata
            ├── BBBC022_profiling.csv
            ├── BBBC036_profiling.csv
            ├── BBBC037_profiling.csv
            └── sc-metadata.csv
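
For anyone scripting against this layout, here is a minimal sketch of how the paths in the tree above could be assembled. The function names and the `model_dir` parameter are hypothetical (it is an assumption that the `105281_zenodo7114558` folder identifies the model or feature version); only the folder and file names come from the tree itself.

```python
from pathlib import PurePosixPath

# Prefix copied from the tree above; everything else is supplied by the caller.
ROOT = PurePosixPath("cpg0019-moshkov-deepprofiler/broad")


def embedding_path(model_dir: str, dataset: str, plate: str, well: str, site: str) -> PurePosixPath:
    """Per-site embeddings: embeddings/<model_dir>/<dataset>/<plate>/<well>/<site>/embedding.npz."""
    return (ROOT / "workspace_dl" / "embeddings" / model_dir / dataset
            / plate / well / site / "embedding.npz")


def profile_path(level: str, model_dir: str, dataset: str, spherized: bool) -> PurePosixPath:
    """Concatenated profiles stored under the 'collated' or 'consensus' folder."""
    if level not in ("collated", "consensus"):
        raise ValueError(f"unknown profile level: {level!r}")
    name = "spherized.csv" if spherized else "notspherized.csv"
    return ROOT / "workspace_dl" / level / model_dir / dataset / name


if __name__ == "__main__":
    print(embedding_path("105281_zenodo7114558", "BBBC022", "20585", "A01", "1"))
    print(profile_path("consensus", "105281_zenodo7114558", "BBBC036", spherized=True))
```

`PurePosixPath` is used so the result can serve as an object-store key prefix as well as a local relative path; nothing here touches the filesystem or the bucket.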
shntnu commented 1 year ago

Looks great @Arkkienkeli!

I've modified https://github.com/broadinstitute/cellpainting-gallery/commit/c1412d605870350f512d093e229f39d0cd20c82a to reflect this

Arkkienkeli commented 1 year ago

Hi @shntnu and @ErinWeisbart, is the dataset available in the gallery?

shntnu commented 1 year ago

It is 🎉

shntnu commented 1 year ago

@Arkkienkeli Are you happy with this one-liner to summarize cpg0019?

8.3 million single cells from 232 plates, across 488 treatments from 5 public datasets, used for learning representations

Feel free to edit #27 if not.

shntnu commented 1 year ago

I'm adding our email logs here for our records

Forwarded Conversation
Subject: Posting dataset in the AWS Cell Painting Gallery
------------------------
From: Juan Caicedo

Hi Shantanu,

We'd like to make the combined dataset of single cells that Nikita created for training publicly available as part of the materials that will support the submission of the DeepProfiler paper. Can you guide us on how to do this? The dataset takes about 200GB of space. If this is not the best resource for the dataset, do you have any other suggestions for making it public?

Thank you!
Juan C.

----------
From: Shantanu Singh

Hi Juan

The gallery sounds like the right place to store this information (I'd imagine you also want to store the corresponding images, and not just the single cell data, correct?)

We have a process for doing this https://github.com/broadinstitute/cellpainting-gallery#contributing-to-cell-painting-gallery

I have got us started here https://github.com/broadinstitute/cellpainting-gallery/issues/20

It's worth your skimming the folder structure https://github.com/broadinstitute/cellpainting-gallery/blob/main/folder_structure.md to see how things are organized

In your case, we will need to skip a bunch of folders (I'd imagine), but we can figure that out later

The first thing to figure out is: where in the structure do we store embeddings? This will be the first such dataset, so it will be good to think this through, and your inputs would be great.

This is what I proposed https://github.com/broadinstitute/cellpainting-gallery/pull/19/files after chatting with Mike Ando, who will also be producing embeddings (for JUMP).

Let me know what you think (either in the PR or here)

Once we settle on the structure (for the embeddings), we can tackle the next steps

I'm cc'ing Erin to keep her in the loop

-Shantanu

----------
From: Juan Caicedo

Hi Shantanu,

Thank you so much for getting this started and for all the instructions to proceed!

Just to clarify, this dataset is not useful for biological analysis, this is only a training resource. So we don't plan on releasing embeddings, and we don't plan on releasing the original full images. We only want to make the single cell images and their metadata available for future use in machine learning algorithms. How these single cells were obtained is something that we will document, so we can point to the original sources and list the treatments (wells and plates) that we sampled. Does this make sense?

Regarding the embeddings, we are happy to share the features processed with our technique for existing datasets (e.g. TAORF, CDRP, BBBC022). Can we append these features to existing datasets?

Nikita will follow the instructions to make the dataset public during the next few days. Nikita, please let us know if you have any questions!

Thank you!
Juan C.

----------
From: Shantanu Singh

Hi Juan,

Thanks for the clarifications

All this makes sense. It would be great if Nikita could ponder an appropriate folder structure for sharing the data, keeping the current structure in view. The structure is essential because it will set a precedent for future datasets of this nature.

Regarding the other embeddings for existing data – that would be fantastic! Nikita, can you organize it in the proposed folder structure for each dataset? https://github.com/broadinstitute/cellpainting-gallery/pull/19/files (or proposed changes to the structure)

S

----------
From: Moshkov Nikita

Hi Shantanu,

Most of the folders in the documented structure don't seem to be needed for this dataset. The images are stored in the following structure: Source dataset -> Plate ID - Well - Site. From the workspace directory, we only need one for metadata. Outlines are already part of the images. The example folder structure is in the image attached. I believe that training resources should have a simple folder structure.

Regarding the embeddings: we have the embeddings for BBBC022, CDRP and TA-ORF.
Should we just put the embeddings in the dataset folders? BBBC022 does not seem to be in the gallery. Do we want to share only the embeddings extracted with the Cell Painting CNN model or with other models too? Note that for the extraction of embeddings we used slightly different metadata (in short, it means that we did not extract embedding from all images, but only the ones which passed out QC). Those are npz files if it matters.

Thank you!

image.png

----------
From: Juan Caicedo

Hi Nikita,

Great that you are looking into this! I agree that the folder structure for the single cell images may be different. We are following the way other machine learning datasets are organized to help researchers in the field use it out of the box, and lower the barrier of entry.

On the embeddings side, I think we only need to make the Cell Painting CNN embeddings public, and we should release all levels of profiling (from single cells to sphered well-level and aggregated treatment-level).

Best,
Juan C.

----------
From: Shantanu Singh

Sounds good

Let's go with what you recommended, just that we should call the `images` folder something else – let's go with `training_images`? https://github.com/broadinstitute/cellpainting-gallery/issues/20#issuecomment-1249864282

> On the embeddings side, I think we only need to make the Cell Painting CNN embeddings public, and we should release all levels of profiling (from single cells to sphered well-level and aggregated treatment-level).

Fantastic! Are we good with the structure proposed here for that https://github.com/broadinstitute/cellpainting-gallery/pull/19/files? (and use npz files instead of parquet)

> BBBC022 does not seem to be in the gallery.

That's right but we can get that ready while you are preparing the data

So in summary, you will have just 3 folders:
- training_images
- workspace/profiles
- workspace/metadata

Please LMK if you have any questions.

----------
From: Moshkov Nikita

Hi Shantanu, Rebecca, Erin, Juan,

I have put together the dataset for publishing. Shantanu and Juan, I guess it makes more sense to put our embeddings to a separate folder instead of putting them to dataset folders OR make a single folder for DeepProfiler paper (similarly to M.Rohban's heterogeneity paper) and put everything there.

Currently, the folder structure is the following:

```
cellpainting-gallery
└── Broad-CP-TrainingSet2022
    └── broad
        ├── training_images
        │   ├── TAORF (same for other datasets)
        │   │   └── Plate Id
        │   │       └── Well
        │   │           └── Site
        │   ├── BBBC022
        │   ├── LUAD
        │   ├── LINCS
        │   └── CDRP
        └── workspace
            └── metadata
                └── sc-metadata.csv
```

Metadata file is adjusted to have a relative path to images in this folder structure. If no concerns, the dataset is ready to be uploaded and now is available on DGX in the folder: /raid/data/cellpainting/Broad-CP-TrainingSet2022/

I can prepare the embeddings to be uploaded either as a separate folder or as a part of a single folder later.

Thank you!

----------
From: Shantanu Singh

> make a single folder for DeepProfiler paper (similarly to M.Rohban's heterogeneity paper) and put everything there.

I like this idea

Does this structure work for you? If not, please feel free to propose alternatives https://github.com/broadinstitute/cellpainting-gallery/pull/19/files

----------
From: Moshkov Nikita

Hi Shantanu,

thank you for looking into this! I have added some comments and questions to the PR you shared: https://github.com/broadinstitute/cellpainting-gallery/pull/19/files

We have slightly different structure in DeepProfiler: /Plate/Well/Site.npz

Maybe we could unify this together?

Thank you!

----------
From: Shantanu Singh

Thanks, Nikita. I have responded; have a look.

> /Plate/Well/Site.npz

Would this modification work – plate/well/site/embedding.npz?
in favor of encoding the structure in the folder vs in the file

Here's what that would look like

```
└── embeddings
    ├── 2021_04_26_Batch1
    │   ├── BR00117035
    │   │   └── efficientnet_v2_imagenet1k_s_feature_vector_2_ec756ff
    │   │       ├── A01
    │   │       │   └── 1
    │   │       │       └── embedding.npz
    │   │       └── A02
    │   └── BR00117036
    └── 2021_05_31_Batch2
```

----------
From: Shantanu Singh

Nikita – I've made a bunch of changes to align with DeepProfiler output https://github.com/broadinstitute/cellpainting-gallery/blob/embeddings/folder_structure.md#embeddings-folder-structure

If this looks good, I'll go ahead and merge

----------
From: Moshkov Nikita

Hi Shantanu,

great, thank you! I am going to make some adjustments and then share the folder structure with you. We would like to put well-level and treatment-level profiles in the analysis folder, is it ok if we put just full CSV files without splitting those by plate?

Thank you!

----------
From: Shantanu Singh

> We would like to put well-level and treatment-level profiles in the analysis folder, is it ok if we put just full CSV files without splitting those by plate?

As such
- Well-level profiles should go here https://github.com/broadinstitute/cellpainting-gallery/blob/embeddings/folder_structure.md#profiles-folder-structure
- We don't have a location for treatment-level profiles (it's something we just do on the fly)
- We’d certainly want to split by plate

But I'm open to suggestions

----------
From: Moshkov Nikita

Hi Shantanu,

> - We’d certainly want to split by plate

Will do.

> - We don't have a location for treatment-level profiles (it's something we just do on the fly)

Could it work if I create a folder on the same level with batches named "full_profiles" and put there concatenated well-level and treatment-level profiles?

Thank you!

----------
From: Shantanu Singh

Hi Nikita,

Turns out we did decide on a folder structure for those concatenated well-level and treatment-level profiles

From https://github.com/cytomining/profiling-handbook/issues/54#issue-610880499
- consensus (treatment-level)
- collated (well-level)

There is a single file per batch because it assumes all replicates are in the same batch, but I think it is wise to skip the batch structure and have a single file directly under that folder without any further nesting.

However, taking a step back, I realized it's wisest to split off DL-generated features into a different workspace folder. It's going to get too confusing to have DL-derived data components intermingle with CellProfiler-derived data components. Further, there will likely be several different versions of DL-generated features (vs. CellProfiler, which is relatively stable) – and so the nesting structure should more conveniently allow for this.

Here's what I came up with, hopefully making it easier for you https://github.com/broadinstitute/cellpainting-gallery/blob/1f999572b7b40f8702a71684de4d145ff2c50674/folder_structure.md#workspace_dl-folder-structure

Diff https://github.com/broadinstitute/cellpainting-gallery/commit/1f999572b7b40f8702a71684de4d145ff2c50674

Aside: We should make our best effort to create a sensible folder structure, but ultimately, the rigidity of folder structures will end up being too constraining, and we may (later) have to rely on configuration files that specify what's where. Just giving you a heads-up that this might happen in the future, but nothing for you to do right now.

Shantanu

----------
From: Moshkov Nikita

Hi Shantanu,

> Turns out we did decide on a folder structure for those concatenated well-level and treatment-level profiles
>
> From https://github.com/cytomining/profiling-handbook/issues/54#issue-610880499
> - consensus (treatment-level)
> - collated (well-level)
>
> There is a single file per batch because it assumes all replicates are in the same batch, but I think it is wise to skip the batch structure and have a single file directly under that folder without any further nesting.

Got it. Thank you! I have reorganized the folder (please see it in the related issue: https://github.com/broadinstitute/cellpainting-gallery/issues/20#issuecomment-1285787463). I did not add the additional notebook for reading the features, not sure if it is needed (the folder structure differs a little from DeepProfiler's output).

FYI: LUAD was renamed to BBBC043, though it is not public yet: https://github.com/broadinstitute/imaging-bbbc/issues/52

Thank you!
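
The difference between DeepProfiler's `/Plate/Well/Site.npz` output and the `plate/well/site/embedding.npz` layout discussed in the emails above is a mechanical rename. A minimal sketch of that mapping follows; the function name `to_gallery_layout` is hypothetical and this is not part of DeepProfiler or the gallery tooling.

```python
from pathlib import PurePosixPath


def to_gallery_layout(deepprofiler_path: str) -> str:
    """Map '<plate>/<well>/<site>.npz' to '<plate>/<well>/<site>/embedding.npz'."""
    path = PurePosixPath(deepprofiler_path)
    if path.suffix != ".npz" or len(path.parts) < 3:
        raise ValueError(f"expected <plate>/<well>/<site>.npz, got {deepprofiler_path!r}")
    # The site moves out of the file name and becomes a folder of its own.
    return str(path.with_suffix("") / "embedding.npz")


if __name__ == "__main__":
    print(to_gallery_layout("20585/A01/1.npz"))  # prints 20585/A01/1/embedding.npz
```

The example plate, well, and site values are taken from the folder tree posted earlier in this issue.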