broadinstitute / pooled-cell-painting-profiling-recipe

:woman_cook: Recipe repository for image-based profiling of Pooled Cell Painting experiments
BSD 3-Clause "New" or "Revised" License

Adding single cell processing script #8

Closed gwaybio closed 4 years ago

gwaybio commented 4 years ago

Changes

This PR adds the first step in the 1.generate-profiles module, called 0.merge-single-cells.py. This file will build and save single cell profiles per site. I also include an option to concatenate all single cell profiles from all sites into a single file.
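As a rough illustration of the concatenation option described above, here is a minimal standard-library sketch. It is not the code in 0.merge-single-cells.py: the function name `concatenate_site_profiles`, the CSV format, and the assumption that every site file shares the same header are all hypothetical.

```python
import csv
from pathlib import Path


def concatenate_site_profiles(site_files, output_file):
    """Stack per-site single cell CSV files into one CSV (hypothetical helper).

    Assumes every site file has the same header row.
    Returns the number of single cell rows written.
    """
    n_rows = 0
    header = None
    with open(output_file, "w", newline="") as out_handle:
        writer = csv.writer(out_handle)
        # Sort so the output order does not depend on how files were listed
        for site_file in sorted(Path(f) for f in site_files):
            with open(site_file, newline="") as in_handle:
                reader = csv.reader(in_handle)
                site_header = next(reader, None)
                if site_header is None:
                    continue  # skip empty site files
                if header is None:
                    # The first non-empty file defines the expected columns
                    header = site_header
                    writer.writerow(header)
                elif site_header != header:
                    raise ValueError(f"Column mismatch in {site_file}")
                for row in reader:
                    writer.writerow(row)
                    n_rows += 1
    return n_rows
```

Failing on a column mismatch, rather than silently padding, is one way to surface sites that were processed with a different feature set.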

I also add an example profiling configuration yaml file, and I modify the configuration file processing to build more specific paths needed for the single cell processing.
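To make the idea of building more specific paths from the configuration concrete, here is a small sketch. The config keys (`output_basedir`, `batch`), the file naming pattern, and the function name are assumptions for illustration only and are not taken from the example yaml file in this PR.

```python
from pathlib import Path


def build_single_cell_paths(config, sites):
    """Derive per-site output paths from a loaded config (hypothetical keys)."""
    output_dir = Path(config["output_basedir"]) / config["batch"]
    # One output file per site, nested under a site-specific folder
    site_paths = {
        site: output_dir / site / f"{site}_single_cell.csv.gz" for site in sites
    }
    # Optional file holding all sites concatenated together
    merged_path = output_dir / f"{config['batch']}_single_cell_merged.csv.gz"
    return {"sites": site_paths, "merged": merged_path}


# Stand-in for the parsed yaml file; these keys and values are assumptions
example_config = {"output_basedir": "data/single_cell", "batch": "example_batch"}
paths = build_single_cell_paths(example_config, ["site_1", "site_2"])
```

Building all paths in one place means the downstream scripts only read from the returned dictionary and never assemble file names themselves.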

How this differs from the previous approach

In the previous profiling pipeline, I performed this step alongside aggregation (see here). I decided to split the single cell processing out from the aggregation because it makes repo maintenance easier. Splitting up this step also separates single cell preprocessing from the more streamlined image-based profiling pipeline. Previously, I wasn't outputting single cell profiles either. 👈 That was the primary reason for combining the single cell processing with the aggregation in the first place.

It is possible that in the future we extract this logic further into an external package, but for now this is a good line to draw between project agility and software best practices.

Review Notes

@ErinWeisbart we are only going to get further away from the image analysis pipeline from here on out. Please do scrutinize these scripts when you get a chance. Am I making any assumptions that I'm unaware of? Is there a better way to do any of the things I am proposing? This PR can stay open until you get a chance to take a closer look. Keep in mind, though, that we can make more changes after merging.

Example Data Stats

I've run a subset of CP074A through this pipeline so far. (side note - Maybe I should be running CP151 as an example? 🤔 @ErinWeisbart is this more trouble than it's worth?)

Number of Sites: 80
Site-Specific Single Cell File Sizes: roughly between 900 KB and 13 MB (on average ~5 MB)
All Single Cells Concatenated in One File: 334 MB

This is all within range of git LFS

ErinWeisbart commented 4 years ago

@gwaygenomics as far as sample data goes, it seems good to stick with CP074B since we have already run it with our old workflow and can directly compare to previous results to ensure there is no unexpected behavior in this workflow. If you want to see how it performs on previously untouched data, CP151A1 or B1 would be the best batch to start with. (Wait on CP151A2 and B2, though they're a good next test set as we're not sure how robust the profiles will be coming out of an imperfect dataset.)

ErinWeisbart commented 4 years ago

There are a few different points where there is the option to overwrite errors with --force. That's not coming from the config file, correct? Shouldn't those be extracted to the config file?

gwaybio commented 4 years ago

> There are a few different points where there is the option to overwrite errors with --force. That's not coming from the config file, correct? Shouldn't those be extracted to the config file?

This is a great point - this is an issue we must deal with when we move closer to production. I've moved it over to https://github.com/broadinstitute/pooled-cell-painting-profiling-recipe/issues/10#issuecomment-632244564
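One possible shape for the config-driven behavior raised above is sketched below: the command line flag wins when it is given, and a config value is used otherwise. This is only an illustration, not what was decided in the linked issue. The `force_overwrite` config key and the `resolve_force` helper are hypothetical names.

```python
import argparse


def resolve_force(config, argv=None):
    """Decide whether to overwrite existing output (hypothetical helper).

    The --force flag takes priority when present. Otherwise fall back to an
    assumed `force_overwrite` entry in the config, defaulting to False.
    """
    parser = argparse.ArgumentParser(description="Single cell processing sketch")
    # default=None lets us tell "flag not given" apart from an explicit choice
    parser.add_argument(
        "--force",
        action="store_true",
        default=None,
        help="overwrite existing files",
    )
    args = parser.parse_args(argv)
    if args.force is not None:
        return args.force
    return bool(config.get("force_overwrite", False))


# The flag is absent, so the (assumed) config value is used
from_config = resolve_force({"force_overwrite": True}, [])
# The flag is present, so it overrides the config
from_flag = resolve_force({"force_overwrite": False}, ["--force"])
```

Keeping the flag as an override preserves quick reruns from the command line while making the default behavior reproducible from the config file.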

gwaybio commented 4 years ago

@ErinWeisbart - I made all of the recommended changes. These suggestions are great! The PR is ready for your re-review (in general, let's not merge unless the magic green approved button is pressed by either one of us).

Do note that I moved project to the master config section and renamed it project_tag. This will likely have a small impact on your documentation efforts :)