broadinstitute / pooled-cell-painting-profiling-recipe

:woman_cook: Recipe repository for image-based profiling of Pooled Cell Painting experiments
BSD 3-Clause "New" or "Revised" License
6 stars 4 forks source link

enhancement/fix needed in saturated_sites.csv #41

Closed gwaybio closed 4 years ago

gwaybio commented 4 years ago

I am going through 4.image-and-segmentation.py in much more detail in #39 - the following code block is giving me some trouble:

https://github.com/broadinstitute/pooled-cell-painting-profiling-recipe/blob/b8c9288c0ab8b04e4ce284be5ba9aab44a6ce678/0.preprocess-sites/4.image-and-segmentation-qc.py#L345-L357

I am trying to think through what this is doing (so please bear with me!)

There are two different ImageQuality_PercentMaximal types (spots and cells) and this column, in image.csv indicates whether or not that image (either spots or cells) is over saturated? We also want these sites to be output to a file named saturated_sites.csv but only if any saturated sites exist. Is this correct?

If so, then I think we may need to slightly tinker the code to make sure we extract all saturated sites. Sorry I didn't flag this sooner!

ErinWeisbart commented 4 years ago

There are two categories of images - cell painting images (that make the cp_sat_df) and barcoding images (that make the bc_sat_df). Each has a different threshold for triggering a saturated warning - CP if an image is > 1% saturated, BC if an image > .2% saturated (which is determined by the ImageQuality_PercentMaximal_ column in the Image.csv).

My goal was to output a list of sites that fail as saturated_sites.csv but to NOT save an empty .csv so that a user will know that if a .csv is created it means there is a problem.

I had also intended for no site to be listed twice (i.e. if it fails in saturation in both CP and BC then it should only be listed once), but now that I'm thinking about it, an even more helpful solution would be to have any site only ever listed once, but for there to be additional columns added of "Fails_Sat_BC" and "Fails_Sat_CP" and have those populated with True/False accordingly.

gwaybio commented 4 years ago

ok cool - I think I got it.

39 is already getting kinda bloated though - Let's address this issue in the next PR. After we merge #39, do you want to give this enhancement a shot? I'm going to shift my focus early next week to organizing the template, migrating working config files, and actually performing our first weld.

ErinWeisbart commented 4 years ago

Happy to give it a shot.

we may need to slightly tinker the code to make sure we extract all saturated sites

I'm not sure I understand what you were initially concerned about? Does my suggested enhancement of

to have any site only ever listed once, but for there to be additional columns added of "Fails_Sat_BC" and "Fails_Sat_CP" and have those populated with True/False accordingly.

fix your initial concern?

gwaybio commented 4 years ago

yeah, I think reworking the full code block to add this enhancement will address my "tinker comment".

More specifically, the two lines of code of this form: cp_sat_df = image_df[image_df[col] > 1] inside the for loop will be overwritten each time through the loop. Therefore, unless I am missing something, sat_df = cp_sat_df.append(bc_sat_df).drop_duplicates(subset="site") will not contain the intended outcome.