broadinstitute / pooled-cell-painting-profiling-recipe

:woman_cook: Recipe repository for image-based profiling of Pooled Cell Painting experiments
BSD 3-Clause "New" or "Revised" License

Confusing code block in 4.image-and-segmentation-qc.py #40

Closed gwaybio closed 4 years ago

gwaybio commented 4 years ago

The following code block is failing for me 👇 (also copied in full below)

https://github.com/broadinstitute/pooled-cell-painting-profiling-recipe/blob/6ff26253d6d653f012878ee16a82ad34c089bf53/0.preprocess-sites/4.image-and-segmentation-qc.py#L459-L479

In #39 I am moving the image.csv processing out of 4.image-and-segmentation-qc.py and into an earlier step. This way we can propagate important column metadata through the earlier files, which makes things way less fragile.

Anyway, in the new image.csv processing, I am not finding any columns containing the string "Correlation_Correlation_". Because no column matches that string, corr_cols stays empty and this code block fails.

# Create list of questionable channel correlations (alignments)
corr_df_cols = ["Plate", "Well", "Site", "site"]
corr_cols = []
for col in image_df.columns:
    if "Correlation_Correlation_" in col:
        corr_cols.append(col)
        corr_df_cols.append(col)
image_corr_df = image_df[corr_df_cols]
image_corr_list = []
for col in corr_cols:
    image_corr_list.append(
        image_corr_df.loc[image_corr_df[col] < correlation_threshold]
    )
image_corr_df = pd.concat(image_corr_list).drop_duplicates(subset="site").reset_index()
for col in corr_cols:
    image_corr_df.loc[(image_corr_df[col] >= correlation_threshold), col] = "pass"

if len(image_corr_df.index) > 0:
    corr_output_file = pathlib.Path(results_output, "flagged_correlations.csv")
    if check_if_write(corr_output_file, force, throw_warning=True):
        image_corr_df.to_csv(corr_output_file)

Edit to include the error message (my bad for not including it in the first place):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-79-1da2ddc0ea9c> in <module>
     15         image_corr_df.loc[image_corr_df[col] < correlation_threshold]
     16     )
---> 17 image_corr_df = pd.concat(image_corr_list).drop_duplicates(subset="site").reset_index()
     18 for col in corr_cols:
     19     image_corr_df.loc[(image_corr_df[col] >= correlation_threshold), col] = "pass"

~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    279         verify_integrity=verify_integrity,
    280         copy=copy,
--> 281         sort=sort,
    282     )
    283 

~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    327 
    328         if len(objs) == 0:
--> 329             raise ValueError("No objects to concatenate")
    330 
    331         if keys is None:

ValueError: No objects to concatenate
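
The traceback can be reproduced without any real image data. When no column name contains the substring, `corr_cols` stays empty, nothing is appended to `image_corr_list`, and `pd.concat` receives an empty list. A minimal sketch, assuming a toy `image_df` with made-up values and an illustrative threshold:

```python
import pandas as pd

# Toy stand-in for image_df (made-up values): metadata columns only,
# with no Correlation_Correlation_ measurements.
image_df = pd.DataFrame(
    {"Plate": ["P1"], "Well": ["A1"], "Site": [1], "site": ["P1-A1-1"]}
)
correlation_threshold = 0.9  # illustrative value, not the recipe's setting

# Same selection logic as the block above, written as comprehensions.
corr_cols = [col for col in image_df.columns if "Correlation_Correlation_" in col]
image_corr_list = [
    image_df.loc[image_df[col] < correlation_threshold] for col in corr_cols
]

# With an empty list, pd.concat raises instead of returning an empty frame.
try:
    pd.concat(image_corr_list)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(corr_cols, error_message)
```
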

ErinWeisbart commented 4 years ago

Are you working off of CP151 data? Older batches don't have those measurements.

ErinWeisbart commented 4 years ago

We could add in for this step (as well as the PLLS and saturation plots) a logic that says if those columns can't be found then skip that unit in case a user accidentally removes those measurements from the CP pipeline?
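
One way the skip logic proposed here could look, as a minimal sketch. `columns_containing` and `run_qc_unit` are hypothetical helper names, not functions in the recipe, and the column names and the `ImageQuality_PowerLogLogSlope_` substring are assumptions used only for illustration:

```python
import warnings

import pandas as pd


def columns_containing(df, substring):
    """Return the column names of df that contain substring."""
    return [col for col in df.columns if substring in col]


def run_qc_unit(df, substring, qc_func):
    """Run qc_func(df, cols) if matching columns exist; otherwise warn and skip."""
    cols = columns_containing(df, substring)
    if not cols:
        warnings.warn(f"No columns containing '{substring}' found; skipping this QC unit.")
        return None
    return qc_func(df, cols)


# Toy data: one correlation column, no power-log-log-slope columns.
image_df = pd.DataFrame(
    {"site": ["a", "b"], "Correlation_Correlation_DNA_ER": [0.95, 0.40]}
)

# Columns are present, so the QC function runs.
result = run_qc_unit(
    image_df, "Correlation_Correlation_", lambda df, cols: df[cols].min()
)

# Columns are absent, so the unit is skipped with a warning instead of failing.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    skipped = run_qc_unit(
        image_df, "ImageQuality_PowerLogLogSlope_", lambda df, cols: df[cols].min()
    )
```

Returning `None` rather than raising lets the rest of the QC script continue when a measurement module has been removed from the CellProfiler pipeline.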

gwaybio commented 4 years ago

> Are you working off of CP151 data?

Yep

> We could add in for this step (as well as the PLLS and saturation plots) a logic that says if those columns can't be found then skip that unit in case a user accidentally removes those measurements from the CP pipeline?

I am 100% for this strategy. In general, I'm a tad bit concerned that these QC scripts will fail out if CellProfiler output updates. This is exactly what I was trying to adapt to in the beginning! I will add the checks in #39

ErinWeisbart commented 4 years ago

> > Are you working off of CP151 data?
>
> Yep

There are Correlation_Correlation_ columns in all the Image.csv files output by CP for all the CP151 data, so do we need to take another look at how image_df is created?

ErinWeisbart commented 4 years ago

Oops! I take that back. They are there for CP151A2 and B2 but not for A1 and B1. So sorry about the confusion.

ErinWeisbart commented 4 years ago

They should be in all future batches, but adding in the check if column exists should solve the issue now and in the future :)

gwaybio commented 4 years ago

Got it! Super helpful comments

I added column checks in eff296e32402fb92981d544c2fccc3eeadbc1eaf. I can also confirm that the code runs on the other wells, where Correlation_Correlation_ columns are present.
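
For readers without access to that commit, a guarded version of the correlation block might look like the sketch below. This is not the code from the commit. `flag_low_correlations` is a hypothetical name, the data is made up, and it swaps the per-column loop and `pd.concat` for a single boolean mask. Unlike the original block, it keeps the numeric values instead of overwriting passing ones with "pass", and it drops the old index.

```python
import pandas as pd


def flag_low_correlations(
    image_df, correlation_threshold, id_cols=("Plate", "Well", "Site", "site")
):
    """Return sites with any correlation below the threshold.

    Returns None when image_df has no Correlation_Correlation_ columns,
    so the caller can skip writing the flagged-correlations file.
    """
    corr_cols = [col for col in image_df.columns if "Correlation_Correlation_" in col]
    if not corr_cols:
        return None  # measurements absent: skip instead of failing

    # True wherever a correlation falls below the threshold.
    low = image_df[corr_cols] < correlation_threshold
    flagged = image_df.loc[low.any(axis=1), list(id_cols) + corr_cols]
    return flagged.drop_duplicates(subset="site").reset_index(drop=True)


# Toy data with a hypothetical channel pair; only the second site falls below 0.9.
with_corr = pd.DataFrame(
    {
        "Plate": ["P1", "P1"],
        "Well": ["A2", "A2"],
        "Site": [1, 2],
        "site": ["P1-A2-1", "P1-A2-2"],
        "Correlation_Correlation_DNA_ER": [0.95, 0.40],
    }
)
without_corr = with_corr.drop(columns=["Correlation_Correlation_DNA_ER"])

flagged = flag_low_correlations(with_corr, 0.9)
skipped = flag_low_correlations(without_corr, 0.9)
```
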