broadinstitute / cell-health

Predicting Cell Health with Morphological Profiles
MIT License

Cell health cell count model is correlated to CellProfiler cc_cc_n_objects but does not use the feature in the model #105

Closed: gwaybio closed this issue 3 years ago

gwaybio commented 4 years ago

I start by testing whether viability readouts in the PRISM assay for A549 are similar to the cell health model estimates. The DepMap data are accessed here.

The data ☝️ are available under CC BY 4.0 and are distributed (with processing code) in #104. I also make sure to attribute Corsello et al. 2019 in a README file included in #104 (in the data/ folder).

In this module, I first process A549 viability data from the Cancer Dependency Map. Next, I merge the viability data with the cell health model scores applied to the same perturbations in A549.
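Roughly, the processing looks like the sketch below (a minimal sketch, not the exact code in #104; the file names, column names, and merge key are placeholders):

```python
import pandas as pd

# Placeholder inputs (not the actual file names in #104):
# - depmap_viability.csv: PRISM viability readouts for A549, one row per perturbation
# - cell_health_scores.csv: cell health model scores for the same A549 perturbations
viability_df = pd.read_csv("depmap_viability.csv")
cell_health_df = pd.read_csv("cell_health_scores.csv")

# Merge the two readouts on the perturbation identifier (assumed column name)
merged_df = viability_df.merge(cell_health_df, on="Metadata_pert_name", how="inner")
```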

Important Notes

Results

We see high Spearman correlation between the two viability estimates (which is pretty cool!)

(figure: viability_results, scatter plot comparing the two viability estimates)
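The Spearman correlation itself is computed along these lines (a sketch continuing from the merged data frame above; the viability column name is a placeholder):

```python
from scipy.stats import spearmanr

# Spearman is a rank correlation, so the two readouts only need to order
# perturbations the same way; they can live on different scales
rho, p_value = spearmanr(merged_df["prism_viability"], merged_df["cc_cc_n_objects"])
print(rho, p_value)
```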

cc @AnneCarpenter @shntnu

shntnu commented 4 years ago

@gwaygenomics I wonder whether this is a particularly convincing example because IIUC this readout (viability) can be gotten directly from Cells_Count in Cell Painting data (after z-scoring wrt DMSO per plate), right?

I couldn't easily check, but I assume the X-axis is cc_cc_n_objects?

Oh but maybe it is cc_all_n_objects? Either way, I'd imagine Cells_Count would report it?
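For concreteness, the per-plate z-scoring wrt DMSO I have in mind is roughly this (a pandas sketch; the column names and the DMSO label are assumptions):

```python
import pandas as pd

def zscore_to_dmso(profiles: pd.DataFrame, value_col: str = "Cells_Count") -> pd.Series:
    """Z-score a per-well readout against the DMSO wells on the same plate."""

    def _zscore(plate_df: pd.DataFrame) -> pd.Series:
        dmso = plate_df.loc[plate_df["Metadata_pert_name"] == "DMSO", value_col]
        return (plate_df[value_col] - dmso.mean()) / dmso.std()

    return profiles.groupby("Metadata_Plate", group_keys=False).apply(_zscore)
```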

shntnu commented 4 years ago

I looked up the model summaries for cc_all_n_objects and couldn't figure out whether Cells_Count, or its (occasionally misleading) proxy Cells_Number_Object_Number was among them.

gwaybio commented 4 years ago

> I wonder whether this is a particularly convincing example because IIUC this readout (viability) can be gotten directly from Cells_Count in Cell Painting data (after z-scoring wrt DMSO per plate), right?

Hmm, interesting, I didn't know Cells_Count was a feature typically output in CellProfiler pipelines. I just checked for this dataset, and it doesn't look like it's measured here. Do you know why?

Also, I'm not sure why this isn't a convincing example (convincing in the sense that the cell health approach picks up on the signal of other assay readouts). I agree that it is not the most convincing, since the variable being predicted is essentially cell count. I think it still validates the approach (kind of like a positive control) in that a specific combination of morphology features (75 non-zero model coefficients, listed below, none of which is cell count) provides viability information and generalizes to data the model wasn't trained on.

> I couldn't easily check, but I assume the X-axis is cc_cc_n_objects?

Yep, that is the x-axis https://github.com/broadinstitute/cell-health/pull/104/files#diff-97f7b948d26a1382ffa7c7c8f44c9398R26

Predicting cc_cc_n_objects

| features | cell_health_modz_target_cc_cc_n_objects |
| --- | --- |
| Cells_Neighbors_AngleBetweenNeighbors_Adjacent | 0.055164946 |
| Cells_Neighbors_AngleBetweenNeighbors_10 | 0.055061161 |
| Cytoplasm_Texture_SumEntropy_RNA_10_0 | 0.040604382 |
| Nuclei_Texture_Entropy_Mito_20_0 | 0.04047636 |
| Nuclei_AreaShape_Extent | 0.031029096 |
| Nuclei_Granularity_5_DNA | 0.029442897 |
| Cytoplasm_Texture_SumEntropy_RNA_5_0 | 0.028476693 |
| Cytoplasm_AreaShape_Zernike_3_1 | 0.024596425 |
| Cells_Texture_AngularSecondMoment_Mito_5_0 | 0.023562192 |
| Cells_Neighbors_SecondClosestObjectNumber_10 | 0.023006226 |
| Cells_Neighbors_SecondClosestObjectNumber_Adjacent | 0.022527457 |
| Cells_Intensity_LowerQuartileIntensity_Mito | 0.02066621 |
| Nuclei_Texture_Variance_RNA_10_0 | 0.020602156 |
| Nuclei_Neighbors_SecondClosestObjectNumber_2 | 0.020209564 |
| Nuclei_Texture_SumAverage_DNA_10_0 | 0.017204563 |
| Nuclei_Number_Object_Number | 0.016955112 |
| Cytoplasm_Intensity_MedianIntensity_Mito | 0.016352448 |
| Nuclei_Neighbors_FirstClosestObjectNumber_2 | 0.01631829 |
| Cells_Number_Object_Number | 0.015590694 |
| Cells_Parent_Nuclei | 0.015136083 |
| Cells_Neighbors_FirstClosestObjectNumber_10 | 0.014818982 |
| Cells_Neighbors_FirstClosestObjectNumber_Adjacent | 0.014687625 |
| Cytoplasm_Parent_Nuclei | 0.013747513 |
| Cytoplasm_Correlation_RWC_DNA_ER | 0.013497815 |
| Cells_AreaShape_Zernike_2_2 | 0.013426709 |
| Cytoplasm_Parent_Cells | 0.013285243 |
| Cytoplasm_Number_Object_Number | 0.012886616 |
| Cytoplasm_Texture_Gabor_DNA_20 | 0.012321906 |
| Nuclei_Texture_DifferenceVariance_DNA_10_0 | 0.011527971 |
| Cells_Correlation_Overlap_ER_RNA | 0.011446247 |
| Cytoplasm_AreaShape_Zernike_7_7 | 0.011257391 |
| Nuclei_Granularity_6_DNA | 0.010945255 |
| Nuclei_Texture_Variance_RNA_20_0 | 0.010458264 |
| Nuclei_Texture_Contrast_Mito_10_0 | 0.009910962 |
| Cells_Correlation_RWC_AGP_DNA | 0.009896389 |
| Nuclei_Texture_Variance_RNA_5_0 | 0.009315511 |
| Nuclei_AreaShape_FormFactor | 0.008371896 |
| Cytoplasm_Intensity_LowerQuartileIntensity_Mito | 0.006417364 |
| Cells_Granularity_14_ER | 0.005964597 |
| Nuclei_Texture_DifferenceEntropy_DNA_10_0 | 0.005666956 |
| Nuclei_Texture_Gabor_RNA_20 | 0.004616666 |
| Cells_Neighbors_PercentTouching_10 | 0.004230736 |
| Nuclei_Texture_Contrast_Mito_5_0 | 0.004108499 |
| Nuclei_Location_MaxIntensity_X_AGP | 0.003871409 |
| Cytoplasm_Texture_Entropy_DNA_5_0 | 0.003071451 |
| Nuclei_Intensity_MADIntensity_Mito | 0.001248981 |
| Nuclei_Texture_Variance_AGP_20_0 | 0.000938841 |
| Cells_Intensity_MeanIntensityEdge_Mito | 0.000270914 |
| Nuclei_AreaShape_Zernike_8_8 | -0.000473131 |
| Cytoplasm_Correlation_Correlation_Mito_ER | -0.001319327 |
| Cells_RadialDistribution_MeanFrac_ER_3of4 | -0.002680755 |
| Cytoplasm_Texture_AngularSecondMoment_DNA_10_0 | -0.002683172 |
| Cells_Correlation_Correlation_Mito_ER | -0.002931453 |
| Cytoplasm_Correlation_RWC_Mito_AGP | -0.004001291 |
| Nuclei_AreaShape_Zernike_2_0 | -0.004981558 |
| Cytoplasm_Texture_InfoMeas1_DNA_20_0 | -0.005824239 |
| Nuclei_RadialDistribution_MeanFrac_RNA_3of4 | -0.007414817 |
| Cytoplasm_RadialDistribution_MeanFrac_ER_1of4 | -0.008121384 |
| Cytoplasm_Granularity_13_RNA | -0.009183179 |
| Nuclei_RadialDistribution_FracAtD_RNA_3of4 | -0.010865844 |
| Nuclei_AreaShape_Zernike_8_4 | -0.012605199 |
| Nuclei_AreaShape_Zernike_4_4 | -0.014489444 |
| Nuclei_Granularity_9_DNA | -0.018934728 |
| Nuclei_Granularity_8_DNA | -0.019095627 |
| Cells_Granularity_14_AGP | -0.02049522 |
| Cells_RadialDistribution_RadialCV_RNA_3of4 | -0.027511111 |
| Cytoplasm_Texture_InfoMeas1_AGP_20_0 | -0.028241055 |
| Cells_Correlation_RWC_Mito_AGP | -0.0285302 |
| Cytoplasm_AreaShape_Zernike_9_1 | -0.03053135 |
| Cytoplasm_Correlation_Correlation_Mito_AGP | -0.032825117 |
| Nuclei_AreaShape_Zernike_6_0 | -0.039869409 |
| Cells_Correlation_RWC_RNA_AGP | -0.040048675 |
| Cytoplasm_Correlation_RWC_Mito_ER | -0.044119235 |
| Cells_AreaShape_Zernike_5_3 | -0.057635367 |
| Nuclei_Texture_InverseDifferenceMoment_DNA_5_0 | -0.062554238 |
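For reference, the estimate from these coefficients is just a weighted sum of the (standardized) feature values; a minimal sketch, assuming the table above is saved as a two-column CSV (the file name is hypothetical, and the intercept term is not shown above):

```python
import pandas as pd

# Hypothetical file containing the two columns shown in the table above
coef_df = pd.read_csv("cc_cc_n_objects_coefficients.csv")
print(len(coef_df))  # the 75 non-zero coefficients

# Given profiles with matching (standardized) feature columns, the model
# estimate is a dot product of feature values and coefficients
def predict_cc_cc_n_objects(profiles: pd.DataFrame) -> pd.Series:
    coefs = coef_df.set_index("features")["cell_health_modz_target_cc_cc_n_objects"]
    return profiles.loc[:, coefs.index] @ coefs
```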
shntnu commented 4 years ago

> Hmm, interesting, I didn't know Cells_Count was a feature typically output in CellProfiler pipelines. I just checked for this dataset, and it doesn't look like it's measured here. Do you know why?

That is present in the Image.csv file and doesn't get reported in the aggregated profiles in our current workflow (you'd need to run a separate command for that; see https://github.com/broadinstitute/cmQTL/blob/master/1.profile-cell-lines/0.generate-profiles.sh#L235-L247)
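In pandas terms, pulling those per-well counts out of the per-image data would look roughly like this (a sketch; the SQLite file, table, and column names are assumptions about the backend schema):

```python
import sqlite3

import pandas as pd

# Assumed: a per-plate SQLite backend with an Image table holding one row per imaging site
with sqlite3.connect("BR00106708.sqlite") as conn:
    image_df = pd.read_sql(
        "SELECT Image_Metadata_Plate, Image_Metadata_Well, Image_Count_Cells FROM Image",
        conn,
    )

# Sum the cell counts over the imaging sites within each well
count_df = (
    image_df
    .groupby(["Image_Metadata_Plate", "Image_Metadata_Well"], as_index=False)
    .agg(Count_Cells=("Image_Count_Cells", "sum"))
)
```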

However, Cells_Number_Object_Number turns out to be a pretty good proxy. E.g., I took a plate from cmQTL (private repo):

- Cell counts: https://github.com/broadinstitute/cmQTL/blob/master/1.profile-cell-lines/profiles/BR00106708_count.csv
- Profiles: https://github.com/broadinstitute/cmQTL/blob/master/1.profile-cell-lines/profiles/BR00106708_augmented.csv

```r
library(tidyverse)
BR00106708_count <- read_csv("~/work/projects/2018_06_05_cmQTL/workspace/backend/2019_08_15_Batch4/BR00106708/BR00106708_count.csv")
BR00106708 <- read_csv("~/work/projects/2018_06_05_cmQTL/workspace/backend/2019_08_15_Batch4/BR00106708/BR00106708.csv")
x <- BR00106708 %>% select(Metadata_Plate, Metadata_Well, Cells_Number_Object_Number)
y <- BR00106708_count %>% select(Metadata_Plate, Metadata_Well, Count_Cells)
z <- inner_join(x, y)
#> Joining, by = c("Metadata_Plate", "Metadata_Well")
ggplot(z, aes(Count_Cells, Cells_Number_Object_Number)) + geom_point()

with(z, cor(Count_Cells, Cells_Number_Object_Number))
#> [1] 0.9997614
```

Now you do have Cells_Number_Object_Number among your features, so that's good. But it is a bit surprising that it doesn't have a higher weight in your model. Can you plot Cells_Number_Object_Number vs cc_cc_n_objects in the Cell Health data? This might reveal something.

> Also, I'm not sure why this isn't a convincing example (convincing in the sense that the cell health approach picks up on the signal of other assay readouts). I agree that it is not the most convincing, since the variable being predicted is essentially cell count. I think it still validates the approach (kind of like a positive control) in that a specific combination of morphology features (75 non-zero model coefficients, listed below, none of which is cell count) provides viability information and generalizes to data the model wasn't trained on.

Great point, you are right in that it definitely serves as a positive control in the ways you described, so we should certainly keep this analysis in the paper. I suppose my argument is just that leaving cell count out makes this a toy problem because counts are easily reported by an imaging assay.

But if it turns out that cc_cc_n_objects is actually measuring something different from cell counts (which your scatter plot will reveal), then this will be even more interesting (but we'd first need to probe why it is different from cell counts).

gwaybio commented 4 years ago

@shntnu and I discussed this in person just now. Here is what we did:

We took the model output scores for all samples from two models (cc_cc_n_objects and cc_all_n_objects), along with the Cell Painting feature Cells_Number_Object_Number. We plotted the scores for each model against this feature and observed that the measures are indeed highly correlated.

cc_cc_n_objects

(scatter plot: cc_cc_n_objects model score vs. Cells_Number_Object_Number)

The strange tails are driven by cell line:

(the same relationship, broken out by cell line)

cc_all_n_objects

We see essentially the same pattern for cc_all_n_objects

(scatter plot: cc_all_n_objects model score vs. Cells_Number_Object_Number)

High Pearson Correlation

(Pearson correlations between the model scores and Cells_Number_Object_Number)

Conclusion

We'll keep the analysis in the paper, as it does demonstrate a positive control (for the reasons we previously stated in https://github.com/broadinstitute/cell-health/issues/105#issuecomment-588384381). However, it will be important to make clear that this information is contained in Cell Painting assays by default and doesn't require a model to predict.

It is also interesting, as @shntnu noted in https://github.com/broadinstitute/cell-health/issues/105#issuecomment-588472491, that the Cell Painting feature Cells_Number_Object_Number is given a relatively low (but still non-zero) weight by the n_objects models. We determined that this is likely a result of the correlational structure in Cell Painting data (we don't do our typical feature selection approach here since our models induce sparsity on their own).
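To illustrate the correlated-features point (a toy scikit-learn sketch, not our actual training code): when two features carry nearly the same signal, a regularized linear model spreads the weight across them, so neither one ends up with a large coefficient on its own.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n = 500

# Two nearly identical features (think Cells_Number_Object_Number and other
# density-related neighbor/texture features), plus a target driven by their shared signal
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)

model = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(X, y)
print(model.coef_)  # the weight is shared between the two correlated columns
```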

Code

```python
import pandas as pd

# all_scores, x_train_df, and x_test_df come from earlier notebook cells
n_obj = pd.DataFrame(all_scores.loc[:, ["cc_cc_n_objects", "cc_all_n_objects"]])

x = pd.DataFrame(
    pd.concat(
        [x_train_df.assign(model="train"), x_test_df.assign(model="test")],
        axis="rows")
    .loc[:, ["Metadata_cell_line", "Metadata_pert_name", "model", "Cells_Number_Object_Number"]]
)

out = x.merge(n_obj, left_index=True, right_index=True)
out.head()
```
| Metadata_cell_line | Metadata_pert_name | model | Cells_Number_Object_Number | cc_cc_n_objects | cc_all_n_objects |
| --- | --- | --- | --- | --- | --- |
| HCC44 | RHOA-2 | train | 0.171472 | 0.237067 | 0.378426 |
| A549 | ATP50-1 | train | -0.428993 | 0.042507 | 0.049708 |
| A549 | MTOR-1 | train | -0.088086 | 0.125309 | 0.107420 |
| HCC44 | EZH2-1 | train | 0.634026 | 0.591529 | 0.608119 |
| HCC44 | HIF1a-1 | train | -0.128264 | 0.383806 | 0.352876 |
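The Pearson correlations shown above come from the same merged frame, along these lines (a sketch, not necessarily the exact plotting code):

```python
score_cols = ["Cells_Number_Object_Number", "cc_cc_n_objects", "cc_all_n_objects"]

# Overall pairwise Pearson correlations
print(out.loc[:, score_cols].corr(method="pearson"))

# And split by cell line, since the tails above are cell-line driven
print(out.groupby("Metadata_cell_line")[score_cols].corr(method="pearson"))
```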
shntnu commented 4 years ago

Thanks for that excellent summary!

AnneCarpenter commented 4 years ago

Great sleuthing in this thread. It raises the question of why we don't include per-image features in our profiles, in addition to per-cell features.

Second question: I didn't understand how doses were handled; can you clarify? The choices made there might lead to a lot of noise.

gwaybio commented 4 years ago

> It raises the question of why we don't include per-image features in our profiles, in addition to per-cell features.

Ah, interesting! To clarify my thinking, does this mean that specific whole-image features are computed (like confluence and other things?) and then concatenated to the aggregated single-cell profiles?

> I didn't understand how doses were handled; can you clarify? The choices made there might lead to a lot of noise.

Thanks for asking this question! I dug into answering it and figured out that I was using less than half of the PRISM data in this analysis. For some reason, the PRISM results are split into "primary" and "secondary" data (seen here). I was only using the primary data here.

PRISM Dose Info

(screenshot: PRISM dose information)

Drug Repurposing

(screenshot: Drug Repurposing Hub dose information)

In the Repurposing Hub data, the doses are reported in micromoles per liter, while the PRISM data does not clearly specify the dose units. The units are likely the same (given the above figures). We know that the values are not exactly the same (see below), so when we recode dose to numerical levels (to account for jitter in the micromoles per liter measurements), we do introduce a bit of noise.
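Roughly what I mean by recoding dose to levels (a sketch, not the exact code in the commit referenced below; the column names are assumptions):

```python
import pandas as pd

def recode_dose_levels(df: pd.DataFrame) -> pd.DataFrame:
    """Convert doses to integer levels within each compound.

    Applied separately to each dataset, nominally identical doses that are
    slightly jittered across datasets then match on level rather than on the
    exact micromolar value.
    """
    df = df.copy()
    df["Metadata_dose_level"] = (
        df.groupby("Metadata_broad_sample")["Metadata_dose"]
        .rank(method="dense")
        .astype(int)
    )
    return df
```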

Dose Differences

Conclusion

After taking a deeper look at the available data (thanks again for the question), I've included more of it, and with the additional data the correlation strength increases:

Viability Correlation

Although it is clear there is not a 1:1 correspondence between doses tested across both datasets, I am less concerned that there are huge systematic differences.

I also have a question out to the CLUE team about dose units in the PRISM assay, so we will know for sure what we're dealing with soon.

Note: Code updated here: 2660212664d78a5864bb8f6c1916dcd749a7bad1

shntnu commented 4 years ago

@gwaygenomics I'll chat w you in person now

gwaybio commented 4 years ago

Thanks for the clarification @shntnu! Here is ultimately what we discussed (pasted below from a private Confluence page):


Stock concentration in compound plate = Stock concentration of the compound in a well of the compound plate. Typically in mM (millimolar)

Volume transferred to assay plate = Volume of the compound to be transferred from a well in the compound plate to a well in the assay plate. Typically in nL (nanoliter)

Volume in assay plate = Volume of culture (a.k.a. medium) that is present in a well of the assay plate. Typically in uL (microliter)

Assay concentration = Resulting concentration of the compound in a well of the assay plate. Typically in uM (micromolar)

Example from a LINCS Pilot 1 plate (from SQ00015201.csv obtained via Pipeline Pilot > Screening_Vial_and_Plate_Content_from_Barcode_or_Platemap)

| field | value |
| --- | --- |
| Plate_Barcode | SQ00015201 |
| Well_Position | A10 |
| Broad_Sample | BRD-K50163129-001-01-4 |
| mg_per_ml | 0.129051181571752 |
| mmoles_per_liter | 0.37037036 |
| Solvent | DMSO |
| Vol_uL | 0.05 |

Separately, we know that the volume of media in the assay plate for this particular plate was 50 uL.

0.37 mM * 0.05 uL / 50 uL = 0.37 uM
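As a sanity check, the same calculation in code (a minimal sketch; the function name and unit handling are mine, not from the platemap tooling):

```python
def assay_concentration_um(stock_mm: float, vol_transferred_ul: float, vol_assay_ul: float) -> float:
    """Assay plate concentration in uM, given a stock concentration in mM."""
    # Dilute by the volume ratio, then convert mM -> uM (x 1000)
    return stock_mm * (vol_transferred_ul / vol_assay_ul) * 1000


# The SQ00015201 / A10 example above: 0.37 mM stock, 0.05 uL transferred into 50 uL of media
print(assay_concentration_um(0.37037036, 0.05, 50))  # ~0.37 uM
```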