Closed: gwaybio closed this issue 3 years ago.
@gwaygenomics I wonder whether this is a particularly convincing example because IIUC this readout (viability) can be gotten directly from `Cells_Count` in Cell Painting data (after z-scoring wrt DMSO per plate), right?

I couldn't easily check, but I assume the X-axis is `cc_cc_n_objects`? Oh, but maybe it is `cc_all_n_objects`? Either way, I'd imagine `Cells_Count` would report it.
I looked up the model summaries for `cc_all_n_objects` and couldn't figure out whether `Cells_Count`, or its (occasionally misleading) proxy `Cells_Number_Object_Number`, was among them.
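(For concreteness, the per-plate z-scoring against DMSO mentioned above could be sketched as below. The column names `Metadata_Plate`, `Metadata_broad_sample`, and `Cells_Count` are illustrative assumptions, not the dataset's actual schema.)

```python
import pandas as pd

def zscore_wrt_dmso(profiles: pd.DataFrame, value_col: str = "Cells_Count") -> pd.DataFrame:
    """Z-score a readout against the DMSO wells of the same plate."""
    def _per_plate(plate_df: pd.DataFrame) -> pd.DataFrame:
        # Mean and std come from the plate's DMSO (negative control) wells only
        dmso = plate_df.loc[plate_df["Metadata_broad_sample"] == "DMSO", value_col]
        plate_df = plate_df.copy()
        plate_df[value_col + "_z"] = (plate_df[value_col] - dmso.mean()) / dmso.std()
        return plate_df

    return profiles.groupby("Metadata_Plate", group_keys=False).apply(_per_plate)
```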
> I wonder whether this is a particularly convincing example because IIUC this readout (viability) can be gotten directly from `Cells_Count` in Cell Painting data (after z-scoring wrt DMSO per plate), right?
Hmm, interesting, I didn't know `Cells_Count` was a feature typically output in CellProfiler pipelines. I just checked for this dataset, and it doesn't look like it's measured here. Do you know why?
Also, I'm not sure why this isn't a convincing example (convincing in the sense that the cell health approach picks up on the signal of other assay readouts). I agree that it is not the most convincing, since the variable being predicted is essentially cell count. I still think it validates the approach (kind of like a positive control): a specific combination of morphology features (75 non-zero model coefficients, listed below, which does not actually include cell count) provides viability information, and it generalizes to data the model wasn't trained on.
> I couldn't easily check, but I assume the X-axis is `cc_cc_n_objects`?
Yep, that is the x-axis https://github.com/broadinstitute/cell-health/pull/104/files#diff-97f7b948d26a1382ffa7c7c8f44c9398R26
cc_cc_n_objects
features | cell_health_modz_target_cc_cc_n_objects |
---|---|
Cells_Neighbors_AngleBetweenNeighbors_Adjacent | 0.055164946 |
Cells_Neighbors_AngleBetweenNeighbors_10 | 0.055061161 |
Cytoplasm_Texture_SumEntropy_RNA_10_0 | 0.040604382 |
Nuclei_Texture_Entropy_Mito_20_0 | 0.04047636 |
Nuclei_AreaShape_Extent | 0.031029096 |
Nuclei_Granularity_5_DNA | 0.029442897 |
Cytoplasm_Texture_SumEntropy_RNA_5_0 | 0.028476693 |
Cytoplasm_AreaShape_Zernike_3_1 | 0.024596425 |
Cells_Texture_AngularSecondMoment_Mito_5_0 | 0.023562192 |
Cells_Neighbors_SecondClosestObjectNumber_10 | 0.023006226 |
Cells_Neighbors_SecondClosestObjectNumber_Adjacent | 0.022527457 |
Cells_Intensity_LowerQuartileIntensity_Mito | 0.02066621 |
Nuclei_Texture_Variance_RNA_10_0 | 0.020602156 |
Nuclei_Neighbors_SecondClosestObjectNumber_2 | 0.020209564 |
Nuclei_Texture_SumAverage_DNA_10_0 | 0.017204563 |
Nuclei_Number_Object_Number | 0.016955112 |
Cytoplasm_Intensity_MedianIntensity_Mito | 0.016352448 |
Nuclei_Neighbors_FirstClosestObjectNumber_2 | 0.01631829 |
Cells_Number_Object_Number | 0.015590694 |
Cells_Parent_Nuclei | 0.015136083 |
Cells_Neighbors_FirstClosestObjectNumber_10 | 0.014818982 |
Cells_Neighbors_FirstClosestObjectNumber_Adjacent | 0.014687625 |
Cytoplasm_Parent_Nuclei | 0.013747513 |
Cytoplasm_Correlation_RWC_DNA_ER | 0.013497815 |
Cells_AreaShape_Zernike_2_2 | 0.013426709 |
Cytoplasm_Parent_Cells | 0.013285243 |
Cytoplasm_Number_Object_Number | 0.012886616 |
Cytoplasm_Texture_Gabor_DNA_20 | 0.012321906 |
Nuclei_Texture_DifferenceVariance_DNA_10_0 | 0.011527971 |
Cells_Correlation_Overlap_ER_RNA | 0.011446247 |
Cytoplasm_AreaShape_Zernike_7_7 | 0.011257391 |
Nuclei_Granularity_6_DNA | 0.010945255 |
Nuclei_Texture_Variance_RNA_20_0 | 0.010458264 |
Nuclei_Texture_Contrast_Mito_10_0 | 0.009910962 |
Cells_Correlation_RWC_AGP_DNA | 0.009896389 |
Nuclei_Texture_Variance_RNA_5_0 | 0.009315511 |
Nuclei_AreaShape_FormFactor | 0.008371896 |
Cytoplasm_Intensity_LowerQuartileIntensity_Mito | 0.006417364 |
Cells_Granularity_14_ER | 0.005964597 |
Nuclei_Texture_DifferenceEntropy_DNA_10_0 | 0.005666956 |
Nuclei_Texture_Gabor_RNA_20 | 0.004616666 |
Cells_Neighbors_PercentTouching_10 | 0.004230736 |
Nuclei_Texture_Contrast_Mito_5_0 | 0.004108499 |
Nuclei_Location_MaxIntensity_X_AGP | 0.003871409 |
Cytoplasm_Texture_Entropy_DNA_5_0 | 0.003071451 |
Nuclei_Intensity_MADIntensity_Mito | 0.001248981 |
Nuclei_Texture_Variance_AGP_20_0 | 0.000938841 |
Cells_Intensity_MeanIntensityEdge_Mito | 0.000270914 |
Nuclei_AreaShape_Zernike_8_8 | -0.000473131 |
Cytoplasm_Correlation_Correlation_Mito_ER | -0.001319327 |
Cells_RadialDistribution_MeanFrac_ER_3of4 | -0.002680755 |
Cytoplasm_Texture_AngularSecondMoment_DNA_10_0 | -0.002683172 |
Cells_Correlation_Correlation_Mito_ER | -0.002931453 |
Cytoplasm_Correlation_RWC_Mito_AGP | -0.004001291 |
Nuclei_AreaShape_Zernike_2_0 | -0.004981558 |
Cytoplasm_Texture_InfoMeas1_DNA_20_0 | -0.005824239 |
Nuclei_RadialDistribution_MeanFrac_RNA_3of4 | -0.007414817 |
Cytoplasm_RadialDistribution_MeanFrac_ER_1of4 | -0.008121384 |
Cytoplasm_Granularity_13_RNA | -0.009183179 |
Nuclei_RadialDistribution_FracAtD_RNA_3of4 | -0.010865844 |
Nuclei_AreaShape_Zernike_8_4 | -0.012605199 |
Nuclei_AreaShape_Zernike_4_4 | -0.014489444 |
Nuclei_Granularity_9_DNA | -0.018934728 |
Nuclei_Granularity_8_DNA | -0.019095627 |
Cells_Granularity_14_AGP | -0.02049522 |
Cells_RadialDistribution_RadialCV_RNA_3of4 | -0.027511111 |
Cytoplasm_Texture_InfoMeas1_AGP_20_0 | -0.028241055 |
Cells_Correlation_RWC_Mito_AGP | -0.0285302 |
Cytoplasm_AreaShape_Zernike_9_1 | -0.03053135 |
Cytoplasm_Correlation_Correlation_Mito_AGP | -0.032825117 |
Nuclei_AreaShape_Zernike_6_0 | -0.039869409 |
Cells_Correlation_RWC_RNA_AGP | -0.040048675 |
Cytoplasm_Correlation_RWC_Mito_ER | -0.044119235 |
Cells_AreaShape_Zernike_5_3 | -0.057635367 |
Nuclei_Texture_InverseDifferenceMoment_DNA_5_0 | -0.062554238 |
> Hmm, interesting, I didn't know `Cells_Count` was a feature typically output in CellProfiler pipelines. I just checked for this dataset, and it doesn't look like it's measured here. Do you know why?
That is present in the Image.csv file and doesn't get reported in the aggregated profiles in our current workflow (you'd need to run a separate command for that; see https://github.com/broadinstitute/cmQTL/blob/master/1.profile-cell-lines/0.generate-profiles.sh#L235-L247)
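(To sketch that separate aggregation step: the per-image `Count_Cells` column in `Image.csv` can be rolled up to well level with a groupby-sum. The `Metadata_*` column names follow common CellProfiler output conventions and are assumptions here, not checked against this repo.)

```python
import pandas as pd

def well_level_counts(image_df: pd.DataFrame) -> pd.DataFrame:
    """Sum the per-site Count_Cells rows of an Image.csv table up to well level."""
    return (
        image_df
        .groupby(["Metadata_Plate", "Metadata_Well"], as_index=False)["Count_Cells"]
        .sum()
    )
```

Something like `well_level_counts(pd.read_csv("Image.csv"))` would then yield a table shaped like the `BR00106708_count.csv` file linked below.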
However, `Cells_Number_Object_Number` turns out to be a pretty good proxy. E.g., I took a plate from cmQTL (private repo):
Cell counts: https://github.com/broadinstitute/cmQTL/blob/master/1.profile-cell-lines/profiles/BR00106708_count.csv
Profiles: https://github.com/broadinstitute/cmQTL/blob/master/1.profile-cell-lines/profiles/BR00106708_augmented.csv
```r
library(tidyverse)

BR00106708_count <- read_csv("~/work/projects/2018_06_05_cmQTL/workspace/backend/2019_08_15_Batch4/BR00106708/BR00106708_count.csv")
BR00106708 <- read_csv("~/work/projects/2018_06_05_cmQTL/workspace/backend/2019_08_15_Batch4/BR00106708/BR00106708.csv")

x <- BR00106708 %>% select(Metadata_Plate, Metadata_Well, Cells_Number_Object_Number)
y <- BR00106708_count %>% select(Metadata_Plate, Metadata_Well, Count_Cells)
z <- inner_join(x, y)
#> Joining, by = c("Metadata_Plate", "Metadata_Well")

ggplot(z, aes(Count_Cells, Cells_Number_Object_Number)) + geom_point()

with(z, cor(Count_Cells, Cells_Number_Object_Number))
#> [1] 0.9997614
```
Now, you do have `Cells_Number_Object_Number` among your features, so that's good. But it is a bit surprising that it doesn't have a higher weight in your model. Can you plot `Cells_Number_Object_Number` vs `cc_cc_n_objects` in the Cell Health data? This might reveal something.
> Also, I'm not sure why this isn't a convincing example (convincing in the sense that the cell health approach picks up on signal of other assay readouts). I agree that it is not the most convincing, since the variable being predicted is essentially cell count. I think it still does validate the approach (kinda like a positive control) in that a specific combination of morphology features (75 non-zero model coefficients that doesn't actually include cell count) provides viability information and it generalizes to data the model wasn't trained on.
Great point, you are right in that it definitely serves as a positive control in the ways you described, so we should certainly keep this analysis in the paper. I suppose my argument is just that leaving cell count out makes this a toy problem because counts are easily reported by an imaging assay.
But if it turns out that `cc_cc_n_objects` is actually measuring something different from cell counts (which your scatter plot will reveal), then this will be even more interesting (though we'd first need to probe why it is different from cell counts).
@shntnu and I discussed this in person just now. Here is what we did:
We took the model output scores for all samples for two models (`["cc_cc_n_objects", "cc_all_n_objects"]`) and the Cell Painting feature `Cells_Number_Object_Number`. We plotted the scores for each model against this feature, and we observed that the measures are indeed highly correlated.
[figure: `cc_cc_n_objects` model scores vs. `Cells_Number_Object_Number`]

[figure: `cc_all_n_objects` model scores vs. `Cells_Number_Object_Number`]

We see essentially the same pattern for `cc_all_n_objects`.
We'll keep the analysis in the paper, as it does demonstrate a positive control (for the reasons we previously stated in https://github.com/broadinstitute/cell-health/issues/105#issuecomment-588384381). However, it will be important to make clear that this information is contained in Cell Painting assays by default, and that recovering it doesn't require a model.
It is also interesting, as @shntnu noted in https://github.com/broadinstitute/cell-health/issues/105#issuecomment-588472491, that the Cell Painting feature `Cells_Number_Object_Number` is given a relatively low (but still non-zero) weight by the `n_objects` models. We determined that this is likely a result of the correlational structure in Cell Painting data (we don't apply our typical feature selection approach here, since our models induce sparsity on their own).
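(This weight-dilution effect is easy to reproduce on synthetic data. The sketch below assumes an elastic-net-style sparse regressor, which may not be the exact model class used here: when two predictors are near-identical copies of the signal, the penalty splits the weight between them, so neither coefficient looks dominant on its own.)

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)

# Two near-duplicate predictors, mimicking Cells_Number_Object_Number
# and its highly correlated neighbors in Cell Painting data
x1 = signal + rng.normal(scale=0.01, size=n)
x2 = signal + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2])
y = signal + rng.normal(scale=0.1, size=n)

model = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X, y)
# The weight is shared roughly equally between the two correlated columns,
# so each coefficient alone understates the feature's predictive value
print(model.coef_)
```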
```python
import pandas as pd

# Model output scores for the two cell count models
n_obj = pd.DataFrame(all_scores.loc[:, ["cc_cc_n_objects", "cc_all_n_objects"]])

# Combine training and test profiles, keeping the Cell Painting count proxy
x = pd.DataFrame(
    pd.concat(
        [x_train_df.assign(model="train"), x_test_df.assign(model="test")],
        axis="index")
    .loc[:, ["Metadata_cell_line", "Metadata_pert_name", "model", "Cells_Number_Object_Number"]]
)

# Align model scores with the Cell Painting feature by profile index
out = x.merge(n_obj, left_index=True, right_index=True)
out.head()
```
Metadata_cell_line | Metadata_pert_name | model | Cells_Number_Object_Number | cc_cc_n_objects | cc_all_n_objects
---|---|---|---|---|---
HCC44 | RHOA-2 | train | 0.171472 | 0.237067 | 0.378426
A549 | ATP50-1 | train | -0.428993 | 0.042507 | 0.049708
A549 | MTOR-1 | train | -0.088086 | 0.125309 | 0.107420
HCC44 | EZH2-1 | train | 0.634026 | 0.591529 | 0.608119
HCC44 | HIF1a-1 | train | -0.128264 | 0.383806 | 0.352876
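(As a quick numeric check of the "highly correlated" claim, one could compute rank correlations directly from the merged table. The snippet below just re-enters the five `head()` rows shown above, so the numbers are only illustrative; the real check would run on the full `out` table.)

```python
import pandas as pd

# The five head() rows from the merged table above
out_head = pd.DataFrame({
    "Cells_Number_Object_Number": [0.171472, -0.428993, -0.088086, 0.634026, -0.128264],
    "cc_cc_n_objects": [0.237067, 0.042507, 0.125309, 0.591529, 0.383806],
    "cc_all_n_objects": [0.378426, 0.049708, 0.107420, 0.608119, 0.352876],
})

# Spearman correlation of each model score against the Cell Painting feature
corr = out_head.corr(method="spearman")["Cells_Number_Object_Number"]
print(corr)
```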
Thanks for that excellent summary!
Great sleuthing in this thread. It raises the question: why don't we include per-image features in our profiles, in addition to per-cell features?
Second question: I didn't understand how doses were handled; can you clarify? The choices made in doing so might lead to a lot of noise.
> It raises the question of why we don't include per-image features in our profiles, in addition to per-cell features?
Ah, interesting! To clarify my thinking, does this mean that specific whole-image features (like confluence and other things) are computed and then concatenated to the aggregated single-cell profiles?
> I didn't understand how doses were handled, can you clarify? The choices in doing so might lead to a lot of noise.
Thanks for asking this question! I dug into answering this and figured out that I was using less than half of the PRISM data in this analysis. For some reason, the PRISM results are split into "primary" and "secondary" data (seen here). I was only using primary here.
In the repurposing hub data, the doses were reported in micromoles per liter, while the PRISM data does not specify the units of dose clearly. They are likely the same (given the above figures). We know that the values are not exactly the same (see below), so when we recode dose to numerical levels (to account for jitter in the micromoles-per-liter measurements), we do introduce a bit of noise.
After taking a deeper look at the available data (thanks again for the question), I've included more data. After adding this data, the correlation strength increases:
Although it is clear there is not a 1:1 correspondence between doses tested across both datasets, I am less concerned that there are huge systematic differences.
I also have a question out to the CLUE team about dose units in the PRISM assay, so we will know for sure what we're dealing with soon.
Note: Code updated here: 2660212664d78a5864bb8f6c1916dcd749a7bad1
@gwaygenomics I'll chat w you in person now
Thanks for the clarification, @shntnu! Here is ultimately what we discussed (pasted below from a private Confluence page):
Stock concentration in compound plate = Stock concentration of the compound in a well of the compound plate. Typically in mM (millimolar)
Volume transferred to assay plate = Volume of the compound to be transferred from a well in the compound plate to a well in the assay plate. Typically in nL (nanoliter)
Volume in assay plate = Volume of culture (a.k.a. medium) that is present in a well of the assay plate. Typically in uL (microliter)
Assay concentration = Resulting concentration of the compound in a well of the assay plate. Typically in uM (micromolar)
Example from a LINCS Pilot 1 plate (from SQ00015201.csv obtained via Pipeline Pilot > Screening_Vial_and_Plate_Content_from_Barcode_or_Platemap)
Field | Value
---|---
Plate_Barcode | SQ00015201
Well_Position | A10
Broad_Sample | BRD-K50163129-001-01-4
mg_per_ml | 0.129051181571752
mmoles_per_liter | 0.37037036
Solvent | DMSO
Vol_uL | 0.05
Separately, we know that volume of media in assay plate for this particular plate was 50uL.
0.37 mM * 0.05 uL / 50 uL = 0.00037 mM = 0.37 uM
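(That arithmetic as a tiny helper, to make the unit conversion explicit: the mM stock times 1000 gives uM, then the transfer/assay volume ratio dilutes it.)

```python
def assay_concentration_um(stock_mm: float, transfer_ul: float, assay_ul: float) -> float:
    """Final compound concentration in the assay well, in micromolar (uM)."""
    stock_um = stock_mm * 1000  # mM -> uM
    return stock_um * transfer_ul / assay_ul

# The LINCS Pilot 1 example above: 0.37 mM stock, 0.05 uL transferred into 50 uL of media
print(round(assay_concentration_um(0.37037036, 0.05, 50), 2))  # -> 0.37
```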
I start by testing if viability readouts in the PRISM assay for A549 are similar to cell health model estimates. The DepMap data are accessed here.
The data ☝️ are available under CC BY 4.0 and are distributed (with processing code) in #104. I also make sure to attribute Corsello et al. 2019 in a README file included in #104 (in the data/ folder).

In this module, I first process A549 viability data from the Cancer Dependency Map. Next, I merge the viability data with the cell health model scores applied to the same perturbations in A549.
Important Notes
Results
We see high Spearman correlation between the two viability estimates (which is pretty cool!)
cc @AnneCarpenter @shntnu