July 20 2020 Discussions (cell count confounders, cell health predictions)

shntnu commented 4 years ago

Let's use this thread to discuss any questions from today @jatinarora-upmc.

shntnu commented 4 years ago

I am copying @gwaygenomics's question here

From Gregory Way to Everyone: (11:55 AM) One thing regarding interpretation of morphology features: is it a bad idea to use the morphology feature selection approach to find candidate genes, but then run follow-up tests on these 6 or so specific genes using all morphology features?

shntnu commented 4 years ago

@jatinarora-upmc Recap of Zernike: See https://github.com/broadinstitute/cmQTL/issues/32#issuecomment-648410790

gwaybio commented 4 years ago

I also had another question about nearest gene to GWAS signals. Do we see any of these pop up? How about GWAS gene neighborhoods?

jatinarora-upmc commented 4 years ago

Quick notes from today's meeting on rare variant burden test on morphology features:

- check SLFN12 (significantly associated with Cytoplasm_Areashape_Zernike_3_1 in any cells) in isolate cells as well
- test for interaction between variant burden in a gene and iPSC source tissue or donor ancestry
- cross-check associations against images and the live cell counter
- include doubling time as a covariate in the association analysis
- the total number of cells in a well can also be included as a proxy for cell cycle
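
A minimal sketch of what the last two points could look like, assuming a plain linear model of one morphology feature on gene burden plus the two covariates. The function names and all toy numbers are made up for illustration; this is not the actual association pipeline.

```python
def solve_ols(X, y):
    """Least-squares coefficients from (X'X) b = X'y via Gauss-Jordan elimination."""
    n, p = len(X), len(X[0])
    # Augmented normal-equation matrix [X'X | X'y].
    A = [
        [sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
        + [sum(X[i][r] * y[i] for i in range(n))]
        for r in range(p)
    ]
    for col in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[r][p] / A[r][r] for r in range(p)]


def fit_burden_model(burden, cell_count, doubling_time, feature):
    """Fit feature ~ intercept + burden + cell_count + doubling_time."""
    X = [[1.0, b, c, d] for b, c, d in zip(burden, cell_count, doubling_time)]
    names = ["intercept", "burden", "cell_count", "doubling_time"]
    return dict(zip(names, solve_ols(X, feature)))


# Toy, noise-free data (every value here is invented).
burden = [0, 1, 0, 2, 1, 0]
cell_count = [100, 150, 200, 120, 180, 90]
doubling_time = [20, 25, 22, 30, 18, 27]
feature = [2.0 + 0.5 * b + 0.01 * c - 0.02 * d
           for b, c, d in zip(burden, cell_count, doubling_time)]

coefs = fit_burden_model(burden, cell_count, doubling_time, feature)
print({name: round(value, 4) for name, value in coefs.items()})
```

The burden coefficient is then the effect of interest after adjusting for cell count and doubling time; a real analysis would also need standard errors and multiple-testing correction.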

jatinarora-upmc commented 4 years ago

> I also had another question about nearest gene to GWAS signals. Do we see any of these pop up? How about GWAS gene neighborhoods?

@gwaygenomics thanks for bringing this up. This is on the to-do list once we are done with the common and rare variant associations. I was also wondering whether cell health can be incorporated as a covariate.

jatinarora-upmc commented 4 years ago

> I am copying @gwaygenomics's question here
>
> From Gregory Way to Everyone: (11:55 AM)
> > One thing regarding interpretation of morphology features: is it a bad idea to use the morphology feature selection approach to find candidate genes but then run followup tests on these 6 or so specific genes but use all morphology features

Not a bad idea. As we saw, while a feature has one or two associated genes, a single gene might impact many features. I think we could do this at the end as a sanity check, to see whether highly correlated features are affected by the same genes.
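
A rough sketch of that sanity check, assuming we have per-feature profiles across wells and a set of associated genes per feature. The feature names, gene names, and the 0.9 cutoff are placeholders, not real results.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def shared_gene_check(profiles, hits, r_min=0.9):
    """For each highly correlated feature pair, report the Jaccard overlap of gene hits."""
    report = []
    for f1, f2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[f1], profiles[f2])
        if abs(r) >= r_min:
            g1, g2 = hits.get(f1, set()), hits.get(f2, set())
            union = g1 | g2
            jaccard = len(g1 & g2) / len(union) if union else 0.0
            report.append((f1, f2, r, jaccard))
    return report


# Placeholder inputs: feature -> values across wells, feature -> associated genes.
profiles = {
    "feat_a": [1.0, 2.0, 3.0, 4.0],
    "feat_b": [2.0, 4.0, 6.0, 8.1],
    "feat_c": [4.0, 1.0, 3.0, 2.0],
}
hits = {"feat_a": {"GENE_1", "GENE_2"}, "feat_b": {"GENE_1"}, "feat_c": {"GENE_3"}}

for f1, f2, r, jaccard in shared_gene_check(profiles, hits):
    print(f1, f2, round(r, 3), jaccard)
```

Highly correlated pairs with a low overlap would be the ones worth a closer look.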

jatinarora-upmc commented 4 years ago

@bethac07 @shntnu Hi Beth, Shantanu, could you help me get the live cell counter information per well?

shntnu commented 4 years ago

Did you mean just cell count (vs. fraction of live cells)? For the former, see *_count.csv in https://github.com/broadinstitute/cmQTL/tree/master/1.profile-cell-lines/profiles. For the latter, we'd need to use models from https://github.com/broadinstitute/cell-health, but that will take some effort. If the latter, can you remind me of the context?

jatinarora-upmc commented 4 years ago

@shntnu Actually, I meant the latter, fraction of live cells. The idea was to know how many good cells we have in a condition like this image. During the last presentation, I wanted to ask your opinion on including cell health as a covariate in my model.

shntnu commented 4 years ago

@jatinarora-upmc Indeed fraction of live cells could be estimated using the Cell Health models like this.

@gwaygenomics What do you feel about Jatin using these models directly? There's no way to evaluate (in this dataset) but we'll know if it's totally off (e.g. if we get crazy numbers). The results could well be totally off the charts because the models were trained on a very different cell line. But certainly worth testing it out IMO (assuming it will take Jatin no more than 2 days to apply and test)

gwaybio commented 4 years ago

> @gwaygenomics What do you feel about Jatin using these models directly? There's no way to evaluate (in this dataset) but we'll know if it's totally off (e.g. if we get crazy numbers). The results could well be totally off the charts because the models were trained on a very different cell line. But certainly worth testing it out IMO (assuming it will take Jatin no more than 2 days to apply and test)

Sounds cool! @jatinarora-upmc and I chatted separately on Slack (sorry for not posting my thoughts earlier), so I will summarize below:

- We can use models "to filter out wells with many dying cells or cells in last phase of apoptosis" (from Jatin)
- "here is a relatively high performing model to figure out percentage of dead cells: https://github.com/broadinstitute/cell-health/blob/master/3.train/models/cell_health_median_target_vb_percent_dead_only_shuffle_False_transform_raw.joblib" (from me)
The plan is:
- Jatin will give me the matrix I need (normalized, but non-feature selected matrix of well x morphology features)?
- I will create a notebook applying the % dead only model to this matrix and output predictions
- I will create some preliminary visualizations describing score distributions, Jatin will interpret and followup
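
For illustration only, here is roughly what the "apply the model to the matrix" step amounts to if the model is treated as a linear one (an intercept plus one weight per feature). That is an assumption, as are all names and numbers below; the real model is the `.joblib` file linked above, which would presumably be loaded with `joblib.load` and applied with its `predict` method if it is a scikit-learn estimator.

```python
def predict_wells(matrix, feature_names, coef, intercept=0.0):
    """Apply a linear model to a well x feature matrix.

    matrix: dict mapping well ID -> list of values aligned with feature_names.
    coef:   dict mapping feature name -> model weight.
    """
    missing = [f for f in coef if f not in feature_names]
    if missing:
        # Refuse to predict when the model needs features the matrix lacks.
        raise KeyError(f"{len(missing)} model features not in matrix: {missing[:5]}")
    index = {name: i for i, name in enumerate(feature_names)}
    return {
        well: intercept + sum(w * row[index[f]] for f, w in coef.items())
        for well, row in matrix.items()
    }


# Hypothetical stand-ins for the normalized, non-feature-selected matrix and a model.
feature_names = ["feat_1", "feat_2", "feat_3"]
matrix = {"A01": [2.0, 9.0, 1.0], "B02": [4.0, 0.0, -1.0]}
coef = {"feat_1": 0.5, "feat_3": -1.0}

predictions = predict_wells(matrix, feature_names, coef, intercept=0.25)
print(predictions)
```

Failing loudly on missing features is deliberate: silently dropping them would change what the model predicts.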

I won't be able to get to this for a couple of days though, so let's brainstorm whether I can do anything else in this time period (but please be gentle and wary of feature creep!)

shntnu commented 4 years ago

Fantastic!

The only other request is: also test a couple of well-performing models that can be easily validated by using CellProfiler features. From the list below, I'd go with cc_all_n_objects and cc_all_nucleus_area_mean (feature mapping is here). Does that sound reasonable @gwaygenomics ?
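
One way that validation could look, as a sketch: rank-correlate the model's predictions with the directly measured CellProfiler value per well (for example, predicted cc_all_n_objects against the measured cell count). The numbers below are invented, and Spearman correlation is my choice here, not something the Cell Health project prescribes.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with ties given the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Invented per-well values: model output vs. measured cell count.
predicted = [0.1, 0.4, 0.35, 0.8, 0.2]
measured = [120, 300, 250, 900, 310]

rho = spearman(predicted, measured)
print(round(rho, 3))
```

A rank correlation sidesteps the fact that predictions and measurements are on different scales; a value near zero would suggest the model does not transfer to this cell line.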

gwaybio commented 4 years ago

that's perfect - will do!

gwaybio commented 4 years ago

I started this analysis today and ran into a roadblock. It turns out there are 506 features measured in the Cell Health project that are not measured in the cmQTL project. Many of these features have nonzero coefficients in the three models we proposed using. The cmQTL data I am using (Jatin sent over a .tab file on Dropbox) has 3,582 features. The missing features are all texture and correlation features.

Unless we can resolve this feature difference, the Cell Health models cannot easily be applied to the cmQTL data and we should abandon this analysis.
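
The check behind that statement can be sketched as follows, assuming the model's coefficients are available as a feature-to-weight mapping. The feature names and weights are hypothetical placeholders, not the real Cell Health coefficients.

```python
def missing_model_features(coef, available_features):
    """Return (all missing features, missing features with nonzero weight)."""
    missing = sorted(set(coef) - set(available_features))
    nonzero = [f for f in missing if coef[f] != 0]
    return missing, nonzero


# Hypothetical model weights and the feature columns present in the cmQTL matrix.
coef = {
    "feat_shape_1": 0.8,
    "feat_texture_1": -0.3,
    "feat_texture_2": 0.0,
    "feat_corr_1": 0.1,
}
cmqtl_features = ["feat_shape_1", "feat_intensity_1"]

missing, nonzero = missing_model_features(coef, cmqtl_features)
print(f"{len(missing)} missing, {len(nonzero)} with nonzero coefficients: {nonzero}")
```

Missing features with a zero coefficient are harmless; only the nonzero ones block applying the model as is.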

gwaybio commented 4 years ago

I added my progress in #51 - if we can resolve this, then outputting predictions can happen very quickly

bethac07 commented 4 years ago

Many of those features may actually still be measured*, just under different names, since IIRC CellHealth was CellProfiler 2 and cmQTL is definitely CellProfiler 3. Is there a list of the unique features from each set somewhere? We may be able to do a fair amount of cross-referencing.

\* The implementation of Texture is pretty different between CellProfiler 2 and 3, but one would HOPE that Texture at a given angle and scale is still useful no matter the implementation.
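
As a starting point for that cross-referencing, a sketch that tallies the features unique to each set by their first two name tokens (compartment and measurement type), assuming both sets follow the usual underscore-separated CellProfiler naming. The example names are invented and are not the actual unmatched features.

```python
from collections import Counter


def unique_by_category(features_a, features_b):
    """Count features unique to each set, grouped by 'Compartment_Measurement'."""
    def category(name):
        return "_".join(name.split("_")[:2])

    only_a = Counter(category(f) for f in set(features_a) - set(features_b))
    only_b = Counter(category(f) for f in set(features_b) - set(features_a))
    return only_a, only_b


# Invented feature names for illustration.
cell_health = [
    "Cells_Texture_Example_3_0",
    "Cells_Correlation_Example_A_B",
    "Nuclei_AreaShape_Area",
]
cmqtl = [
    "Cells_Texture_Example_3_00",
    "Cells_Granularity_Example_1",
    "Nuclei_AreaShape_Area",
]

only_cell_health, only_cmqtl = unique_by_category(cell_health, cmqtl)
print("Cell Health only:", dict(only_cell_health))
print("cmQTL only:", dict(only_cmqtl))
```

Categories that show up on both sides with similar counts (Texture in this toy case) are the likeliest candidates for a pure renaming.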

shntnu commented 4 years ago

Let's split off the cell health-related discussion to this thread https://github.com/broadinstitute/cmQTL/issues/53

jatinarora-upmc commented 4 years ago

@gwaygenomics @bethac07 @shntnu just following up on cell health readouts, was it feasible to align the features?

July 20 2020 Discussions (cell count confounders, cell health predictions) #47