broadinstitute / cmQTL

High-dimensional phenotyping to define the genetic basis of cellular morphology
BSD 3-Clause "New" or "Revised" License

June 15 2020 Discussions (figuring out how to triage features based on reproducibility and interpretability) #44

Closed shntnu closed 2 years ago

shntnu commented 4 years ago

Let's use this thread to discuss any questions from today @sasgari @jatinarora-upmc.

jatinarora-upmc commented 4 years ago

thanks @shntnu.

  1. When doing rare variant burden test: for a given feature, would you recommend checking correlation among replicates and use it if it has high correlation?
  2. There was another suggestion to select the interpretable feature when selecting one of several correlated features - any suggestion how to do it other than manually? e.g. any special feature class?

shntnu commented 4 years ago

  • When doing rare variant burden test: for a given feature, would you recommend checking correlation among replicates and use it if it has high correlation?

I wouldn't filter based on that, but it's certainly good to report the replicate reproducibility of the features at the aggregate level. I have not thought through whether dropping features would bias the analysis in any way, so for now it's best to keep them but definitely report them. I'll post here on how to do that.

  • There was another suggestion to select the interpretable feature when selecting one of several correlated features - any suggestion how to do it other than manually? e.g. any special feature class?

This is not straightforward to do in an automated fashion, but once you filter down to a handful of features that you are going to probe, you can follow a procedure like this

shntnu commented 4 years ago

This code snippet computes replicate correlations for each feature. The result is attached. Somewhat sparse documentation of the function is here

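library(tidyverse)  # provides %>%, map_df, read_csv, str_subset, str_detect, separate, ggplot

# plates whose profiles are pooled
plates <-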
  c(
    "BR00106708",
    "BR00106709",
    "BR00107338",
    "BR00107339",
    "cmqtlpl1.5-31-2019-mt",
    "cmqtlpl261-2019-mt"
  )

profile_files <-
  file.path("1.profile-cell-lines/profiles/",
            paste(plates, "augmented.csv", sep = "_"))

profiles <- profile_files %>% map_df(read_csv)

replicate_correlation_values <- 
  cytominer::replicate_correlation(
    profiles, 
    names(profiles) %>% str_subset("Cells_|Cytoplasm_|Nuclei_"), 
    strata = "Metadata_line_ID", 
    replicates = 8, 
    split_by = "Metadata_Plate",
    cores = 8)

# drop Manders, Costes, and features that measure the Z axis
# (a bare "_Z" would also match "_Zernike", so anchor it: "_Z_" or a trailing "_Z")
replicate_correlation_values %>%
  filter(!str_detect(variable, "Costes|Manders|_Z_|_Z$")) %>%
  write_csv("replicate_correlation_values.csv")

replicate_correlation_values.txt
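
For intuition about what a replicate correlation measures, here is a minimal base R sketch for a single feature on a single plate. It is not the cytominer implementation; it only illustrates the idea under my own assumptions: arrange the feature as a cell lines x replicates matrix, correlate the replicate columns pairwise, and summarise with the median. The data are simulated and the names (`line_effect`, `values`, `replicate_cor`) are made up for the example.

```r
# Toy data (simulated, not from the cmQTL plates):
# 5 cell lines x 3 replicates of one feature on one plate.
set.seed(1)
line_effect <- rnorm(5)  # per-line signal shared by all replicates

# rows = cell lines, columns = replicates (signal + replicate-specific noise)
values <- sapply(1:3, function(i) line_effect + rnorm(5, sd = 0.3))

# pairwise Pearson correlation between replicate columns
cor_mat <- cor(values)

# summarise the off-diagonal entries by their median
replicate_cor <- median(cor_mat[upper.tri(cor_mat)])
replicate_cor
```

A feature whose replicates agree across cell lines gives a value near 1; a feature dominated by noise gives a value near 0.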

And here's a quick way to see how it could be useful when inspecting the Zernike features of the cell.

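replicate_correlation_values %>% 
  filter(str_detect(variable, "Cells_AreaShape_Zernike")) %>% 
  # split e.g. Cells_AreaShape_Zernike_8_0 into its parts; n and m are the two Zernike indices
  separate("variable", c("x1", "x2", "x3", "n", "m")) %>% 
  # label each (n, m) position with the median replicate correlation
  ggplot(aes(n, m, size = median, label = sprintf("%.2f", median))) + 
  geom_label()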

image

So, very roughly, I'd be worried if you had 8_0 or 8_4 showing up as having a strong genetic basis.
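
If it helps when reporting, the same table can be used to list the features with low replicate correlation next to any association results. This is only a sketch: `toy_values` is a made-up stand-in for `replicate_correlation_values` (same `variable` and `median` columns, invented feature names and numbers), and the cutoff of 0.5 is arbitrary, not a recommendation.

```r
library(dplyr)

# Made-up stand-in for replicate_correlation_values (values are illustrative only)
toy_values <- tibble(
  variable = c("feature_a", "feature_b", "feature_c"),
  median   = c(0.90, 0.30, 0.10)
)

threshold <- 0.5  # arbitrary cutoff, chosen only for this example

# features to flag in a report (not to drop), least reproducible first
low_reproducibility <- toy_values %>%
  filter(median < threshold) %>%
  arrange(median)

low_reproducibility
```

This flags features rather than filtering them out, in line with keeping all features in the analysis but reporting their reproducibility.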