June 15 2020 Discussions (figuring out how to triage features based on reproducibility and interpretability)

shntnu commented 4 years ago

Let's use this thread to discuss any questions from today @sasgari @jatinarora-upmc.

jatinarora-upmc commented 4 years ago

Thanks @shntnu.

When doing a rare variant burden test: for a given feature, would you recommend checking the correlation among replicates and using the feature if the correlation is high?
There was another suggestion to pick the interpretable feature when selecting one of several correlated features. Any suggestions on how to do this other than manually? E.g. is there any special feature class?

shntnu commented 4 years ago

> When doing a rare variant burden test: for a given feature, would you recommend checking the correlation among replicates and using the feature if the correlation is high?

I wouldn't filter based on that, but it is certainly good to report the replicate reproducibility of the features at the aggregate level. I have not thought through whether dropping features would bias the analysis in any way, so for now it's best to keep them but definitely report their reproducibility. I'll post here on how to do that.

> There was another suggestion to pick the interpretable feature when selecting one of several correlated features. Any suggestions on how to do this other than manually? E.g. is there any special feature class?

This is not straightforward to do in an automated fashion, but once you have filtered down to a handful of features that you are going to probe, you can follow a procedure like this:

1. For each feature, create a list of the features that are strongly correlated with it.
2. Run your tests on each of these correlated features to make sure your observations are the same for each as they were for the original feature.
3. Share this list with us and we can provide some insights. E.g. some Zernike features strongly correlate with more directly interpretable features like elongation and compactness.
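The first step of the procedure above could be sketched roughly as follows. This is only an illustration in base R: the helper name `correlated_features`, the 0.9 cutoff, and the toy data are all assumptions, and the toy values are synthetic rather than taken from the cmQTL profiles.

```r
# Hypothetical helper: for each probed feature, list the other features whose
# absolute Pearson correlation with it is at least `cutoff`.
# The default cutoff of 0.9 is an assumption, not a recommended value.
correlated_features <- function(profiles, features, probes, cutoff = 0.9) {
  # Correlations between the probed features (rows) and all features (columns)
  cor_matrix <- cor(profiles[, probes, drop = FALSE],
                    profiles[, features, drop = FALSE],
                    use = "pairwise.complete.obs")
  lapply(setNames(probes, probes), function(probe) {
    r <- setNames(cor_matrix[probe, ], colnames(cor_matrix))
    # Drop the probe itself and any undefined correlations
    hits <- r[names(r) != probe & !is.na(r) & abs(r) >= cutoff]
    # Strongest correlations first
    hits[order(-abs(hits))]
  })
}

# Synthetic stand-in for the real profiles: the second column is built to
# track the first, and the third is independent noise.
set.seed(1)
shape <- rnorm(100)
toy <- data.frame(
  Cells_AreaShape_Compactness = shape,
  Cells_AreaShape_Zernike_2_0 = shape + rnorm(100, sd = 0.1),
  Cells_AreaShape_Area = rnorm(100)
)

related <- correlated_features(
  toy,
  features = names(toy),
  probes = "Cells_AreaShape_Zernike_2_0"
)
print(related)
```

Each element of the returned list is a named vector of correlations, so it can be used directly as the set of features to re-test in step 2 and to share in step 3.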

shntnu commented 4 years ago

This code snippet computes replicate correlations for each feature. The result is attached. Somewhat sparse documentation of the function is here

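# tidyverse provides %>%, map_df, read_csv, str_subset, str_detect, filter, and write_csv
library(tidyverse)

# plates whose profiles are included
plates <-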
  c(
    "BR00106708",
    "BR00106709",
    "BR00107338",
    "BR00107339",
    "cmqtlpl1.5-31-2019-mt",
    "cmqtlpl261-2019-mt"
  )

profile_files <-
  file.path("1.profile-cell-lines/profiles/",
            paste(plates, "augmented.csv", sep = "_"))

profiles <- profile_files %>% map_df(read_csv)

replicate_correlation_values <- 
  cytominer::replicate_correlation(
    profiles, 
    names(profiles) %>% str_subset("Cells_|Cytoplasm_|Nuclei_"), 
    strata = "Metadata_line_ID", 
    replicates = 8, 
    split_by = "Metadata_Plate",
    cores = 8)

# drop Manders, Costes, and features that measure the Z axis
# ("_Z$" rather than "_Z", which would also match and drop the Zernike features)
replicate_correlation_values %>% 
  filter(!str_detect(variable, "Costes|Manders|_Z_|_Z$")) %>%
  write_csv("replicate_correlation_values.csv")

replicate_correlation_values.txt

And here's a quick way to see how this could be useful when inspecting the Zernike features of the cell.

replicate_correlation_values %>% 
  filter(str_detect(variable, "Cells_AreaShape_Zernike")) %>% 
  separate("variable", c("x1", "x2", "x3", "n", "m")) %>% 
  ggplot(aes(n, m, size = median, label = sprintf("%.2f", median))) + 
  geom_label()

So, very roughly, I'd be worried if you had 8_0 or 8_4 showing up as having a strong genetic basis.
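A check like this could also be applied to a list of candidate features by flagging those whose median replicate correlation falls below a cutoff. The sketch below is a minimal base R illustration: the helper name `flag_low_reproducibility`, the 0.5 cutoff, the feature names, and the values are all made up for the example. It only assumes a table with `variable` and `median` columns, as used in the snippets above.

```r
# Hypothetical helper: return the candidate features whose median replicate
# correlation is below `cutoff`, lowest first.
# The default cutoff of 0.5 is an assumption chosen only for illustration.
flag_low_reproducibility <- function(values, candidates, cutoff = 0.5) {
  hits <- values[values$variable %in% candidates & values$median < cutoff, ]
  hits[order(hits$median), ]
}

# Synthetic stand-in for replicate_correlation_values (values are made up)
toy_values <- data.frame(
  variable = c("Feature_A", "Feature_B", "Feature_C"),
  median = c(0.82, 0.12, 0.47),
  stringsAsFactors = FALSE
)

# Suppose Feature_A and Feature_B came up in the association analysis
flagged <- flag_low_reproducibility(
  toy_values,
  candidates = c("Feature_A", "Feature_B")
)
print(flagged)
```

Flagged features are not necessarily wrong, but they are the ones worth a closer look before interpreting an association.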

broadinstitute/cmQTL issue #44