broadinstitute / cmQTL

High-dimensional phenotyping to define the genetic basis of cellular morphology
BSD 3-Clause "New" or "Revised" License

May 18 2020 Discussions (NaN-valued and zero-valued features) #43

Closed shntnu closed 2 years ago

shntnu commented 4 years ago

Let's use this thread to discuss questions from today @sasgari @jatinarora-upmc.

shntnu commented 4 years ago

@sasgari Could you summarize some of the issues you had with the NaN-valued and zero-valued features?

sasgari commented 4 years ago

Basically, what I see is that about 3% of the cells have more than 95% of their features equal to zero. The percentage of NaNs is negligible. The zeros might come from bad registration of the data, or from features that are expected to have zero values as a measurement. What inclined me to treat a high zero percentage as noise was the fact that only a small portion of the cells had a high zero percentage (the latter).
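A minimal sketch of how the per-cell zero and NaN fractions described above could be computed. The toy table, its column names, and the 0.95 cutoff variable are assumptions for illustration, not the actual cmQTL data or pipeline. Looking at the per-feature fraction as well may help separate "bad cells" from features that are legitimately zero for most cells.

```r
# Toy stand-in for a single-cell feature table (rows = cells, columns = features).
# Column names and values are made up for illustration.
set.seed(1)
cells <- as.data.frame(matrix(rnorm(20 * 10), nrow = 20))
names(cells) <- paste0("Feature_", seq_len(10))
cells[1, ] <- 0          # one cell whose features are all zero
cells[2, 1:3] <- NaN     # one cell with a few NaN features

feature_matrix <- as.matrix(cells)

# Per-cell fraction of features that are exactly zero (NaN is not counted as zero).
frac_zero <- rowMeans(!is.na(feature_matrix) & feature_matrix == 0)

# Per-cell fraction of features that are NaN.
frac_nan <- rowMeans(is.nan(feature_matrix))

# Flag cells where more than 95% of the features are zero (assumed cutoff).
zero_cutoff <- 0.95
flagged <- frac_zero > zero_cutoff
cat("Fraction of flagged cells:", mean(flagged), "\n")

# Per-feature fraction of zeros: a feature that is zero in most cells is more
# likely zero by design than because of a few bad cells.
feature_frac_zero <- colMeans(!is.na(feature_matrix) & feature_matrix == 0)
```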

shntnu commented 4 years ago

Notes from Samira's email:

Beth, you mentioned that some features shouldn’t be included in the association analysis. Can you remind me what features they were (I attach the features figure below)? Are there any other features that you think we should exclude?

Beth, you also mentioned Nuclei_children should be 1 for all cells, I will inspect that and will get back to the group.

Finally do you guys have suggestions about which features to use to assign cell cycle to cells and how to best summarize those features?

PastedGraphic-1

shntnu commented 4 years ago

@sasgari It's worth reading through "What do Cell Painting features mean?" for an overview.

Beth, you mentioned that some features shouldn’t be included in the association analysis. Can you remind me what features they were (I attach the features figure below)? Are there any other features that you think we should exclude?

I think @bethac07 was referring to Nuclei_Location_MaxIntensity_X_Brightfield, described here: Location_MaxIntensity_X, Location_MaxIntensity_Y: The (X,Y) coordinates of the pixel with the maximum intensity within the object.

Beth, you also mentioned Nuclei_children should be 1 for all cells, I will inspect that and will get back to the group.

A quick way to check is to query this sample of ~5000 cells generated here.

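library(tidyverse)

# Load the sample of ~5000 single cells
sampled_cells <- read_csv("1.profile-cell-lines/data/cmQTLplate7-2-27-20_sampled.csv.gz")

# Print the summary statistics of a numeric vector as a two-column table
show_summary <- function(x) {
  x %>% 
    summary() %>%
    broom::tidy() %>% 
    pivot_longer(everything()) %>% 
    knitr::kable()
}

# Every statistic should be 1 if each nucleus has exactly one cytoplasm child
show_summary(sampled_cells$Nuclei_Children_Cytoplasm_Count)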
|name    | value|
|:-------|-----:|
|minimum |     1|
|q1      |     1|
|median  |     1|
|mean    |     1|
|q3      |     1|
|maximum |     1|
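A summary of all 1s already answers the question for this sample. As a complementary sketch, the values can also be counted directly so that any cell deviating from 1 is easy to locate. The toy vector below is made up and deliberately contains deviations; it is not taken from the sampled cells.

```r
# Toy stand-in for sampled_cells$Nuclei_Children_Cytoplasm_Count (made-up values).
children <- c(1, 1, 1, 2, 1, 0, 1)

# Frequency of each observed value, including NA if any are present.
counts <- table(children, useNA = "ifany")
print(counts)

# Number and row positions of cells that do not have exactly one child.
n_unexpected <- sum(children != 1, na.rm = TRUE)
which_unexpected <- which(children != 1)
cat("Cells with a count other than 1:", n_unexpected, "\n")
```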

Finally do you guys have suggestions about which features to use to assign cell cycle to cells and how to best summarize those features?

(will update this comment)

shntnu commented 4 years ago

Finally do you guys have suggestions about which features to use to assign cell cycle to cells and how to best summarize those features?

We have used Nuclei_Intensity_IntegratedIntensity_DNA, e.g., in the top right of this figure: "...First, a histogram of single-cell DNA content is shown for all cells from all genes/allele treatments in the cluster, indicating the overall cell cycle distribution."

image
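A rough sketch of the DNA-content histogram idea, using simulated intensities rather than real data. The two-population simulation, the log2 transform, and the crude "peak + 0.5" split into 2N-like and 4N-like cells are all assumptions made here for illustration; the thread only states that a histogram of Nuclei_Intensity_IntegratedIntensity_DNA was used to show the cell cycle distribution.

```r
set.seed(42)

# Simulated integrated DNA intensity: a 2N-like and a 4N-like population
# (made-up values, not real measurements).
dna <- c(rnorm(700, mean = 100, sd = 8), rnorm(300, mean = 200, sd = 16))
cells <- data.frame(Nuclei_Intensity_IntegratedIntensity_DNA = dna)

# On a log2 scale, a doubling of DNA content is a shift of 1.
log_dna <- log2(cells$Nuclei_Intensity_IntegratedIntensity_DNA)
h <- hist(log_dna, breaks = 50, plot = FALSE)

# Draw the histogram only in an interactive session.
if (interactive()) {
  plot(h, main = "Single-cell DNA content", xlab = "log2(integrated DNA intensity)")
}

# Crude split (assumption): take the tallest bin as the 2N peak and call
# anything more than half a doubling above it "4N-like".
peak_2n <- h$mids[which.max(h$counts)]
cells$dna_class <- ifelse(log_dna > peak_2n + 0.5, "4N-like", "2N-like")

print(table(cells$dna_class))
```

A real analysis would likely need a more careful model of the G1, S, and G2/M populations than a single threshold.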

bethac07 commented 4 years ago

@shntnu I think anything with "Location" or "Center" in it is probably bad and should not be used.
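A minimal sketch of dropping such features by name before the association analysis. The toy data frame is made up; some column names come from this thread, and the others are illustrative examples of Location/Center features.

```r
# Toy feature table (values are made up; column names are illustrative).
cells <- data.frame(
  Nuclei_Location_MaxIntensity_X_Brightfield = c(10, 20),
  Nuclei_Location_Center_X = c(5, 6),
  Cells_AreaShape_Center_Y = c(7, 8),
  Nuclei_Intensity_IntegratedIntensity_DNA = c(100, 200),
  Nuclei_Children_Cytoplasm_Count = c(1, 1)
)

# Identify columns whose name contains "Location" or "Center".
drop <- grepl("Location|Center", names(cells))

# Keep everything else; drop = FALSE preserves the data frame structure.
filtered <- cells[, !drop, drop = FALSE]

cat("Dropped:", paste(names(cells)[drop], collapse = ", "), "\n")
cat("Kept:", paste(names(filtered), collapse = ", "), "\n")
```

With the tidyverse already loaded, `select(cells, -matches("Location|Center"))` would be an equivalent one-liner.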

sasgari commented 4 years ago

@bethac07 and @shntnu thanks for the responses and clarifications. I will get back to you in the next few days with updated info about features and association analyses.

shntnu commented 4 years ago

I haven't had time to properly document it yet, but this notebook will be useful once I do: https://rpubs.com/shantanu/617006