broadinstitute / cmQTL

High-dimensional phenotyping to define the genetic basis of cellular morphology
BSD 3-Clause "New" or "Revised" License

Clarifications about Cell Painting data #32

Open shntnu opened 4 years ago

shntnu commented 4 years ago

This thread is to address general questions about Cell Painting data. Discuss dataset-specific and analysis-specific issues in a separate thread.

@sasgari asked:

Besides the missing features, some features have zero or negative values. I am not sure whether these zero and/or negative values are valid measurements or should be treated as missing.

cc @jatinarora-upmc

shntnu commented 4 years ago

> Besides the missing features, some features have zero or negative values. I am not sure whether these zero and/or negative values are valid measurements or should be treated as missing.

Feature values can indeed be negative. In fact, features can have very different distributions at the single-cell level; see, for example, this figure:

image

sasgari commented 4 years ago

Thanks @shntnu!

shntnu commented 4 years ago

@sasgari Note that this applies to single-cell level data, of course. Aggregated or "pseudo-bulk" profiles will be distributed differently: they follow the sampling distribution of the corresponding statistic (e.g. the mean or median).
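As a minimal sketch of that point (toy values, not taken from the cmQTL data; the well names and the `aggregate` helper are hypothetical), aggregating single cells per well yields one value per well, and that value depends on the chosen statistic rather than mirroring the single-cell distribution:

```python
from statistics import mean, median

# Hypothetical single-cell values of one feature, keyed by well.
# Single-cell values can be skewed and can be negative.
single_cells = {
    "A01": [-1.5, 0.0, 0.2, 3.0, 10.0],
    "A02": [-0.5, 0.1, 0.4, 0.6, 8.0],
}


def aggregate(cells_by_well, statistic):
    """Collapse single-cell values into one profile value per well."""
    return {well: statistic(values) for well, values in cells_by_well.items()}


median_profiles = aggregate(single_cells, median)
mean_profiles = aggregate(single_cells, mean)
print(median_profiles)
print(mean_profiles)
```

In this toy case the mean is pulled up by the one large value in each well, while the median is not, which is one reason the two kinds of aggregated profile can look quite different.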

shntnu commented 4 years ago

@jatinarora-upmc asked - what are Costes features?

These are features used to measure the correlation between channels (in Cell Painting, each channel corresponds to one stain, except for the AGP channel, which corresponds to two stains).

There are many ways to measure correlation between channels. Costes' method evaluates the correlation among the pixels below each candidate threshold, and then selects either the threshold with the minimum correlation or the highest threshold with a non-positive correlation (from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5200903/).
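A simplified sketch of the "highest threshold with a non-positive correlation" search may help. This is not CellProfiler's implementation: the function names, the candidate list, the use of a least-squares line to tie the two channel thresholds together, and the choice to count a pixel as "below" when it is under either threshold are all assumptions made for illustration, and the pixel values are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation, or None when either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def costes_threshold(ch1, ch2, candidates):
    """Return (t1, t2, r) for the highest candidate threshold whose
    below-threshold pixels have a non-positive correlation, else None."""
    # Assumption: a least-squares line ch2 = slope * ch1 + intercept
    # maps each channel-1 threshold to a channel-2 threshold.
    n = len(ch1)
    m1, m2 = sum(ch1) / n, sum(ch2) / n
    sxy = sum((x - m1) * (y - m2) for x, y in zip(ch1, ch2))
    sxx = sum((x - m1) ** 2 for x in ch1)
    slope = sxy / sxx
    intercept = m2 - slope * m1
    for t1 in sorted(candidates, reverse=True):
        t2 = slope * t1 + intercept
        # Assumption: "below" means under either channel's threshold.
        below = [(x, y) for x, y in zip(ch1, ch2) if x < t1 or y < t2]
        if len(below) < 3:
            continue
        r = pearson([x for x, _ in below], [y for _, y in below])
        if r is not None and r <= 0:
            return t1, t2, r
    return None


# Made-up pixels: three background pixels plus four co-localized signal pixels.
ch1 = [0, 1, 2, 10, 12, 14, 16]
ch2 = [2, 1, 0, 10, 12, 14, 16]
result = costes_threshold(ch1, ch2, [16, 14, 11, 5, 3])
print(result)
```

With these values the search walks down from the brightest candidate and stops at the first threshold under which only the uncorrelated background pixels remain.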

shntnu commented 4 years ago

@jatinarora-upmc asked - The nucleus is identified using the DNA channel, but the cell is identified using the nucleus and the cytoplasmic RNA channel. Why are cells not identified using the plasma membrane channel?

From: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5223290/:

First, the nuclei are identified from the Hoechst image because it is a high-contrast stain for a well-separated organelle; subsequently, the nucleus along with an appropriate channel is used to delineate the cell body [49]. We have found the SYTO 14 image is the most amenable for finding cell edges, as it has fairly distinct boundaries between touching cells.

We did use AGP in the past but switched to RNA later.

shntnu commented 4 years ago

What do Cell Painting features mean? Learn more here.

jatinarora-upmc commented 4 years ago

@shntnu the number of adjacent neighbors (Cells_Neighbors_NumberOfNeighbors_Adjacent) for isolated cells is 0, but Cells_Neighbors_PercentTouching_Adjacent is not 0. This is confusing, as both should be 0. Could you please help us understand this? Keeping @sasgari in the loop as well.

shntnu commented 4 years ago

I assume you are looking at single cells? That relationship won't hold at the aggregate level.

For single-cell data, I looked at the sample of 4994 cells in this repo and found that this anomaly occurs only once. Do you see it more often? If so, I can probe further.

library(dplyr)
library(tidyr)

# Keep cells where exactly one of the two measurements is zero,
# i.e. where the two "isolated" criteria disagree.
sampled_cells %>%
  select(
    Cells_Neighbors_NumberOfNeighbors_Adjacent,
    Cells_Neighbors_PercentTouching_Adjacent
  ) %>%
  filter(
    xor(
      Cells_Neighbors_NumberOfNeighbors_Adjacent == 0,
      Cells_Neighbors_PercentTouching_Adjacent == 0
    )
  ) %>%
  pivot_longer(everything())

name                                        value
Cells_Neighbors_NumberOfNeighbors_Adjacent  0.0000000
Cells_Neighbors_PercentTouching_Adjacent    0.5747126

jatinarora-upmc commented 4 years ago

Actually, I averaged the data from single cells to donor level for each plate individually, and Cells_Neighbors_PercentTouching_Adjacent is non-zero for isolated cells on all plates.

shntnu commented 4 years ago

> Actually, I averaged the data from single cells to donor level for each plate individually, and Cells_Neighbors_PercentTouching_Adjacent is non-zero for isolated cells on all plates.

That's definitely odd, but I wonder if it might be something in your code? As you see below, the anomaly occurs only once among the 288 cells with no adjacent neighbors (I can't explain it without more digging, but it certainly is a rare event: < 0.5% in this sample).

# Total number of sampled cells.
sampled_cells %>% tally()

n
4994

# Among cells with no adjacent neighbors, tabulate the PercentTouching values.
sampled_cells %>%
  select(
    Cells_Neighbors_NumberOfNeighbors_Adjacent,
    Cells_Neighbors_PercentTouching_Adjacent
  ) %>%
  filter(Cells_Neighbors_NumberOfNeighbors_Adjacent == 0) %>%
  group_by(Cells_Neighbors_PercentTouching_Adjacent) %>%
  tally()

Cells_Neighbors_PercentTouching_Adjacent    n
0.0000000                                 287
0.5747126                                   1

jatinarora-upmc commented 4 years ago

You are right. I checked one plate, cmqtlpl261-2019, and it has ~22k isolated cells (Cells_Neighbors_NumberOfNeighbors_Adjacent == 0), of which 145 show the anomaly (Cells_Neighbors_PercentTouching_Adjacent != 0); these are present in all donors/cell lines. That is why Cells_Neighbors_PercentTouching_Adjacent becomes non-zero when I average the single-cell features to the donor level. So, all set for now. BTW, what is the reason for this anomaly?
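To make the averaging effect concrete, here is a small sketch with made-up numbers (the cell counts and the `summarize_isolated` helper are hypothetical, not taken from the plate above): a single anomalous cell is enough to make a donor-level mean non-zero.

```python
from statistics import mean

# Hypothetical (NumberOfNeighbors_Adjacent, PercentTouching_Adjacent) pairs
# for the cells of one donor: mostly consistent, with one anomalous cell.
cells = [(0, 0.0)] * 199 + [(0, 0.5747126)] + [(2, 35.0)] * 50


def summarize_isolated(cells):
    """Count isolated cells, count anomalies, and average PercentTouching."""
    isolated = [pct for n_neighbors, pct in cells if n_neighbors == 0]
    anomalous = [pct for pct in isolated if pct != 0]
    return len(isolated), len(anomalous), mean(isolated)


n_isolated, n_anomalous, mean_pct = summarize_isolated(cells)
print(n_isolated, n_anomalous, mean_pct)
```

So a non-zero donor-level average is compatible with the anomaly being rare at the single-cell level.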

shntnu commented 4 years ago

I think it's to do with their position. This one cell seems to lie on the edge of the image and something funky must be happening to the calculation of the percentage. You can safely ignore this case (i.e. consider Cells_Neighbors_NumberOfNeighbors_Adjacent to be correct, and ignore Cells_Neighbors_PercentTouching_Adjacent)

sampled_cells %>% 
  filter(Cells_Neighbors_NumberOfNeighbors_Adjacent == 0) %>%
  ggplot(aes(Cells_Neighbors_PercentTouching_Adjacent == 0,
             Cells_Location_Center_X)) + 
  geom_boxplot()

image

bethac07 commented 4 years ago

So based on this conversation, this is my guess: those cells DO have neighbors, but the neighbors are cells that are ultimately excluded for touching the edge of the image. So the cell does indeed have 1) some % of its border touching another cell, but also 2) 0 "accepted" neighbors.
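A toy illustration of that guess, on a tiny label image. This is only a sketch of the hypothesis, not CellProfiler's MeasureObjectNeighbors code: the `neighbor_stats` function, the 4-connectivity rule, and the way edge-touching objects are dropped from the neighbor count are assumptions.

```python
def neighbor_stats(labels, target):
    """Return (accepted_neighbor_count, percent_touching) for one object.

    labels is a 2D list of integer object labels, 0 being background.
    """
    n_rows, n_cols = len(labels), len(labels[0])
    # Assumption: objects with any pixel on the image border are excluded.
    edge_objects = {
        labels[r][c]
        for r in range(n_rows)
        for c in range(n_cols)
        if labels[r][c] != 0 and (r in (0, n_rows - 1) or c in (0, n_cols - 1))
    }
    boundary = touching = 0
    neighbors = set()
    for r in range(n_rows):
        for c in range(n_cols):
            if labels[r][c] != target:
                continue
            adjacent = [
                labels[rr][cc]
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < n_rows and 0 <= cc < n_cols
            ]
            # A boundary pixel has at least one 4-neighbor outside the object.
            if len(adjacent) < 4 or any(a != target for a in adjacent):
                boundary += 1
                others = {a for a in adjacent if a not in (0, target)}
                if others:
                    touching += 1
                    neighbors |= others
    accepted = neighbors - edge_objects
    return len(accepted), 100.0 * touching / boundary


# Object 2 touches the image border; object 1 does not, but touches object 2.
labels = [
    [2, 2, 0, 0, 0],
    [2, 2, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
stats = neighbor_stats(labels, 1)
print(stats)
```

Under these assumptions, object 1 ends up with zero accepted neighbors while a quarter of its boundary pixels still touch another object, which is the same pattern as the anomalous cells.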


shntnu commented 4 years ago

Thanks @bethac07

Here's the MeasureObjectNeighbors documentation for our reference.

NumberOfNeighbors: Number of neighbor objects.

PercentTouching: Percent of the object’s boundary pixels that touch neighbors, after the objects have been expanded to the specified distance. Note that this measurement is only available if you use the same set of objects for both objects and neighbors.

@jatinarora-upmc Given that this is an edge case (literally as well!), it doesn't really matter how we handle it. But if you wanted to be really rigorous, you'd modify the definition of isolated to be Cells_Neighbors_PercentTouching_Adjacent == 0

shntnu commented 4 years ago

Soumya had asked what Zernike features mean. Here are my quick notes that I sent via email.


Briefly, these features represent subtle properties of shape, and the higher the index, the more nuanced the shape (e.g. Zernike 9 is more nuanced than Zernike 8).

Less briefly: You can represent any 2D function as a linear combination of the orthogonal basis functions defined by the Zernike polynomials (all the way below). Both a cell and its shape can be thought of as 2D functions.

Take any cell below, and

  1. look at the top image: you can think of this as a 2D function where X,Y is the location of a pixel, and f(X,Y) is the brightness of the pixel.
  2. look at the corresponding cell in the bottom image: you can also think of this as a 2D function where X,Y is the location of a pixel, and f(X,Y) is 1 if you are inside the cell and 0 if you are outside the cell.

Cells_AreaShape_Zernike_9_1 is a shape feature, so here you have a shape (a binary, 0-or-1, 2D function) that needs to be decomposed into its components using the Zernike basis. CellProfiler does that for you and reports the coefficients as shape features.
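Here is a rough sketch of that decomposition for a binary shape, using the textbook definition of the Zernike polynomials on the unit disk. It is an illustration only: the function names, the grid size, and the normalization are assumptions, and CellProfiler's own scaling of the shape and of the coefficients may differ.

```python
import cmath
from math import atan2, factorial, hypot, pi


def radial_poly(n, m, rho):
    """Zernike radial polynomial R_n^|m|(rho); needs n - |m| even and >= 0."""
    m = abs(m)
    return sum(
        (-1) ** k * factorial(n - k)
        / (factorial(k) * factorial((n + m) // 2 - k) * factorial((n - m) // 2 - k))
        * rho ** (n - 2 * k)
        for k in range((n - m) // 2 + 1)
    )


def zernike_coefficient(mask, n, m, size=64):
    """Project a binary shape onto the Zernike basis function V_n^m.

    mask(x, y) returns 1 inside the shape and 0 outside, for (x, y) in the
    unit disk. The integral is approximated on a size x size pixel grid.
    """
    pixel_area = (2.0 / size) ** 2
    total = 0j
    for i in range(size):
        for j in range(size):
            x = (2 * i + 1) / size - 1.0  # pixel-center coordinates in [-1, 1]
            y = (2 * j + 1) / size - 1.0
            rho = hypot(x, y)
            if rho > 1.0 or not mask(x, y):
                continue
            theta = atan2(y, x)
            total += radial_poly(n, m, rho) * cmath.exp(-1j * m * theta)
    return (n + 1) / pi * total * pixel_area


def disk(x, y):
    """A shape that fills the whole unit disk."""
    return 1


def ellipse(x, y):
    """An elongated shape: semi-axes 1 (along x) and 0.5 (along y)."""
    return 1 if x ** 2 + (y / 0.5) ** 2 <= 1.0 else 0


for name, shape in (("disk", disk), ("ellipse", ellipse)):
    magnitudes = {
        nm: abs(zernike_coefficient(shape, *nm)) for nm in ((0, 0), (1, 1), (2, 2))
    }
    print(name, magnitudes)
```

For the perfect disk only the (0, 0) term is non-negligible, whereas the elongated ellipse picks up a clear (2, 2) component; higher-order terms capture progressively finer deviations from a circle in the same way.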

Another intuition that's helpful is the regular notion of moments in stats: you can use higher-order moments to describe more nuanced aspects of a distribution; same thing with shape.

Yet another (precise) intuition is that you are doing a power series expansion of a 2D function.



@AnneCarpenter's explanation from https://github.com/broadinstitute/cmQTL/issues/63#issuecomment-742142983

Q2: Here is a guide to the Zernikes: https://en.wikipedia.org/wiki/File:Zernike_polynomials2.png

Zernike0_0 should honestly have almost perfect correlation with one of the more commonly named shape metrics, because it's really asking whether the cell matches a circle shape.

For 3_1, you look at that pyramid for the one that says Z with a 1 on top and a 3 on the bottom (I think). You can see it has a red and blue stripe at the edges, and a red and blue blob in the middle. What this means: picture the shape of the cell superimposed on top… it will score high for this Zernike the more blue is covered and the more red you see. Our cells aren't allowed to have holes in them, so I can imagine two cell shapes that would score highly: one is almost a perfect circle but just a little flattened at the red side. The other would be almost a crescent such that the middle red blob is exposed (but it’s not a great fit because a big chunk wouldn’t align well).

6_4 isn’t shown, but you can follow the right-hand side of the pyramid and see it would be mostly a circle with wiggly edges (probably not far off from a circle!). I'm a bit surprised that they'd be anticorrelated to 0_0, really.

jatinarora-upmc commented 4 years ago

Hi @shntnu, I was wondering if I could skip the RadialDistribution features (all, or some such as FracAtD). They show the distribution of total intensity, but I cannot decide because I don't have much of a functional interpretation for these features. What would be your recommendation?

shntnu commented 4 years ago

RadialDistribution features have been pretty informative in past experiments, so I would not advise dropping them. See https://forum.image.sc/t/radial-distribution-module/17272 for an explanation.
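For a rough intuition, a fraction-of-intensity-per-ring feature can be sketched as below. This is an assumption-laden toy, not CellProfiler's code: the `frac_at_d` function, the normalized radius (0 at the object center, 1 at its edge), the equal-width bins, and the pixel values are all made up for illustration.

```python
def frac_at_d(pixels, n_bins):
    """Fraction of total intensity falling in each radial bin.

    pixels is a list of (normalized_radius, intensity) pairs, where the
    radius runs from 0 (object center) to 1 (object edge).
    """
    totals = [0.0] * n_bins
    for radius, intensity in pixels:
        # Clamp so that radius == 1.0 lands in the outermost bin.
        index = min(int(radius * n_bins), n_bins - 1)
        totals[index] += intensity
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


# Hypothetical pixels of one object: brighter near the center.
pixels = [(0.1, 4.0), (0.3, 2.0), (0.6, 3.0), (0.9, 1.0)]
fractions = frac_at_d(pixels, 4)
print(fractions)
```

Read this way, such features describe where in the cell a stain is concentrated (e.g. near the nucleus versus at the periphery), which is one plausible reason they carry signal.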