Process new batch 2020_03_05_Batch6

shntnu commented 4 years ago

Conclusion: We decided to repeat this plate. This comment https://github.com/broadinstitute/cmQTL/issues/30#issuecomment-618480494 made a strong argument for doing so.

Images should be copied to /imaging/analysis/2018_06_05_cmQTL/2020_03_05_Batch6

See https://broadinstitute.atlassian.net/wiki/spaces/IP/pages/800424256/Process+for+exporting+images+from+the+CDOT+microscopes+using+Harmony for instructions.

mtegtmey commented 4 years ago

All done!

shntnu commented 4 years ago

Images are currently being copied to S3

jatinarora-upmc commented 4 years ago

hi @shntnu , is the data for this 7th plate ready and accessible?

shntnu commented 4 years ago

> hi @shntnu , is the data for this 7th plate ready and accessible?

Not yet - I'll look up the status and loop back

shntnu commented 4 years ago

@jatinarora-upmc This should be ready by Mar 25

jatinarora-upmc commented 4 years ago

@shntnu alright. thanks

shntnu commented 4 years ago

@jatinarora-upmc Nearly all files for the new batch are available in #31. The colony and isolated versions are pending. Perhaps you could start inspecting this data in the meantime? I noticed that a lot of wells have very few cells, but I have not inspected further.

shntnu commented 4 years ago

> The colony and isolated versions are pending.

I have skipped creating these files because you are currently not using them.

(we need to fix #16 before we can reliably create those files)

jatinarora-upmc commented 4 years ago

hi @shntnu , I am going through this plate 7. I noticed this plate has ~23% of cells with >5% missing features, while this rate was 3-7% on other plates. Any idea what could be causing this?
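
For reference, a rate like the one quoted above could be computed along these lines. This is a minimal base R sketch on a toy data frame, not the script actually used in this project; the column values and the helper name `frac_cells_missing` are hypothetical, and only the 5% threshold comes from the comment above.

```r
# Toy single-cell table: rows are cells, columns are features (hypothetical values).
cells <- data.frame(
  Cells_AreaShape_Area = c(100, 120, NA, 90),
  Cells_Correlation_Costes_ER_Mito = c(0.5, NA, NA, 0.4),
  Nuclei_AreaShape_Area = c(30, 35, 28, 31)
)

# Fraction of cells whose share of missing features exceeds `threshold`.
# (Hypothetical helper, named here for illustration only.)
frac_cells_missing <- function(df, threshold = 0.05) {
  na_share_per_cell <- rowMeans(is.na(df))  # share of NA features for each cell
  mean(na_share_per_cell > threshold)       # share of cells above the threshold
}

rate <- frac_cells_missing(cells)
print(rate)
```

On real data, `cells` would hold only the feature columns of one plate, with the metadata columns excluded before calling the helper.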

shntnu commented 4 years ago

The cell counts are definitely very low for that plate (cmqtlpl1.5-31-2019-mt)


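library(tidyverse)  # purrr::map_df, readr::read_csv, dplyr, ggplot2, forcats
library(glue)       # glue() for building file names
library(magrittr)   # %<>% assignment pipe

plates <- c("cmqtlpl1.5-31-2019-mt",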
            "cmqtlpl261-2019-mt",
            "BR00106708",
            "BR00106709",
            "BR00107338",
            "BR00107339",
            "cmQTLplate7-2-27-20")

counts <- 
  map_df(
    plates,
    function(plate) {
      read_csv(
        file.path("profiles", glue("{plate}_count.csv"))
      ) %>%
        distinct()
    }
  )

metadata <- 
  map_df(
    plates,
    function(plate) {
      read_csv(
        file.path("profiles", glue("{plate}_augmented.csv")),
        col_types = cols_only(
          Metadata_Plate = "c",
          Metadata_Well = "c",
          Metadata_Assay_Plate_Barcode = "c",
          Metadata_Plate_Map_Name = "c",
          Metadata_well_position = "c",
          Metadata_plating_density = "c",
          Metadata_line_ID = "c"
        )
      ) %>%
        distinct()
    }
  )

counts %<>% inner_join(metadata)

counts %>% 
  ggplot(aes(Metadata_Plate, Count_Cells)) + geom_boxplot() + coord_flip()

Still diagnosing…

shntnu commented 4 years ago

And looking at the plate alone, definitely something amiss

jatinarora-upmc commented 4 years ago

@shntnu any idea if the root of this problem lies somewhere in image processing (e.g. segmentation) or in the wet lab work on the plate?

mtegtmey commented 4 years ago

To me, this would likely be an issue on the wetlab side of things.

If you feel like this data is unusable through the QC steps let me know and I can coordinate with Emily to see if she had any notes about mishaps with this plate.

On Apr 10, 2020, at 5:13 PM, Jatin Arora notifications@github.com wrote:

> @shntnu any idea if the root of this problem lies somewhere in image processing (e.g. segmentation) or in the wet lab work on the plate?

shntnu commented 4 years ago

More probing

I removed all annotations to reduce clutter

The 3 lines are the 25th, 50th, and 75th percentiles of cell counts across all wells; their values are 774, 1629, and 2752, respectively

(q25 <- quantile(counts$Count_Cells, .25, names = FALSE))
(q50 <- quantile(counts$Count_Cells, .50, names = FALSE))
(q75 <- quantile(counts$Count_Cells, .75, names = FALSE))

counts %>%
  ggplot(aes(fct_reorder(Metadata_line_ID, Count_Cells), Count_Cells)) + 
  geom_boxplot() + 
  geom_hline(yintercept = q25, color = "gray") +
  geom_hline(yintercept = q50, color = "gray") +
  geom_hline(yintercept = q75, color = "gray") +
  facet_wrap(~Metadata_Plate, scales = "free_x") +
  theme_void()

shntnu commented 4 years ago

> To me, this would likely be an issue on the wetlab side of things. If you feel like this data is unusable through the QC steps let me know and I can coordinate with Emily to see if she had any notes about mishaps with this plate.

That would be great, @mtegtmey. I haven't looked into the images, but it would be good to know if Emily has some notes.

mtegtmey commented 4 years ago

After talking with Emily, she mentioned something about a change in pressure on the liquid handler when adding PFA to the samples. However she said she stopped it about halfway through, which would account for the ubiquitous drop in cell counts (only half the plate would be low, hypothetically). The most likely issue is a miscalculation of the cell counts, or the time that elapsed during the upstream cell culture work, which caused more cells to sink to the bottom of the plate. Even when re-suspending them prior to plating, they may not have all been mixed well.

shntnu commented 4 years ago

Thanks @mtegtmey - glad to know there's some explanation for this. Meanwhile, I'm making some notes below in case someone from our end can dig into the images

Goal: To figure out whether there is anything amiss in the images (other than low cell count) that may have led to the issue that @jatinarora-upmc described https://github.com/broadinstitute/cmQTL/issues/30#issuecomment-611036214

I ran the analysis and illum pipelines (steps here). I did not run the QC pipeline.
The cell counts of the plate are very low
Here is the analysis folder of a sample image
Download this sample image like this:

    parallel aws s3 cp s3://imaging-platform/projects/2018_06_05_cmQTL/2020_03_05_Batch6/images/cmQTLplate7-2-27-20__2020-03-04T16_40_12-Measurement1/Images/r01c01f01p01-ch{1}sk1fk1fl1.tiff . ::: 1 2 3 4 5 6

Here is the load data
Here is the channel mapping from load_data.csv

name	value
FileName_OrigRNA	r01c01f01p01-ch3sk1fk1fl1.tiff
FileName_OrigER	r01c01f01p01-ch4sk1fk1fl1.tiff
FileName_OrigAGP	r01c01f01p01-ch2sk1fk1fl1.tiff
FileName_OrigMito	r01c01f01p01-ch1sk1fk1fl1.tiff
FileName_OrigBrightfield	r01c01f01p01-ch6sk1fk1fl1.tiff
FileName_OrigDNA	r01c01f01p01-ch5sk1fk1fl1.tiff

Here is the channel map from Index.idx.xml

    <Map>
      <Entry ChannelID="1">
        <FlatfieldProfile>{Background: {Character: NonFlat, Mean: 289.00625, NoiseConst: 5.897988, NonFlatness: {Corrected: 0.090252958, Original: 0.45296502, Random: 0.029437569}, Profile: {Coefficients: [[1.1471], [-0.0114, -0.0195], [-0.8175, 0.248, -0.8125], [0.0229, 0.4083, -0.1066, -0.0519], [-0.8331, -0.489, 0.6704, -0.2165, -0.4433]], Dims: [2160, 2160], Origin: [1079.5, 1079.5], Scale: [0.00046296296, 0.00046296296], Type: Polynomial}, Quality: 1.0}, Channel: 1, ChannelName: Alexa 647, Foreground: {Character: NonFlat, NonFlatness: {Original: 0.71276712, Random: 0.062127856}, Profile: {Coefficients: [[1.2259], [0.0932, -0.3422], [-1.1153, 0.4186, -2.0113], [0.4411, 1.4414, -0.7732, 0.82], [-0.2503, -0.342, 0.0903, -0.0531, 2.9682]], Dims: [2160, 2160], Origin: [1079.5, 1079.5], Scale: [0.00046296296, 0.00046296296], Type: Polynomial}, Quality: 1.0}, Version: Acapella:2013}</FlatfieldProfile>
      </Entry>
      <Entry ChannelID="2">
        <FlatfieldProfile>{Background: {Character: NonFlat, Mean: 326.66138, NoiseConst: 7.6508897, NonFlatness: {Corrected: 0.19489469, Original: 0.63753444, Random: 0.022788157}, Profile: {Coefficients: [[1.1815], [-0.2002, -0.0244], [-1.1124, 0.3297, -1.2384], [0.4434, 0.7452, -0.2116, -0.0575], [0.2517, -0.3976, 0.6822, -0.0854, 0.5229]], Dims: [2160, 2160], Origin: [1079.5, 1079.5], Scale: [0.00046296296, 0.00046296296], Type: Polynomial}, Quality: 1.0}, Channel: 2, ChannelName: Alexa 568, Foreground: {Character: NonFlat, NonFlatness: {Original: 0.75699329, Random: 0.045633834}, Profile: {Coefficients: [[1.2525], [-0.0642, -0.2533], [-1.2709, 0.2196, -2.2142], [0.5299, 1.3018, -0.5135, 0.3411], [-0.6716, 0.497, 1.6567, 0.1297, 2.7825]], Dims: [2160, 2160], Origin: [1079.5, 1079.5], Scale: [0.00046296296, 0.00046296296], Type: Polynomial}, Quality: 1.0}, Version: Acapella:2013}</FlatfieldProfile>
      </Entry>
      <Entry ChannelID="3">
        <FlatfieldProfile>{Background: {Character: Null, Mean: NaN, Profile: {Type: Identity}, Quality: 0.25}, Channel: 3, ChannelName: 488 long, Foreground: {Character: NonFlat, NonFlatness: {Original: 0.76508814, Random: 0.038975296}, Profile: {Coefficients: [[1.2777], [0.0452, -0.2261], [-1.542, 0.3696, -2.0432], [0.5051, 0.819, -0.6026, 0.3894], [-0.285, -0.5768, 0.7885, 0.0575, 1.5303]], Dims: [2160, 2160], Origin: [1079.5, 1079.5], Scale: [0.00046296296, 0.00046296296], Type: Polynomial}, Quality: 1.0}, Version: Acapella:2013}</FlatfieldProfile>
      </Entry>
      <Entry ChannelID="4">
        <FlatfieldProfile>{Background: {Character: Null, Mean: NaN, Profile: {Type: Identity}, Quality: 0.25}, Channel: 4, ChannelName: Alexa 488, Foreground: {Character: NonFlat, NonFlatness: {Original: 0.79320383, Random: 0.038657013}, Profile: {Coefficients: [[1.279], [0.0522, -0.1535], [-1.7208, 0.2748, -1.8084], [0.2423, 0.6396, -0.5424, 0.3576], [0.4353, -0.567, 1.0237, 0.3775, 0.2043]], Dims: [2160, 2160], Origin: [1079.5, 1079.5], Scale: [0.00046296296, 0.00046296296], Type: Polynomial}, Quality: 1.0}, Version: Acapella:2013}</FlatfieldProfile>
      </Entry>
      <Entry ChannelID="5">
        <FlatfieldProfile>{Background: {Character: NonFlat, Mean: 432.49347, NoiseConst: 1.3, NonFlatness: {Corrected: 0.13452815, Original: 0.71967578, Random: 0.018331587}, Profile: {Coefficients: [[1.2248], [-0.2387, 0.041], [-1.0813, 0.2014, -1.2981], [0.5677, 0.2221, 0.0283, -0.1704], [-1.7485, -0.5801, 0.651, 0.0083, -0.7362]], Dims: [2160, 2160], Origin: [1079.5, 1079.5], Scale: [0.00046296296, 0.00046296296], Type: Polynomial}, Quality: 1.0}, Channel: 5, ChannelName: HOECHST 33342, Foreground: {Character: NonFlat, NonFlatness: {Original: 1.0146352, Random: 0.059520878}, Profile: {Coefficients: [[1.3021], [-0.1311, -0.0537], [-0.7414, 1.1962, -1.3613], [1.2206, 0.9062, -0.9075, -0.4483], [-6.2633, -3.6864, 0.8118, -1.5641, -4.3353]], Dims: [2160, 2160], Origin: [1079.5, 1079.5], Scale: [0.00046296296, 0.00046296296], Type: Polynomial}, Quality: 1.0}, Version: Acapella:2013}</FlatfieldProfile>
      </Entry>
      <Entry ChannelID="6">
        <FlatfieldProfile>{Background: {Character: Null, Profile: {Type: Identity}, Quality: 1}, Channel: 6, ChannelName: Brightfield CP, Foreground: {Character: Flat, Profile: {Type: Identity}, Quality: 1}, Version: Acapella:2013}</FlatfieldProfile>
      </Entry>
    </Map>
    <Map>

shntnu commented 4 years ago

Beth can continue to add notes in this issue if she is able to inspect this data. But otherwise, nothing more to do here.

bethac07 commented 4 years ago

The cell counts for this plate are indeed very low by eye, just looking at the plate.

Screenshot below is one randomly selected field (3) from each well, then all the wells laid out as they would be on the plate- black is background, cells are a mix of red, green, and cyan.

You'll see that by eye >1/2 the wells are almost or completely black.

shntnu commented 4 years ago

> just looking at the plate.

(for our notes, Beth used the workflow described here)

shntnu commented 4 years ago

In https://github.com/broadinstitute/cmQTL/pull/40 (this notebook), I randomly sampled 5000 cells from this plate. Below is the number of cells with an NA value per feature, for the top few features:

name	number_of_na
Nuclei_Correlation_Costes_AGP_Mito	340
Cells_Correlation_Costes_ER_Mito	328
Cytoplasm_Correlation_Costes_ER_Mito	328
Cytoplasm_Correlation_Costes_AGP_Mito	319
Nuclei_Correlation_Costes_Mito_AGP	317
Nuclei_Correlation_Costes_RNA_Mito	314
Nuclei_Correlation_Costes_ER_Mito	313
Cells_Correlation_Costes_RNA_Mito	303
Cells_Correlation_Costes_AGP_Mito	302

All features with number_of_na > 10 were Correlation features.
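
A tally like the one above can be reproduced roughly as follows. This is a minimal base R sketch on a toy table, not the code from the notebook; the values are made up, and the cutoff is lowered from 10 to 1 only because the toy table has four cells.

```r
# Toy single-cell table (hypothetical values; real tables have thousands of features).
cells <- data.frame(
  Nuclei_Correlation_Costes_AGP_Mito = c(NA, NA, 0.2, NA),
  Cells_Correlation_Costes_ER_Mito = c(NA, 0.1, 0.3, NA),
  Cells_AreaShape_Area = c(100, 120, 95, 90)
)

# Number of NA values per feature, most frequent first.
na_counts <- sort(colSums(is.na(cells)), decreasing = TRUE)
na_table <- data.frame(
  name = names(na_counts),
  number_of_na = unname(na_counts),
  row.names = NULL
)
print(na_table)

# Check whether every feature above the cutoff is a Correlation feature.
cutoff <- 1  # the thread uses 10 on a 5000-cell sample
frequent <- na_table$name[na_table$number_of_na > cutoff]
all_correlation <- all(grepl("_Correlation_", frequent, fixed = TRUE))
print(all_correlation)
```

Matching on the `_Correlation_` substring relies on the feature naming pattern visible in the table above (compartment, then measurement type, then details).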

These features had apparently nothing to do with cell size, so there's something else going on.

    ## 
    ## Call:
    ## lm(formula = Nuclei_Correlation_Costes_AGP_Mito ~ ., data = data_matrix)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -0.4335 -0.2968 -0.2689 -0.2354  3.9613 
    ## 
    ## Coefficients:
    ##                            Estimate Std. Error t value Pr(>|t|)  
    ## (Intercept)               0.0190576  0.0487076   0.391   0.6956  
    ## Cells_AreaShape_Area     -0.0009254  0.0003775  -2.452   0.0143 *
    ## Cytoplasm_AreaShape_Area  0.0009250  0.0003775   2.450   0.0143 *
    ## Nuclei_AreaShape_Area     0.0008028  0.0003518   2.282   0.0226 *
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.999 on 4990 degrees of freedom
    ## Multiple R-squared:  0.002636,   Adjusted R-squared:  0.002036 
    ## F-statistic: 4.395 on 3 and 4990 DF,  p-value: 0.004283
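
The thread does not show how `data_matrix` was built. One plausible setup, stated here purely as an assumption, is to regress an NA indicator for the feature on the area features: an R-squared near zero then suggests that missingness is not explained by cell size. A minimal sketch on simulated data, with made-up means, spreads, and NA rate:

```r
set.seed(1)
n <- 500

# Simulated area features (hypothetical values, not taken from the plate).
sim <- data.frame(
  Cells_AreaShape_Area = rnorm(n, mean = 2000, sd = 300),
  Nuclei_AreaShape_Area = rnorm(n, mean = 500, sd = 80)
)

# Simulated NA indicator, drawn independently of the area features.
sim$is_na <- rbinom(n, size = 1, prob = 0.07)

# Linear model of the NA indicator on the area features.
fit <- lm(is_na ~ ., data = sim)
r_squared <- summary(fit)$r.squared
print(r_squared)
```

Because the indicator is simulated independently of the areas, the R-squared here is small by construction; on real data the same quantity would be read off the `summary()` output, as in the listing above.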

bethac07 commented 4 years ago

Are they all Costes features, specifically? Because we typically throw those out downstream anyway.

shntnu commented 4 years ago

> Are they all Costes features, specifically? Because we typically throw those out downstream anyway.

Yes, all of the most frequent ones are Costes.

I didn't know we throw them out; I thought it was only Manders and RWC as documented here

bethac07 commented 4 years ago

We've had multiple Slack discussions about throwing them out, but were waiting for a final decision from the profilers - but we-the-assay-devs have seen them be a problem repeatedly in other sets.

Tagged you there to remind you of context.

bethac07 commented 4 years ago

It also looks like from a search of my email that Greg typically now removes them in pycytominer- see excerpt below from the resistance mechanisms GH issue 40

> I also removed costes (and other extreme outlier) features from all profiles. This made the profiles look much cleaner 🎉 We will continue dropping these types of features in future projects.

shntnu commented 4 years ago

I inspected the features in this notebook and found that all of the most frequent NA-valued features were Costes features.

These features are poorly behaved (https://github.com/cytomining/profiling-handbook/pull/52) and we have decided to drop them going forward.

It's not clear why this plate had so many more features with NA values, but it's possible that for whatever reason this one just ended up with a long tail of NA features (only a few cells are NA, but for many features).

shntnu commented 4 years ago

Thanks again @bethac07 for digging into this! We are all set here.

shntnu commented 4 years ago

Oops – not quite done yet :) @jatinarora-upmc have a look at this notebook and LMK if it makes sense (sorry, ran out of time to annotate it).

jatinarora-upmc commented 4 years ago

I did the QC (the same as for the other 6 plates) on this plate 7. The attached screenshot shows that I start with 303612 cells and 4296 features. The number of features decreased to 3578 post-QC. This decrease includes the removal of blacklisted features, Costes/correlation features, etc. These feature counts are fine and look like those for the other plates.

The point to note is that 1443 cells (303612-302169) had a missing measurement (NA) for one or more features, which is not the case on other plates. Overall, as you all mentioned many times, although the total number of cells (302,169 post-QC) is low compared to other plates, it seems we have a good number of features (3578) measured for them.
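
The bookkeeping above can be illustrated with a small base R sketch. This is not the QC script used for these plates: the toy table, the `_Costes_` name filter, and the order of the two steps are assumptions made for illustration, and the real QC also removes blacklisted features that are not modelled here.

```r
# Toy table before QC (hypothetical feature names and values).
cells <- data.frame(
  Cells_AreaShape_Area = c(100, 120, NA, 90),
  Cells_Correlation_Costes_ER_Mito = c(0.5, NA, 0.2, 0.4),
  Nuclei_Intensity_MeanIntensity_DNA = c(0.3, 0.2, 0.4, 0.1)
)

# Step 1 (assumed): drop Costes features by name.
keep <- !grepl("_Costes_", names(cells), fixed = TRUE)
features_kept <- cells[, keep, drop = FALSE]

# Step 2 (assumed): drop cells that still have an NA in any remaining feature.
cells_kept <- features_kept[complete.cases(features_kept), , drop = FALSE]

cat("features:", ncol(cells), "->", ncol(cells_kept), "\n")
cat("cells:", nrow(cells), "->", nrow(cells_kept), "\n")

# The same bookkeeping with the numbers reported above.
cells_removed <- 303612 - 302169
features_removed <- 4296 - 3578
print(c(cells_removed = cells_removed, features_removed = features_removed))
```

Dropping the NA-prone features before filtering cells keeps more cells than the reverse order would, since a cell whose only NA sits in a Costes feature survives.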

shntnu commented 4 years ago

We are all set here, I think :)
