broadinstitute / lincs-cell-painting

Processed Cell Painting Data for the LINCS Drug Repurposing Project
BSD 3-Clause "New" or "Revised" License

Adding Level 3-5 Cell Painting Data Questions #3

Closed: gwaybio closed this issue 4 years ago

gwaybio commented 4 years ago

I am in the process of adding level 3-5 profiles to this repo (using git lfs). I will use this issue to document various questions I have about the process.

  1. Confirm what the levels actually are! 😆
  2. I assume I should add the cytominer profiles here, and that we should consider the pycytominer-based profiles less mature (and therefore less stable)? The cytominer profiles are the ones that were originally computed.
    • Located here: /home/ubuntu/bucket/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/backend/2016_04_01_a549_48hr_batch1
    • Were they processed using the standard profiling workflow?
shntnu commented 4 years ago

Is it blocking if I get to item 2 in the middle of the week after next? I'd need to dig up a few things.

gwaybio commented 4 years ago

Not currently blocking 👍

shntnu commented 4 years ago

This was the workflow used (repo is private), which predates the profiling handbook.

My suggestion is to use the Level 3 data in /home/ubuntu/bucket/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/backend/2016_04_01_a549_48hr_batch1 and then recreate Levels 4 and 5 using the handbook. For 10 of the plates, you would also need to recreate the Level 3 data, because they were processed incorrectly (see below).

If you have the time, I would suggest comparing the Level 3, 4, and 5 data to that generated using pycytominer (assuming you already have this), and if the differences are within expected floating point errors, using the pycytominer version instead (given that, at least at the moment, cytominer is not on CRAN and is therefore not easily reproducible).
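A tolerance-based check along these lines would work (a minimal sketch, assuming both versions of a plate's profiles load as identically shaped data frames; the file names, column convention, and tolerance here are placeholders):

import numpy as np
import pandas as pd

# Hypothetical inputs: the same plate's profiles from each tool
cytominer_df = pd.read_csv("cytominer_profiles.csv")
pycytominer_df = pd.read_csv("pycytominer_profiles.csv")

# Compare the feature columns element-wise, within floating point tolerance
feature_cols = [c for c in cytominer_df.columns if not c.startswith("Metadata_")]
equivalent = np.allclose(
    cytominer_df[feature_cols].to_numpy(),
    pycytominer_df[feature_cols].to_numpy(),
    atol=1e-10,
    equal_nan=True,
)
print(equivalent)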

A hack for figuring out which statistic was used to summarize single-cell profiles:

library(tidyverse)
library(magrittr)

# Read only Cells_AreaShape_Area and Metadata_Plate from every
# augmented (Level 3) CSV in the batch
x <-
  list.files(
    "../../backend/2016_04_01_a549_48hr_batch1/",
    recursive = TRUE,
    full.names = TRUE,
    pattern = "augmented\\.csv$"
  ) %>%
  map_df(function(fname) {
    read_csv(
      fname,
      col_types =
        cols_only(
          Cells_AreaShape_Area = col_double(),
          Metadata_Plate = col_character()
        )
    )
  })

# Flag wells whose area summary is a multiple of 0.5, as a median of
# integer pixel areas must be
x %<>%
  mutate(is_median = ceiling(Cells_AreaShape_Area * 2) == Cells_AreaShape_Area * 2)

# Count, per plate, how many wells pass the median check
x %<>%
  group_by(Metadata_Plate) %>%
  summarise(is_median = sum(is_median), n = n())

# Keep the plates where not every well passes, i.e., plates that were
# likely summarized with means
x %<>%
  filter(is_median != n)

x %>%
  knitr::kable()
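To see why the check works: single-cell areas are integer pixel counts, so a per-well median is always a multiple of 0.5, while a per-well mean over many cells almost never is. A toy demonstration (in Python, with made-up areas):

import statistics

areas = [101, 204, 317]  # made-up integer pixel areas for one well

med = statistics.median(areas)  # 204 -> a multiple of 0.5
avg = statistics.mean(areas)    # 207.333... -> not a multiple of 0.5

# Same test as the R ceiling() trick: does doubling give an integer?
print(med * 2 == int(med * 2))  # True  -> consistent with a median
print(avg * 2 == int(avg * 2))  # False -> consistent with a mean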

The Level 3 data for the following ten plates were created using means, not medians, and should be reprocessed using medians:

| Metadata_Plate | is_median | n   |
| -------------- | --------- | --- |
| SQ00015116     | 0         | 384 |
| SQ00015117     | 0         | 384 |
| SQ00015118     | 1         | 384 |
| SQ00015119     | 1         | 384 |
| SQ00015120     | 1         | 384 |
| SQ00015121     | 1         | 384 |
| SQ00015122     | 0         | 384 |
| SQ00015123     | 1         | 384 |
| SQ00015125     | 1         | 384 |
| SQ00015126     | 2         | 384 |
gwaybio commented 4 years ago

I am working through confirming pycytominer and cytominer equivalency. I added cytomining/pycytominer#72 to mirror the cytominer "robust" function.

I tested this using one example plate (SQ00014814), comparing level 3 and level 4a data derived from pycytominer in broadinstitute/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad#3 (private repo) against data derived from cytominer, presumably from this pipeline.

I am noting here an observation and a potential discrepancy in the cytominer processing details. The cytominer processing (link above) notes that the plate was normalized against DMSO. However, when I compare the two tools, the cytominer results are closer to the pycytominer results generated by normalizing against the whole plate. Furthermore, the cytominer level 4a results are more similar to cytominer level 3 data processed with pycytominer using all samples than using DMSO samples only.

from pycytominer.cyto_utils import infer_cp_features
from pycytominer import normalize

# `data` holds the pre-loaded profile DataFrames for plate SQ00014814
# (loaded earlier in the notebook)
cytominer_df = data["level_three"]["cytominer"]
pycytominer_df = data["level_three"]["pycytominer"]

# Wells without a broad_sample annotation are DMSO control wells
pycytominer_df.Metadata_broad_sample = pycytominer_df.Metadata_broad_sample.fillna("DMSO")

# Restrict all comparisons to the CellProfiler feature columns
cp_cols = infer_cp_features(pycytominer_df)

# Process pycytominer level 3 data with two different normalization strategies
pycytominer_norm_all_df = normalize(
    profiles=pycytominer_df,
    features="infer",
    method="mad_robustize",
    samples="all",
    output_file="none",
).loc[:, cp_cols]

pycytominer_norm_dmso_df = normalize(
    profiles=pycytominer_df,
    features="infer",
    method="mad_robustize",
    samples="Metadata_broad_sample == 'DMSO'",
    output_file="none",
).loc[:, cp_cols]

# Process cytominer level 3 data with two different normalization strategies
cytominer_norm_all_df = normalize(
    profiles=cytominer_df,
    features="infer",
    method="mad_robustize",
    samples="all",
    output_file="none",
).loc[:, cp_cols]

cytominer_norm_dmso_df = normalize(
    profiles=cytominer_df,
    features="infer",
    method="mad_robustize",
    samples="Metadata_broad_sample == 'DMSO'",
    output_file="none",
).loc[:, cp_cols]

# Also load cytominer level 4 data for the same plate
cytominer_level_four_df = data["level_four"]["cytominer"]

# For each strategy, count how many features differ from the cytominer
# level 4 data by more than floating point tolerance (screenshot below)
((cytominer_level_four_df - pycytominer_norm_all_df).sum().abs() > 1e-10).sum()
((cytominer_level_four_df - pycytominer_norm_dmso_df).sum().abs() > 1e-10).sum()
((cytominer_level_four_df - cytominer_norm_all_df).sum().abs() > 1e-10).sum()
((cytominer_level_four_df - cytominer_norm_dmso_df).sum().abs() > 1e-10).sum()

[screenshot: counts of differing features for each of the four comparisons]

Summary

Cytominer normalization seems closer to whole-plate normalization than to DMSO normalization.

Question

Which is the most appropriate normalization strategy? All pycytominer-derived data so far uses the DMSO normalization strategy, including the results where a batch effect was observed across DMSO wells (broadinstitute/cell-health#84).
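For reference, a toy sketch of the two strategies using plain numpy, assuming the common robust z-score convention (x - median) / (1.4826 * MAD); pycytominer's mad_robustize may differ in details such as epsilon handling. The plate values below are made up:

import numpy as np
import pandas as pd

# Made-up plate: one feature, four DMSO wells and four treated wells
df = pd.DataFrame({
    "Metadata_broad_sample": ["DMSO"] * 4 + ["cmpd_A"] * 4,
    "Cells_AreaShape_Area": [100.0, 102.0, 98.0, 101.0, 150.0, 155.0, 149.0, 152.0],
})

def mad_normalize(values, reference):
    # Robust z-score of `values` against a reference population
    med = np.median(reference)
    mad = np.median(np.abs(reference - med)) * 1.4826
    return (values - med) / mad

x = df["Cells_AreaShape_Area"].to_numpy()
dmso_wells = (df["Metadata_broad_sample"] == "DMSO").to_numpy()

# Whole-plate normalization: every well is part of the reference
whole_plate = mad_normalize(x, x)

# DMSO normalization: only control wells form the reference
dmso_only = mad_normalize(x, x[dmso_wells])

# When treatments shift much of the plate, the two references (and
# therefore the normalized profiles) diverge substantially
print(whole_plate.round(2))
print(dmso_only.round(2))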

gwaybio commented 4 years ago

Note an update to robustize_mad in cytomining/pycytominer#74 that deals with missing values and divide-by-zero errors. It will, however, potentially introduce exploding features (features containing extreme values).
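The exploding-features concern is easy to reproduce: when a feature's MAD is at or near zero, adding a small epsilon to the denominator avoids the division error but turns tiny fluctuations into enormous normalized values. A made-up illustration (the epsilon is illustrative, not pycytominer's actual constant):

import numpy as np

# Made-up feature that is nearly constant across the plate
values = np.array([5.0, 5.0, 5.0, 5.0, 5.0001])

med = np.median(values)                # 5.0
mad = np.median(np.abs(values - med))  # 0.0 -> plain division would fail

eps = 1e-18  # tiny constant added to avoid divide-by-zero
normalized = (values - med) / (mad + eps)
print(normalized)  # [0. 0. 0. 0. ~1e+14] -- an "exploding" feature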

gwaybio commented 4 years ago

Closing in favor of #22.