Closed by gwaybio 4 years ago
Is it blocking if I get to 2. mid next week? I'd need to dig up a few things.
Not currently blocking 👍
This was the workflow used (repo is private), which predates the profiling handbook.
My suggestion is to use the Level 3 data in /home/ubuntu/bucket/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/backend/2016_04_01_a549_48hr_batch1
and then recreate Levels 4 and 5 using the handbook. For 10 plates, you would also need to recreate the Level 3 data, because they were processed incorrectly (see below).
If you have the time, I would suggest comparing the level 3, 4, 5 data to that generated using pycytominer (assuming you already have this), and if the differences are within expected floating point error, using the pycytominer version instead (given that, at least at the moment, cytominer is not on CRAN and therefore not easily reproducible).
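One way to run that comparison is a tolerance check over the shared feature columns of the two tables. A minimal sketch (not code from this repo; `profiles_match` and its tolerance are illustrative, and the `Metadata_` prefix convention follows the profile format used here):

```python
# Sketch: load the cytominer- and pycytominer-derived profiles for the same
# plate and confirm the shared feature columns agree to floating point error.
import numpy as np
import pandas as pd


def profiles_match(df_a, df_b, tol=1e-10):
    """True if all shared non-Metadata feature columns agree within `tol`."""
    features = [
        c for c in df_a.columns
        if not c.startswith("Metadata_") and c in df_b.columns
    ]
    return bool(
        np.allclose(
            df_a[features].to_numpy(),
            df_b[features].to_numpy(),
            atol=tol,
            equal_nan=True,
        )
    )
```

The absolute tolerance of 1e-10 matches the threshold used in the comparison later in this thread.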
A hack for figuring out which statistic was used to summarize single cell profiles:
```r
library(dplyr)
library(magrittr)
library(purrr)
library(readr)

x <-
  list.files(
    "../../backend/2016_04_01_a549_48hr_batch1/",
    recursive = TRUE,
    full.names = TRUE,
    pattern = "augmented.csv$"
  ) %>%
  map_df(function(fname) {
    read_csv(
      fname,
      col_types =
        cols_only(
          Cells_AreaShape_Area = col_double(),
          Metadata_Plate = col_character()
        )
    )
  })

# Cell areas are integer pixel counts, so a per-well median is always an
# integer or half-integer (doubling it gives a whole number); a mean is not.
x %<>%
  mutate(is_median = ceiling(Cells_AreaShape_Area * 2) == Cells_AreaShape_Area * 2)

x %<>%
  group_by(Metadata_Plate) %>%
  summarise(is_median = sum(is_median), n = n())

# Keep plates where not every well passes the half-integer test
x %<>%
  filter(is_median != n)

x %>%
  knitr::kable()
```
The level 3 data for these plates were created using means, not medians, and should be reprocessed to use medians.
| Metadata_Plate | is_median | n |
|---|---|---|
| SQ00015116 | 0 | 384 |
| SQ00015117 | 0 | 384 |
| SQ00015118 | 1 | 384 |
| SQ00015119 | 1 | 384 |
| SQ00015120 | 1 | 384 |
| SQ00015121 | 1 | 384 |
| SQ00015122 | 0 | 384 |
| SQ00015123 | 1 | 384 |
| SQ00015125 | 1 | 384 |
| SQ00015126 | 2 | 384 |
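The hack works because `Cells_AreaShape_Area` is an integer pixel count: a median of integers is always an integer or half-integer, so doubling it gives a whole number, while a mean over many cells generally does not. A small Python illustration with hypothetical single-cell areas from one well:

```python
# Illustration of the half-integer test used above (hypothetical data).
import math
import statistics


def looks_like_median(aggregated_value):
    # ceiling(x * 2) == x * 2 iff x is an integer or half-integer
    return math.ceil(aggregated_value * 2) == aggregated_value * 2


# Hypothetical integer cell areas (pixel counts) for one well
areas = [101, 205, 150, 178, 164, 199, 187]

looks_like_median(statistics.median(areas))  # True: median of integers
looks_like_median(statistics.mean(areas))    # False: mean is 169.142...
```

Wells that fail the test (as almost all wells on the plates above do) were therefore very likely aggregated with a mean.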
I am working through confirming pycytominer and cytominer equivalency. I added cytomining/pycytominer#72 to mirror the cytominer "robust" function.
I tested this using one example plate (SQ00014814) and level 3 and level 4a data that were derived from pycytominer in broadinstitute/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad#3 (private repo) and from cytominer, presumably via this pipeline.
I am noting here an observation and potential discrepancy in the cytominer processing details. Specifically, I noticed that the cytominer processing (link above) notes that the plate was normalized against DMSO. However, when I compare the cytominer results to the pycytominer results, the cytominer results are closer to pycytominer results when normalizing against the whole plate. Furthermore, the cytominer level 4a results are more similar to cytominer level 3 results processed with pycytominer using all samples vs. using DMSO samples.
```python
from pycytominer.cyto_utils import infer_cp_features
from pycytominer import normalize

cytominer_df = data["level_three"]["cytominer"]
pycytominer_df = data["level_three"]["pycytominer"]
pycytominer_df.Metadata_broad_sample = pycytominer_df.Metadata_broad_sample.fillna("DMSO")

cp_cols = infer_cp_features(pycytominer_df)

# Process pycytominer level 3 data with two different normalization strategies
pycytominer_norm_all_df = normalize(
    profiles=pycytominer_df,
    features="infer",
    method="mad_robustize",
    samples="all",
    output_file="none",
).loc[:, cp_cols]

pycytominer_norm_dmso_df = normalize(
    profiles=pycytominer_df,
    features="infer",
    method="mad_robustize",
    samples="Metadata_broad_sample == 'DMSO'",
    output_file="none",
).loc[:, cp_cols]

# Process cytominer level 3 data with two different normalization strategies
cytominer_norm_all_df = normalize(
    profiles=cytominer_df,
    features="infer",
    method="mad_robustize",
    samples="all",
    output_file="none",
).loc[:, cp_cols]

cytominer_norm_dmso_df = normalize(
    profiles=cytominer_df,
    features="infer",
    method="mad_robustize",
    samples="Metadata_broad_sample == 'DMSO'",
    output_file="none",
).loc[:, cp_cols]

# Also load cytominer level 4 data
cytominer_level_four_df = data["level_four"]["cytominer"]

# Count features that differ beyond floating point error
# (screenshot of results below)
((cytominer_level_four_df - pycytominer_norm_all_df).sum().abs() > 1e-10).sum()
((cytominer_level_four_df - pycytominer_norm_dmso_df).sum().abs() > 1e-10).sum()
((cytominer_level_four_df - cytominer_norm_all_df).sum().abs() > 1e-10).sum()
((cytominer_level_four_df - cytominer_norm_dmso_df).sum().abs() > 1e-10).sum()
```
Cytominer normalization seems closer to whole-plate normalization than to DMSO normalization.
Which is the most appropriate normalization strategy? All pycytominer-derived data are based on a DMSO normalization strategy. This includes the results with the batch effect observed across DMSO wells (broadinstitute/cell-health#84).
Note an update to `robustize_mad` in cytomining/pycytominer#74 that deals with missing values and divide-by-zero errors. It will, however, potentially introduce exploding features (features containing extreme values).
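A toy illustration of why this can happen (this is not the pycytominer implementation, and the `epsilon` guard here only mimics the kind of divide-by-zero handling the PR describes): when a feature is nearly constant, its MAD is tiny, so avoiding the zero denominator produces extreme normalized values instead.

```python
# Toy sketch (not the pycytominer implementation) of MAD robustization and
# how a nearly constant feature "explodes" under it.
import numpy as np


def mad_robustize(x, epsilon=0.0):
    """Center by the median, scale by the median absolute deviation (MAD)."""
    x = np.asarray(x, dtype=float)
    med = np.nanmedian(x)
    mad = np.nanmedian(np.abs(x - med))
    return (x - med) / (mad + epsilon)


# A feature with reasonable spread normalizes to modest values...
well_behaved = mad_robustize([1.0, 2.0, 3.0, 4.0, 5.0])

# ...but a nearly constant feature has a tiny MAD, so one modest outlier
# maps to a huge normalized value (on the order of 1e5 here).
nearly_constant = mad_robustize([5.0, 5.0, 5.00001, 5.0001, 6.0])
```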
Closing in favor of #22.
I am in the process of adding level 3-5 profiles to this repo (using git lfs). I will use this issue to document various questions I have about the process.
Should we use the cytominer profiles here? Should we consider the pycytominer-based profiles less mature (and therefore less stable)? The cytominer profiles are the ones that were originally computed:
/home/ubuntu/bucket/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/backend/2016_04_01_a549_48hr_batch1