broadinstitute / lincs-cell-painting

Processed Cell Painting Data for the LINCS Drug Repurposing Project
BSD 3-Clause "New" or "Revised" License

Adding batch 2 consensus profiles #61

Closed gwaybio closed 3 years ago

gwaybio commented 3 years ago

Here, I add consensus profiles for batch 2. I also add Metadata_cell_id to the aggregation columns for both batches (batch 2 has three cell lines), and I make some minor changes throughout the notebook.

We find only 1,620 consensus profiles in batch 2 (we have 8,340 in batch 1).

shntnu commented 3 years ago

> We find only 1,620 consensus profiles in batch 2 (we have 8,340 in batch 1).

This must be a metadata issue or a missing grouping column

Batch 2 has 3 cell lines x 3 dose points x 3 time points x 360 compounds = ~9720 (not exact because some compounds might be missing all doses)

shntnu commented 3 years ago

Here is the exact number of consensus profiles for batch 2

```r
library(tidyverse)

# The three unique batch 2 platemaps (file names reference A549 at 24H)
platemaps <- c(
  "https://raw.githubusercontent.com/gwaygenomics/lincs-cell-painting/batch2-consensus/metadata/platemaps/2017_12_05_Batch2/platemap/ASG003_A549_24H.txt",
  "https://raw.githubusercontent.com/gwaygenomics/lincs-cell-painting/batch2-consensus/metadata/platemaps/2017_12_05_Batch2/platemap/LKCP001_A549_24H.txt",
  "https://raw.githubusercontent.com/gwaygenomics/lincs-cell-painting/batch2-consensus/metadata/platemaps/2017_12_05_Batch2/platemap/LKCP002_A549_24H.txt"
)

n_cell_lines <- 3
n_time_points <- 3

# Count distinct compound-dose pairs, then scale by cell lines and time points
platemaps %>%
  map_df(read_tsv) %>%
  distinct(broad_sample, mmoles_per_liter) %>%
  tally(name = "n_consensus") %>%
  mutate(n_consensus = n_consensus * n_cell_lines * n_time_points) %>%
  knitr::kable()
```

| n_consensus |
|------------:|
|        9396 |

gwaybio commented 3 years ago

I was missing time as a grouping column, thanks!
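
For reference, a minimal pandas sketch of why leaving a column out of the grouping collapses the consensus count. The toy table, the cell line and time point values, and the column names `Metadata_time_point` and `Metadata_broad_sample` are assumptions for illustration only; this is not the notebook's code or schema.

```python
import itertools

import pandas as pd

# Toy replicate-level table (all values hypothetical):
# 2 cell lines x 3 time points x 2 compounds, with 2 replicates each.
cell_lines = ["A549", "MCF7"]
time_points = ["6H", "24H", "48H"]
compounds = ["cpd_a", "cpd_b"]

rows = [
    {
        "Metadata_cell_id": cell,
        "Metadata_time_point": time,
        "Metadata_broad_sample": compound,
        "feature": float(replicate),
    }
    for cell, time, compound, replicate in itertools.product(
        cell_lines, time_points, compounds, range(2)
    )
]
profiles = pd.DataFrame(rows)

without_time = ["Metadata_cell_id", "Metadata_broad_sample"]
with_time = without_time + ["Metadata_time_point"]

# Each group becomes one consensus profile (median over its replicates).
n_without_time = profiles.groupby(without_time)["feature"].median().shape[0]
n_with_time = profiles.groupby(with_time)["feature"].median().shape[0]

print(n_without_time, n_with_time)
```

Without the time column, replicates from different time points are pooled into one profile, so the toy count drops by the number of time points.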

gwaybio commented 3 years ago

This turned out to be an even larger problem. The aggregate function drops samples if any of their aggregating columns (strata) has missing values. Eek! I opened cytomining/pycytominer#133 to resolve this globally, but for this PR, my solution is to recode missing values as "unknown". This only impacts the MOA and target columns.
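
A minimal sketch of the behavior and the workaround, assuming the aggregation sits on top of a pandas `groupby`. The toy data and the `Metadata_moa` and `Metadata_broad_sample` column names are hypothetical, and this is not pycytominer's actual implementation.

```python
import numpy as np
import pandas as pd

# Toy replicate-level profiles; cpd_b has no MOA annotation (hypothetical data).
profiles = pd.DataFrame(
    {
        "Metadata_broad_sample": ["cpd_a", "cpd_a", "cpd_b", "cpd_b"],
        "Metadata_moa": ["inhibitor", "inhibitor", np.nan, np.nan],
        "feature": [1.0, 3.0, 5.0, 7.0],
    }
)
strata = ["Metadata_broad_sample", "Metadata_moa"]

# By default, groupby silently drops rows whose grouping key contains NaN,
# so cpd_b disappears from the aggregated output.
dropped = profiles.groupby(strata)["feature"].median().reset_index()

# Workaround used in this PR: recode missing values as "unknown" before grouping.
recoded = profiles.assign(Metadata_moa=profiles["Metadata_moa"].fillna("unknown"))
kept = recoded.groupby(strata)["feature"].median().reset_index()

print(len(dropped), len(kept))
```

In pandas 1.1 and later, passing `dropna=False` to `groupby` is another way to keep these rows; whether that suits the global fix is a separate question.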

This impacted both batches of data, but batch 2 substantially more. Batch 2 now has 10,368 consensus profiles. Note that your example above does not include platemaps from multiple time points.

Also note that I do update MOAs in the profiling step for both batches:

https://github.com/broadinstitute/lincs-cell-painting/blob/d471bbd38d9a13aa3ea1681337718bcba552fa16/profiles/profile_cells.py#L73-L83

But I wonder if I need to update the external MOA file first with the new batch's Broad IDs...

https://github.com/broadinstitute/lincs-cell-painting/blob/d471bbd38d9a13aa3ea1681337718bcba552fa16/profiles/profiling_pipeline.py#L46-L50

gwaybio commented 3 years ago

In other words, if I have to do this, then I'll need to rerun the profiling pipeline for at least the batch 2 data.

gwaybio commented 3 years ago

@shntnu - this PR is ready for review. Let's discuss a potential full reprocessing in #62. We need not decide to reprocess in full before merging this PR.

shntnu commented 3 years ago

> the aggregate function will drop samples if one of their aggregating columns (strata) has missing values.

Wow, glad you found it! Bad 🐼 !

> This impacted both batches of data, but batch 2 substantially more. Batch 2 now has 10,368 consensus profiles.

Ah, that's because we are using pert_well in grouping (as per plan 👍). 3 x 3 x 3 x 384 = 10,368

> Note that your example above does not include platemaps from multiple time points.

For our notes: It does actually – there are only 3 unique platemaps (containing 3 doses x ~360 compounds), so I read 3 of them then multiplied that by 3x3. But that example is useless given that we are computing consensus by including the pert_well column :D
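
The three counts in this thread can be reconciled with a few lines of arithmetic. The factor labels below are one reading of the numbers (cell lines x time points x unique platemaps x 384 wells per plate) and are assumptions, not something the notebook states.

```python
# Counts quoted in the thread.
n_cell_lines = 3
n_time_points = 3
n_doses = 3
n_compounds_approx = 360  # "~360 compounds"
n_platemaps = 3           # unique platemaps
wells_per_plate = 384     # assumed reading of the "384" factor

# Rough estimate when grouping by compound and dose.
approx_by_compound = n_cell_lines * n_doses * n_time_points * n_compounds_approx

# The R snippet reported 9396, which implies this many distinct
# (broad_sample, dose) pairs across the three platemaps.
distinct_compound_dose_pairs = 9396 // (n_cell_lines * n_time_points)

# Count when pert_well is a grouping column: every well of every
# platemap is its own group, for each cell line and time point.
by_well = n_cell_lines * n_time_points * n_platemaps * wells_per_plate

print(approx_by_compound, distinct_compound_dose_pairs, by_well)
```

Grouping by well gives more profiles than grouping by compound and dose because every well of each platemap counts as its own group.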

shntnu commented 3 years ago

> this PR is ready for review.

lgtm