Should we reprocess all profiles before frozen data release?

gwaybio commented 3 years ago

I am leaning towards doing this. To work toward reprocessing, we need to accomplish the following:

- [x] ~release pycytominer version 0.1. It will be great to include a stable pycytominer version in the conda environment. We've upgraded pycytominer so much since the original reprocessing, and rerunning profiles will ease headaches (see below).~ (Decided not to pursue)
- [x] update MOA map for batch 2 data (see https://github.com/broadinstitute/lincs-cell-painting/pull/61#issuecomment-804136814)
  - Resolved in https://github.com/broadinstitute/lincs-cell-painting/issues/62#issuecomment-812531389

What headaches will an updated pycytominer resolve?

- the updated pycytominer fixes the no-name gzip flag (#50)
- updated naming convention "blacklist" -> "blocklist"
- potential to change epsilon in spherize()
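
For context on the gzip item above: a gzip header can embed the original file name and a modification time, so compressing identical profiles twice can produce different bytes and spurious diffs under version control. The sketch below shows the general idea using only Python's standard library. It is an illustration under that assumption, not pycytominer's actual implementation; the function name and the sample bytes are hypothetical.

```python
import gzip
import io


def gzip_without_name_or_time(data: bytes) -> bytes:
    """Compress bytes with no file name and a fixed timestamp in the gzip header."""
    buffer = io.BytesIO()
    # filename="" omits the optional FNAME field; mtime=0 pins the MTIME field,
    # which mirrors what `gzip -n` does on the command line.
    with gzip.GzipFile(filename="", mode="wb", fileobj=buffer, mtime=0) as handle:
        handle.write(data)
    return buffer.getvalue()


# Hypothetical stand-in for the contents of a profile file
profile_bytes = b"well,feature\nA01,0.12\n"

first = gzip_without_name_or_time(profile_bytes)
second = gzip_without_name_or_time(profile_bytes)
```

Because neither a name nor a timestamp is stored, `first` and `second` are byte-identical, which is what keeps recompressed files stable in git lfs or dvc.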

Rerunning the pipeline will also enable us to migrate from git lfs to dvc.

Time estimate

1. Rerunning the pipeline will take non-negligible time, probably ~1 week, but it will increase confidence in the data and improve its organization.
2. Migrating from git lfs to dvc will take 4 hours.
3. Releasing pycytominer version 0.1 will take longer. I think we are close to an official version 0.1 release: https://github.com/cytomining/pycytominer/milestone/1

shntnu commented 3 years ago

@gwaygenomics I've updated the time estimate section in your top post. I'm not sure how long 2 will take, but if it's not too long, I propose we do 1 and 2, but not 3 (unless you think it's feasible for you to do it, given everything else going on).

gwaybio commented 3 years ago

Sounds good. dvc will not take long (a couple of hours). I will use #63 to track 1 (I hope to get this running tomorrow) and will open a new PR for 2.

shntnu commented 3 years ago

> a new PR for 2

If possible, it will be super helpful if you can add some notes for migrating/setting up dvc to this issue:
https://github.com/cytomining/profiling-template/issues/13 (rough notes are perfectly fine, especially given your time constraints).

gwaybio commented 3 years ago

In https://github.com/broadinstitute/lincs-cell-painting/pull/61#issuecomment-804136814, I said:

> But I wonder if I need to update the external moa file first with the new batch broad ids...

Facing this now. @shntnu, do you have any historical knowledge about how these broad ids might have differed from the pilot?

In a spot check (n=1), one plate from batch 2 has only 13 MOAs matched in repurposing_info_external_moa_map_resolved.tsv, while batch 1 plates have ~60.
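
One way to quantify this kind of mismatch is to count how many of a plate's broad_sample IDs appear in the annotation table, both by full ID and by the leading compound prefix (for example `BRD-K41599323`). The sketch below uses toy placeholder IDs. The 13-character prefix length, the function name, and the data are assumptions for illustration, not the repository's code.

```python
def count_annotated(plate_ids, annotated_ids, prefix_length=13):
    """Count plate IDs found in the annotations by full ID and by compound prefix."""
    annotated_full = set(annotated_ids)
    annotated_prefix = {broad_id[:prefix_length] for broad_id in annotated_ids}
    unique_ids = set(plate_ids)

    full_matches = sum(1 for broad_id in unique_ids if broad_id in annotated_full)
    prefix_matches = sum(
        1 for broad_id in unique_ids if broad_id[:prefix_length] in annotated_prefix
    )
    return {
        "n_samples": len(unique_ids),
        "full_id_matches": full_matches,
        "prefix_matches": prefix_matches,
    }


# Toy placeholder IDs (hypothetical, not real compounds)
plate = [
    "BRD-K00000001-001-01-1",
    "BRD-K00000002-001-03-6",
    "BRD-A00000003-001-07-9",
]
annotations = [
    "BRD-K00000001-001-01-1",  # identical to a plate ID
    "BRD-K00000002-001-01-2",  # same compound prefix, different suffix
]

summary = count_annotated(plate, annotations)
```

A large gap between `full_id_matches` and `prefix_matches` would suggest that the batch differs mainly in ID suffixes rather than in the compounds themselves.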

shntnu commented 3 years ago

> do you have any historical knowledge about how these broad ids might have differed from the pilot?

@tnat1031 do you happen to know the answer to this? The details below might help recap.

```r
library(tidyverse)

# Batch 2 platemap files, read directly from the repository
platemaps <-
  c("https://raw.githubusercontent.com/broadinstitute/lincs-cell-painting/master/metadata/platemaps/2017_12_05_Batch2/platemap/ASG003_A549_24H.txt",
    "https://raw.githubusercontent.com/broadinstitute/lincs-cell-painting/master/metadata/platemaps/2017_12_05_Batch2/platemap/LKCP001_A549_24H.txt",
    "https://raw.githubusercontent.com/broadinstitute/lincs-cell-painting/master/metadata/platemaps/2017_12_05_Batch2/platemap/LKCP002_A549_24H.txt")

n_cell_lines <- 3
n_time_points <- 3

# Unique Broad sample IDs across the three platemaps
lkcp_broad_samples <-
  platemaps %>%
  map_df(read_tsv, col_types = cols()) %>%
  distinct(broad_sample)

# Show a random sample of 10 IDs
lkcp_broad_samples %>%
  sample_n(10) %>%
  knitr::kable()
```

| broad_sample           |
|:-----------------------|
| BRD-K41599323-001-01-5 |
| BRD-K59325863-001-03-6 |
| BRD-K19034817-001-04-8 |
| BRD-K92723993-001-12-5 |
| BRD-K70301876-034-06-1 |
| BRD-K57252450-001-02-5 |
| BRD-A87130939-001-07-9 |
| BRD-K12906202-001-06-2 |
| BRD-K15567136-003-03-3 |
| BRD-A78195072-001-06-2 |

```r
# Total number of unique Broad sample IDs
lkcp_broad_samples %>%
  count() %>%
  knitr::kable()
```

|   n |
|----:|
| 349 |

Created on 2021-04-02 by the reprex package (v0.3.0)

tnat1031 commented 3 years ago

@shntnu @gwaygenomics If I recall correctly, the batch 2 compounds (aka LKCP) were not explicitly chosen to overlap with the pilot compounds. Rather, it was an experiment designed to compare the L1000 and CP readouts under exactly the same conditions (compounds, cell lines, doses, replicates, and time points exactly matched).

gwaybio commented 3 years ago

thanks @tnat1031 - the specific question is if you know why the majority of these compounds do not align with CMAP broad ID annotations. Were they experimental compounds lacking MOA/target info?

shntnu commented 3 years ago

Thanks for looking into this @tnat1031

> the specific question is if you know why the majority of these compounds do not align with CMAP broad ID annotations.

Exactly

> Were they experimental compounds lacking MOA/target info?

@tnat1031 perhaps this doc might help you recollect?

gwaybio commented 3 years ago

Looks like one of the tables linked in that doc indicates that many of these broad IDs do indeed have MOA annotations.

Two comments:

1. I don't see TARGET info.
2. It's possible that we already have all annotations present in that document; it does seem like there might be fewer than in batch 1 (I will check).

tnat1031 commented 3 years ago

I think one possible issue could be that the 'official' CMap MoA/target annotations from batch 1 were incomplete. These annotations were (and still are) pretty consistently in flux, and it's possible the annotations in the google spreadsheet do not match those in the CMap file you've been using. They should all be annotated though, as none of them are experimental compounds. Are the annotations very different or are they different terms (or spellings) that have similar meaning?

One solution I can think of is to just use whatever MoA/target annotations are currently provided in the repurposing hub as a reputable 3rd party source for this information, then freeze it with the data. I realize this might impact Adeniyi's MoA classification results. Is re-training and re-testing those classifiers prohibitive?

gwaybio commented 3 years ago

> These annotations were (and still are) pretty consistently in flux, and it's possible the annotations in the google spreadsheet do not match those in the CMap file you've been using.

I see. This aligns with our experience. We're actually using a maximally aligned MOA file built from all previous, publicly available CMAP annotation resources. In my opinion, all of these fixes should happen upstream of this repo, so I agree with this plan:

> One solution I can think of is to just use whatever MoA/target annotations are currently provided in the repurposing hub as a reputable 3rd party source for this information, then freeze it with the data.

This will not actually impact @adeboyeML's MOA classification work, since we're already using the maximally aligned annotations. We will, however, need to rerun anyway after data freeze and with spherized (aka whitened) data.

In an attempt to solve these problems upstream, I'll tag @jrsacher. Josh has helped us a ton in getting the best possible alignment of CMAP MOA/Target annotations. Josh, I see that you're no longer at the Broad. If you don't mind, can you connect us with the cheminformatics data scientist who would be most able to help us resolve these issues?

Thanks!

jrsacher commented 3 years ago

Chuck Perry (cperry@broadinstitute.org) has taken over Repurposing from a chemistry perspective. As far as I'm aware, there isn't anyone in a pure cheminformatics role anymore, but he may be able to help with the annotation data. I'm still around as a consultant to CDoT, so if there's anything technical or that Chuck isn't comfortable handling, I can probably help out.

tnat1031 commented 3 years ago

Ok cool, that sounds good to me. Thanks everyone.

shntnu commented 3 years ago

Thanks @tnat1031 and @jrsacher!

gwaybio commented 2 years ago

Hi @jrsacher - we are wrapping up this paper now, and we'd like to include you in our acknowledgements section. We will write something to the effect of "We'd like to thank Joshua Sacher for his help in curating Drug Repurposing Hub compound metadata."

Do we have your permission to include you in this section? Thanks again for all of your expertise with this effort!

jrsacher commented 2 years ago

Absolutely! I appreciate the appreciation!

gwaybio commented 2 years ago

Will do! Thanks again!

