Old/Updated Broad IDs - Githubissues

gwaybio commented 4 years ago

I have encountered perhaps a significant hurdle in adding Cell Painting Repurposing Hub profiles to this repo.

There are broad ids (pert_id) in the profile data that are absent from the updated moa information.

For example, in one plate (SQ00014814) the following pert_ids are present (with annotations) in the profile data, but are absent in the repurposing moa files in this repo:

['BRD-A69275535', 'BRD-A69636825', 'BRD-A69815203', 'BRD-A72309220', 'BRD-A72390365', 'BRD-A74980173', 'BRD-A82156122', 'BRD-K50691590', 'BRD-K68164687', 'BRD-K71480163', 'BRD-K81258678', 'BRD-K81957469']

Given that these pert_ids have annotations in cytominer-derived profiles, this indicates that the pert_ids have changed somewhere.

Before I pursue this issue, I was wondering if there are any known solutions or datasets that map old to updated pert_ids. cc @shntnu @niranjchandrasekaran

Perhaps also @jrsacher has insight here. Josh, I scanned the CLUE and DepMap resources and was not able to find a map. I also checked the column deprecated_broad_id and I was able to recover 3 of the profiles (['BRD-K50691590', 'BRD-K50691590', 'BRD-K81258678']).

Any insights or pointers here would be greatly appreciated!

gwaybio commented 4 years ago

Update: I am currently scanning the deprecated_broad_id column more closely. It appears that there are sometimes multiple deprecated ids separated by pipes. It's still possible this column stores the answer we're looking for.

Either way, this raises the question: which pert_ids should we include in the updated Cell Painting profiles? The updated pert_ids will be more comparable to future Drug Repurposing Hub data, but the old pert_ids will be more comparable to previous internal analyses performed using these data 🤔

gwaybio commented 4 years ago

> It's still possible this column stores the answer we're looking for.

Never mind, this does not resolve the issue. In #12 I add a map of new to deprecated IDs and include all deprecated IDs separated by pipes. The same three profiles (as mentioned above) are still the only three resolved.
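For readers following along, here is a minimal sketch of turning a pipe-separated deprecated_broad_id column into a deprecated-to-current lookup. It is not the code from #12; the sample rows, the `deprecated_to_current` helper, and the truncation to the 13-character pert_id are assumptions made for illustration.

```python
import csv
import io

# Hypothetical excerpt of a samples file; the IDs are placeholders, not real records.
SAMPLES_TSV = (
    "broad_id\tdeprecated_broad_id\n"
    "BRD-K00000001-001-01-1\tBRD-A00000010-001-01-1|BRD-A00000011-001-01-1\n"
    "BRD-K00000002-001-01-1\t\n"
    "BRD-K00000003-001-01-1\tBRD-A00000012-001-01-1\n"
)


def deprecated_to_current(tsv_text):
    """Return {deprecated pert_id: current pert_id}, one entry per deprecated ID."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        current = row["broad_id"][:13]  # keep only the pert_id prefix
        # Split the pipe-separated cell and skip empty pieces.
        for old in filter(None, (row["deprecated_broad_id"] or "").split("|")):
            mapping[old.strip()[:13]] = current
    return mapping


id_map = deprecated_to_current(SAMPLES_TSV)
print(id_map)
```

A profile pert_id that is neither a current ID nor a key in `id_map` is the kind of unresolved case described in this thread.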

There must be a map somewhere! 🌏

niranjchandrasekaran commented 4 years ago

It looks like the repurposing samples files from 2018 and 2020 have some differences. I haven't compared the files thoroughly, but when I manually looked for the first 3 BRD IDs in your missing compounds list (#11 Comment), I found all three of them in the 2018 file, while all three are missing in the 2020 file. I am not sure if they are actually missing or if they have been assigned alternate BRD IDs (I didn't look carefully enough). Do you think this could explain the missing compounds list?

niranjchandrasekaran commented 4 years ago

Some more info...

BRD-A69275535 corresponds to pinitol in the 2018 file; pinitol is present in the 2020 file but with a different ID - BRD-K87873585
BRD-A69636825 corresponds to diltiazem in the 2018 file; the 2018 file has four diltiazem entries but the ID of other three entries is BRD-K24023109. In the 2020 file all four entries have the ID BRD-K24023109
BRD-A69815203 corresponds to cyclosporin-a in the 2018 file; there are 5 entries for cyclosporin-A in the 2020 file but all have the ID BRD-K13533483. Curiously in the 2018 file cyclosporine has the ID BRD-K13533483

shntnu commented 4 years ago

@jrsacher has previously said (Aug 13, 2019)

> The more interesting/difficult list is when 1 name has more than 1 Core ID (see Actinomycin-d, for instance). There will be ~400 curations/changes in the next update to begin to address this.

It is possible that the curation resulted in ids being deprecated but not being added to the deprecated_broad_id list. @jrsacher can hopefully confirm.

Regarding which broad_id to use once this is resolved: generally my preference would be to use the broad_id that was active when this experiment was run. But given that we are using updated repurposing hub metadata here, that would get a bit confusing.

But assuming the missing deprecated id problem is solved, it shouldn't really matter. e.g. this problem will go away

> BRD-A69275535 corresponds to pinitol in the 2018 file; pinitol is present in the 2020 file but with a different ID - BRD-K87873585

So the way to address this is to tidy the metadata file i.e. pull out the deprecated ids into separate rows, then do an inner join with the old metadata files for LINCS on the broad_id column and we are all set. Does that make sense? LMK if you want to delegate any of this to either of us, @gwaygenomics

jrsacher commented 4 years ago

It turns out keying a database on something that can change -- like a compound name or a structure ID -- is not a great idea. Unfortunately, there's not really one "perfect" unique ID for chemicals. I think InChI14 may come close, but may not distinguish stereoisomers. Add into all that the fact that we migrated applications/databases that handle our chemical inventory between the two updates and I'm honestly surprised it's not more of a mess.

We may be restructuring the DB in the future, so I'm open to suggestions about how to best handle this.

Some details:

The internal Core ID (pert_id, BRD-[AKMU]00000000) is specific to a chemical structure. For some compounds, like cyclosporin, we may have received multiple structures from different vendors, which results in more than one Core ID mapping to a name. I've tried to curate them so that all pert_inames map to one pert_id, but I may have missed some.

Alternatively, pert_inames can be recorded incorrectly or change. There were some spelling mistakes, some of the same things recorded as separate entries that should be identical (cyclosporin/cyclosporin-a), and some updates (a drug getting an INN when moving from clinical to approved, for example).
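As a small illustration of the Core ID format described above, the sketch below checks an ID against the BRD-[AKMU] plus eight digits pattern and pulls the 13-character Core ID out of a longer sample ID. The `core_id` helper is hypothetical and is not part of any CLUE tooling.

```python
import re

# Pattern taken from the description above: "BRD-", one of A/K/M/U, then 8 digits.
PERT_ID = re.compile(r"^BRD-[AKMU]\d{8}")


def core_id(broad_id):
    """Return the 13-character Core ID prefix, or None if the ID does not fit the pattern."""
    match = PERT_ID.match(broad_id)
    return match.group(0) if match else None


print(core_id("BRD-K60230970-001-10-0"))  # full sample ID taken from this thread
print(core_id("BRD-A69275535"))           # already a Core ID
print(core_id("not-a-broad-id"))          # malformed input yields None
```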

Attached is a file with curation information I can pull on those Core IDs listed above. Let me know what other digging I can do! DotmaticsExport.xlsx

gwaybio commented 4 years ago

@niranjchandrasekaran

> I am not sure if they are actually missing or if they have been assigned alternate BRD IDs (I didn't look carefully enough). Do you think this could explain the missing compounds list?

I do. Especially given evidence you cite in https://github.com/broadinstitute/lincs-cell-painting/issues/11#issuecomment-611800240. Thanks for digging deeper into this 🕵️ I think the core of the issue is as @shntnu mentioned:

> It is possible that the curation resulted in ids being deprecated but not being added to the deprecated_broad_id list

@shntnu

> generally my preference would be to use the broad_id that was active when this experiment was run. But given that we are using updated repurposing hub metadata here, that would get a bit confusing.

I agree 💯 . This is very tricky... the CP platemaps themselves contain deprecated broad ids and it would not make much sense to update these since there are differences (as @jrsacher notes)

> The internal Core ID (pert_id, BRD-[AKMU]00000000) is specific to a chemical structure. For some compounds, like cyclosporin, we may have received multiple structures from different vendors, which results in more than one Core ID mapping to a name.

Therefore, I am in favor of resolving the deprecated IDs in the moa metadata in this repo, adding them to the existing files, and then proceeding as planned. I agree that the best way forward is as you suggest:

> pull out the deprecated ids into separate rows, then do an inner join with the old metadata files for LINCS on the broad_id column

Concretely:

1. Split the deprecated_broad_id column by pipe (see #12).
2. Add deprecated info to repurposing_simple.tsv (and maybe the other two files as well). We can also add a boolean column (is_deprecated).
3. Left join the profile file with platemap metadata to repurposing_simple.tsv, ignoring deprecation status. This will resolve some moa/target assignments (e.g. in SQ00014814 this will resolve 3 of 12 missing metadata).
4. What to do with resolving the remaining 9? 🤷‍♂️ This is the hard part.

An easy solution for 4 is to use the Repurposing Hub metadata from 2018 which we know would map to the CP platemaps. This doesn't feel like the way to go though for some reason...
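The left join with an is_deprecated flag can be sketched as follows. This is only an illustration under assumed inputs: the `annotate` function, the placeholder IDs, and the moa strings are invented here and do not reflect the contents of repurposing_simple.tsv.

```python
def annotate(profile_ids, moa_by_current_id, deprecated_map):
    """Left join: every profile ID is kept; unmatched rows get moa=None."""
    rows = []
    for pert_id in profile_ids:
        # Translate a deprecated ID to its current ID when a mapping exists.
        current = deprecated_map.get(pert_id, pert_id)
        rows.append(
            {
                "pert_id": pert_id,  # the original ID is never overwritten
                "current_id": current,
                "is_deprecated": pert_id in deprecated_map,
                "moa": moa_by_current_id.get(current),
            }
        )
    return rows


# Placeholder inputs for illustration only.
deprecated_map = {"BRD-A00000010": "BRD-K00000001"}
moa_by_current_id = {"BRD-K00000001": "example moa", "BRD-K00000002": "another moa"}
profile_ids = ["BRD-K00000002", "BRD-A00000010", "BRD-A00000099"]

annotated = annotate(profile_ids, moa_by_current_id, deprecated_map)
unresolved = [row["pert_id"] for row in annotated if row["moa"] is None]
print(unresolved)
```

Rows that end up in `unresolved` correspond to the hard cases: IDs that are neither current nor listed as deprecated.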

@jrsacher

> Add into all that the fact that we migrated applications/databases that handle our chemical inventory between the two updates and I'm honestly surprised it's not more of a mess.

Totally understandable - data curation is not easy and it is extremely important. Also important to note that this is only one plate of 137. So it is definitely going to get messier 😂

Thanks again for your prompt response - it is greatly appreciated!

> so I'm open to suggestions about how to best handle this.

I think your previous solution of adding deprecated IDs to a pipe separated deprecated_broad_id column is the right way to go. If this process is not already automated, then I strongly recommend limiting manual steps - this process is essential for mapping legacy data and I totally get how easy it can be to fall behind.

This solution is also how folks in the genetics community map gene names. They've been thinking about this problem for quite some time now. Gene symbols (e.g. like KRAS, TP53) have so many synonyms. Resources that map gene names across synonyms and identifiers are essential (see https://mygene.info/). 👈 this is a pretty heavyweight solution, but it does illustrate how important these things are :)

How do you suggest we proceed with step 4 above?

Is it worth going through the other Repurposing Hub sample/drug versions and pulling out every unique broad_id?

Also, @shntnu

> LMK if you want to delegate any of this to either of us, @gwaygenomics

It is my understanding that this needs to happen in the JUMP-CP project as well. Given that we should avoid duplicating efforts, if you and @niranjchandrasekaran want to tag team on the heavy lifting, I'd be happy reviewing pull requests that add these details to the repo.

niranjchandrasekaran commented 4 years ago

I am mostly thinking out loud here (@gwaygenomics please correct me if what I say doesn't make sense). The main purpose of this exercise is to map profiles of repurposing drugs to the latest MOAs by taking the following path

Broad ID of profiles -> repurposing_samples 2020 broad_id -> repurposing_samples 2020 pert_iname-> repurposing_drugs 2020 pert_iname -> repurposing_drugs 2020 target

But, it looks like the step - Broad ID of profiles -> repurposing_samples 2020 broad_id is probably not possible for all drugs.

Coming to my proposed solution, based on Josh's suggestion in #11 comment, if we used InChI14 (the first 14 characters of InChIKey), the above path will change into the following

Broad ID of profiles -> repurposing_samples 2018 broad_id -> repurposing_samples 2018 InChIKey InChI14 -> repurposing_samples 2020 InChIKey InChI14 -> repurposing_samples 2020 pert_iname-> repurposing_drugs 2020 pert_iname -> repurposing_drugs 2020 target

For the 3 examples in #11 comment by taking this path, I am able to map broad_id 2018 to broad_id 2020. As Josh noted in #11 comment, InChI14 does not capture stereochemistry which is a drawback of this approach. But my question is, for the purpose of mapping MOAs, does the stereochemistry matter? Given biology does care about stereochemistry, it probably does matter but for the 3 examples, it doesn't seem to.
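The proposed path can be sketched like this. All records below are placeholders (the InChIKeys are deliberately fake strings with the right shape), and `map_old_id` is a hypothetical helper rather than code from the repo. Note that it returns a list, because dropping the stereochemistry block means one old ID can match several new samples.

```python
def map_old_id(old_broad_id, samples_old, samples_new, drugs_new):
    """Follow old broad_id -> InChIKey14 -> new sample -> pert_iname -> target."""
    key14 = samples_old[old_broad_id][:14]  # first 14 characters of the InChIKey
    hits = []
    for sample in samples_new:
        if sample["InChIKey"][:14] == key14:
            annotation = drugs_new.get(sample["pert_iname"], {})
            hits.append((sample["broad_id"], sample["pert_iname"], annotation.get("target")))
    return hits


# Placeholder data: the two "drug-x" keys differ only after the first 14 characters.
samples_old = {"BRD-A00000001": "AAAAAAAAAAAAAA-BBBBBBBBSA-N"}
samples_new = [
    {"broad_id": "BRD-K00000002", "pert_iname": "drug-x", "InChIKey": "AAAAAAAAAAAAAA-CCCCCCCCSA-N"},
    {"broad_id": "BRD-K00000003", "pert_iname": "drug-y", "InChIKey": "DDDDDDDDDDDDDD-BBBBBBBBSA-N"},
]
drugs_new = {"drug-x": {"target": "GENE1|GENE2"}, "drug-y": {"target": "GENE3"}}

hits = map_old_id("BRD-A00000001", samples_old, samples_new, drugs_new)
print(hits)
```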

| 2018 `pert_iname` | 2020 `pert_iname` | 2018 `broad_id` | `moa` | 2018 `target` | 2020 `target` |
| :--- | :--- | :--- | :--- | :--- | :--- |
| pinitol | pinitol | `BRD-A69275535` | gamma secretase inhibitor | | |
| diltiazem | diltiazem | `BRD-A69636825` | calcium channel blocker | CACNA1C CACNA1S CACNA2D1 CACNG1 HTR3A KCNA5 | CACNA1C CACNA1S CACNA2D1 CACNG1 HTR3A KCNA5 |
| cyclosporin-a | cyclosporin-A | `BRD-A69815203` | calcineurin inhibitor | PPP3CA | ABCB11 CAMLG FPR1 PPIA PPIF PPP3CA PPP3R2 SLC10A1 SLCO1B1 SLCO1B3 |

Note:

pinitol in the 2018 and 2020 datasets are stereoisomers, but the MOA is the same for both
Though the targets for cyclosporin-a are different in the 2018 and 2020 lists, as mentioned in #11 comment, there are two cyclosporine entries in 2018 (cyclosporin-a and cyclosporine). If the targets of both are combined, we get the same list of genes as the 2020 entry.

| 2018 cyclosporin-a | 2018 cyclosporine | 2020 cyclosporin-A |
| :--- | :--- | :--- |
| PPP3CA | ABCB11 CAMLG FPR1 PPIA PPIF PPP3R2 SLC10A1 SLCO1B1 SLCO1B3 | ABCB11 CAMLG FPR1 PPIA PPIF PPP3CA PPP3R2 SLC10A1 SLCO1B1 SLCO1B3 |
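The claim that the two 2018 target lists combine into the 2020 list can be checked with a set union. The gene symbols are copied from the table above; the variable names are only for illustration.

```python
# Target lists copied from the table above.
cyclosporin_a_2018 = set("PPP3CA".split())
cyclosporine_2018 = set("ABCB11 CAMLG FPR1 PPIA PPIF PPP3R2 SLC10A1 SLCO1B1 SLCO1B3".split())
cyclosporin_a_2020 = set(
    "ABCB11 CAMLG FPR1 PPIA PPIF PPP3CA PPP3R2 SLC10A1 SLCO1B1 SLCO1B3".split()
)

# Union of the two 2018 entries, compared against the single 2020 entry.
combined_2018 = cyclosporin_a_2018 | cyclosporine_2018
targets_match = combined_2018 == cyclosporin_a_2020
print(targets_match, sorted(combined_2018))
```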

There is a good chance that these three compounds are outliers. Perhaps we should compare the entire 2018 and 2020 lists to figure out if using InChI14 for mapping is a viable strategy and if losing stereochemistry information is inconsequential for MOA assignment in this dataset.

gwaybio commented 4 years ago

> Broad ID of profiles -> repurposing_samples 2018 broad_id -> repurposing_samples 2018 InChIKey InChI14 -> repurposing_samples 2020 InChIKey InChI14 -> repurposing_samples 2020 pert_iname-> repurposing_drugs 2020 pert_iname -> repurposing_drugs 2020 target

YES! I love this approach

> for the purpose of mapping MOAs, does the stereochemistry matter? Given biology does care about stereochemistry, it probably does matter but for the 3 examples, it doesn't seem to.

If we're going to lose signal somewhere, I think this solution minimizes it.

> Perhaps we should compare the entire 2018 and 2020 lists to figure out if using InChI14 for mapping is a viable strategy and if losing stereochemistry information is inconsequential for MOA assignment in this dataset.

I am for this exercise. We might as well do it programmatically for every drug. I think it is also worth quantifying in how many instances a stereoisomer difference also corresponds to an moa difference. It will be tough (maybe impossible?) to tease apart stereoisomer differences from regular updates between 2018 and 2020.

@shntnu - one thing to confirm before @niranjchandrasekaran embarks on this analysis is which Drug Repurposing Hub version was used in the Cell Painting platemaps. It might not be the 2018 version given that the data were collected in 2015.

gwaybio commented 4 years ago

worth mentioning that I am for this approach and not just using old Broad IDs because we should leverage new MOA and target data @jrsacher and others worked to update.

shntnu commented 4 years ago

> It might not be the 2018 version given that the data were collected in 2015.

The broad_id in the 28 platemaps files added here https://github.com/broadinstitute/lincs-cell-painting/pull/10/files are from 2015, if that answers your question.

gwaybio commented 4 years ago

The drug/sample info from CLUE starts at 2017 - https://clue.io/repurposing#download-data. Do we know which version of these resources we used? Maybe it is not listed here...

niranjchandrasekaran commented 4 years ago

Checking manually, all the missing pert_ids on plate SQ00014814 map to the broad_id of at least one entry in the 2018 version, except for BRD-K81258678, which maps to a deprecated_broad_id. Given the 100% match, shall I assume that all pert_ids will map to either a broad_id or a deprecated_broad_id? Or is there a quick way to confirm this? @gwaygenomics is there a way for me to access the pert_ids of compounds on all the plates?

shntnu commented 4 years ago

> The drug/sample info from CLUE starts at 2017 - https://clue.io/repurposing#download-data. Do we know which version of these resources we used? Maybe it is not listed here...

Not listed here apparently, but 2017 comes close.

Here's my notebook dump

``` r
library(glue)
library(magrittr)
library(tidyverse)
```

``` r
platemap_ids <-
  list.files(
    "~/work/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/metadata/2016_04_01_a549_48hr_batch1/platemap/",
    full.names = TRUE,
    pattern = ".txt"
  ) %>%
  map_df(read_tsv) %>%
  rename(broad_id = broad_sample) %>%
  mutate(pert_id = str_sub(broad_id, 1, 13)) %>%
  distinct(broad_id, pert_id)
```

``` r
clue_2017_ids <-
  read_tsv("https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20170327.txt", comment = "!") %>%
  mutate(pert_id = str_sub(broad_id, 1, 13)) %>%
  distinct(broad_id, pert_id)
```

    ## Parsed with column specification:
    ## cols(
    ##   broad_id = col_character(),
    ##   pert_iname = col_character(),
    ##   qc_incompatible = col_double(),
    ##   purity = col_double(),
    ##   vendor = col_character(),
    ##   catalog_no = col_character(),
    ##   vendor_name = col_character(),
    ##   expected_mass = col_number(),
    ##   smiles = col_character(),
    ##   InChIKey = col_character(),
    ##   pubchem_cid = col_double()
    ## )

``` r
clue_2018_ids <-
  read_tsv("https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20180516.txt", comment = "!") %>%
  mutate(pert_id = str_sub(broad_id, 1, 13)) %>%
  distinct(broad_id, pert_id)
```

    ## Parsed with column specification:
    ## cols(
    ##   broad_id = col_character(),
    ##   pert_iname = col_character(),
    ##   qc_incompatible = col_double(),
    ##   purity = col_double(),
    ##   vendor = col_character(),
    ##   catalog_no = col_character(),
    ##   vendor_name = col_character(),
    ##   expected_mass = col_number(),
    ##   smiles = col_character(),
    ##   InChIKey = col_character(),
    ##   pubchem_cid = col_double(),
    ##   deprecated_broad_id = col_character()
    ## )

``` r
clue_2018_deprecated_ids_x <-
  read_tsv("https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20180516.txt", comment = "!") %>%
  pull(deprecated_broad_id) %>%
  paste(collapse = "|") %>%
  str_split("\\|")
```

    ## Parsed with column specification:
    ## cols(
    ##   broad_id = col_character(),
    ##   pert_iname = col_character(),
    ##   qc_incompatible = col_double(),
    ##   purity = col_double(),
    ##   vendor = col_character(),
    ##   catalog_no = col_character(),
    ##   vendor_name = col_character(),
    ##   expected_mass = col_number(),
    ##   smiles = col_character(),
    ##   InChIKey = col_character(),
    ##   pubchem_cid = col_double(),
    ##   deprecated_broad_id = col_character()
    ## )

``` r
clue_2018_deprecated_ids_x <- unique(clue_2018_deprecated_ids_x[[1]])

clue_2018_deprecated_ids <-
  tibble(broad_id = clue_2018_deprecated_ids_x) %>%
  mutate(pert_id = str_sub(broad_id, 1, 13)) %>%
  bind_rows(clue_2018_ids) %>%
  distinct(broad_id, pert_id)
```

No deprecated ids column in 2017. Use 2018 column just in case that’s useful (but turns out not to be, below)

``` r
clue_2017_deprecated_ids <-
  tibble(broad_id = clue_2018_deprecated_ids_x) %>%
  mutate(pert_id = str_sub(broad_id, 1, 13)) %>%
  bind_rows(clue_2017_ids) %>%
  distinct(broad_id, pert_id)
```

tl;dr The best we can get is with 2017, joining on pert\_ids (19 are missing)

``` r
platemap_ids %>%
  distinct(broad_id) %>%
  anti_join(clue_2017_ids %>% distinct(broad_id)) %>%
  knitr::kable()
```

    ## Joining, by = "broad_id"

| broad\_id              |
| :--------------------- |
| NA                     |
| BRD-K60230970-001-10-0 |
| BRD-K03842655-001-02-1 |
| BRD-A02006392-001-10-7 |
| BRD-K80738081-001-26-0 |
| BRD-K89732114-300-08-9 |
| BRD-K73395020-001-02-3 |
| BRD-K01192156-001-02-7 |
| BRD-A84045418-001-03-1 |
| BRD-K60623809-001-02-0 |
| BRD-A20131130-001-01-7 |
| BRD-K31342827-001-08-8 |
| BRD-K28470988-001-02-0 |
| BRD-K01638814-051-10-1 |
| BRD-K03816923-001-05-4 |
| BRD-A67373739-001-02-2 |
| BRD-A77216878-001-01-4 |
| BRD-K36324071-363-01-3 |
| BRD-A58280226-312-01-3 |
| BRD-A69951442-001-01-3 |
| BRD-A43331270-001-01-6 |
| BRD-K07572174-001-17-0 |
| BRD-K41895714-001-01-4 |
| BRD-K23875128-001-04-2 |
| BRD-K04887706-375-01-4 |
| BRD-A84481105-003-20-6 |
| BRD-A62025033-001-01-8 |
| BRD-A05457250-001-05-0 |
| BRD-K41996876-001-06-3 |

``` r
platemap_ids %>%
  distinct(pert_id) %>%
  anti_join(clue_2017_ids %>% distinct(pert_id)) %>%
  knitr::kable()
```

    ## Joining, by = "pert_id"

| pert\_id      |
| :------------ |
| NA            |
| BRD-K03842655 |
| BRD-K73395020 |
| BRD-K01192156 |
| BRD-A84045418 |
| BRD-K60623809 |
| BRD-A20131130 |
| BRD-K03816923 |
| BRD-A67373739 |
| BRD-A77216878 |
| BRD-K36324071 |
| BRD-A58280226 |
| BRD-A69951442 |
| BRD-A43331270 |
| BRD-K41895714 |
| BRD-K23875128 |
| BRD-K04887706 |
| BRD-A62025033 |
| BRD-K41996876 |

Details are below

``` r
platemap_ids %>%
  distinct(broad_id) %>%
  anti_join(clue_2017_ids %>% distinct(broad_id)) %>%
  count() %>%
  knitr::kable()
```

    ## Joining, by = "broad_id"

|  n |
| -: |
| 29 |

``` r
platemap_ids %>%
  distinct(pert_id) %>%
  anti_join(clue_2017_ids %>% distinct(pert_id)) %>%
  count() %>%
  knitr::kable()
```

    ## Joining, by = "pert_id"

|  n |
| -: |
| 19 |

``` r
platemap_ids %>%
  distinct(broad_id) %>%
  anti_join(clue_2017_deprecated_ids %>% distinct(broad_id)) %>%
  count() %>%
  knitr::kable()
```

    ## Joining, by = "broad_id"

|  n |
| -: |
| 29 |

``` r
platemap_ids %>%
  distinct(pert_id) %>%
  anti_join(clue_2017_deprecated_ids %>% distinct(pert_id)) %>%
  count() %>%
  knitr::kable()
```

    ## Joining, by = "pert_id"

|  n |
| -: |
| 19 |

``` r
platemap_ids %>%
  distinct(broad_id) %>%
  anti_join(clue_2018_ids %>% distinct(broad_id)) %>%
  count() %>%
  knitr::kable()
```

    ## Joining, by = "broad_id"

|  n |
| -: |
| 54 |

``` r
platemap_ids %>%
  distinct(pert_id) %>%
  anti_join(clue_2018_ids %>% distinct(pert_id)) %>%
  count() %>%
  knitr::kable()
```

    ## Joining, by = "pert_id"

|  n |
| -: |
| 44 |

``` r
platemap_ids %>%
  distinct(broad_id) %>%
  anti_join(clue_2018_deprecated_ids %>% distinct(broad_id)) %>%
  count() %>%
  knitr::kable()
```

    ## Joining, by = "broad_id"

|  n |
| -: |
| 54 |

``` r
platemap_ids %>%
  distinct(pert_id) %>%
  anti_join(clue_2018_deprecated_ids %>% distinct(pert_id)) %>%
  count() %>%
  knitr::kable()
```

    ## Joining, by = "pert_id"

|  n |
| -: |
| 44 |

*Note: @gwaygenomics edited this comment to put the notebook in a collapsible section (to reduce issue bloat)

niranjchandrasekaran commented 4 years ago

Given @shntnu's analysis, I will modify my approach to the following

Profiles pert_id -> repurposing_samples 2017 broad_id -> repurposing_samples 2017 InChIKey InChI14 -> repurposing_samples 2020 InChIKey InChI14 -> repurposing_samples 2020 pert_iname-> repurposing_drugs 2020 pert_iname -> repurposing_drugs 2020 target

I will think about this more carefully on Monday to make sure it makes sense.

gwaybio commented 4 years ago

thinking about this a bit more... what do you think of separating this deprecated ID map from the versioned broad id drugs/samples?

in other words, @niranjchandrasekaran's workflow would create a two/three column dictionary of:

broad_id	deprecated_broad_id	deprecated_version
BRD-K87873585	BRD-A69275535	2018
BRD-K24023109	BRD-A69636825	2018
BRD-K13533483	BRD-A69815203	2018

This will enable us to use this file as an intermediate step to map any repurposing hub version to moa/target info. This may solve both goals:

Do not change Metadata_broad_id in the Repurposing Hub Cell Painting Dataset
Include moa and target info for each Metadata_broad_id regardless of deprecation year

Also, maybe worth mentioning that there seem to be two roads to deprecation (and maybe they are actually equivalent)

Officially moved over to deprecated_broad_id by the CLUE team
Silently dropped between repurposing versions

Maybe we can document this as well in a fourth column 🤷‍♂️ @niranjchandrasekaran - does this make sense, would it work?
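A rough sketch of how such an intermediate map could be used without touching Metadata_broad_id: the three rows are the ones from the table in this comment, while `load_map`, `resolve`, and the output field names are assumptions made for illustration. A fourth column for the deprecation route could be carried along in the same way.

```python
import csv
import io

# Rows copied from the table above.
MAP_TSV = (
    "broad_id\tdeprecated_broad_id\tdeprecated_version\n"
    "BRD-K87873585\tBRD-A69275535\t2018\n"
    "BRD-K24023109\tBRD-A69636825\t2018\n"
    "BRD-K13533483\tBRD-A69815203\t2018\n"
)


def load_map(tsv_text):
    """Index the map by deprecated ID so old profile IDs can be looked up directly."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["deprecated_broad_id"]: row for row in reader}


def resolve(metadata_broad_id, id_map):
    """Keep the original ID and add the current ID next to it."""
    row = id_map.get(metadata_broad_id)
    return {
        "Metadata_broad_id": metadata_broad_id,  # unchanged
        "current_broad_id": row["broad_id"] if row else metadata_broad_id,
        "deprecated_version": row["deprecated_version"] if row else None,
    }


id_map = load_map(MAP_TSV)
resolved = resolve("BRD-A69275535", id_map)
print(resolved)
```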

niranjchandrasekaran commented 4 years ago

@gwaygenomics That's a great idea to have intermediate maps instead of a single map from profile Broad ID to targets/moa. I will look into what format would work best.

jrsacher commented 4 years ago

One additional thought that might help:

If you have the plate barcodes that you've tested over the years, we could likely run them to pull the current Broad IDs. This could give you a mapping on a per-well level.
Depending on what Compound Management has provided you over the years, there's also an internal Sample ID (starts with SA) that is a unique identifier in the CBIP database.

Most (if not all) of the compounds we silently dropped in this version had to be excluded for regulatory reasons. A few were dropped because they aren't compatible with screening (insoluble inorganics, polymers, etc.). If you need that list, I can provide it.

gwaybio commented 4 years ago

> If you have the plate barcodes that you've tested over the years, we could likely run them to pull the current Broad IDs.

For the specific dataset in this repo (Cell Painting of a subset of Drug Repurposing Hub Compounds), the minimum requirement to proceed with metadata annotation is that this exercise maps everything to current Broad IDs and that the current Broad IDs have moa/target info.

In other words, if we provide a list of all unique Broad IDs in this project and then a current Broad ID list is retrieved using clue.io-specific resources that provides a 1-1 mapping of moa/target info, we're all set for lincs_cell_painting (after a quick fidelity check).

Essentially, this is what @niranjchandrasekaran has proposed to do for all repurposing hub broad IDs by using intermediate files and columns. Maybe (especially for the JUMP-CP project) we want to create the deprecated map in addition to solving this repo's specific issue?

@jrsacher - this is great. A couple of followup questions:

I imagine that this internal mapping is private/internal to CMAP. Is there any way this resource could be made public on the CLUE website? It would make processing and data provenance easier to track for this repo (and other projects too).
In what format would you like me to send you a list of Broad IDs? A one column text file? Do you need any other metadata info?
Would you have an estimated time/effort that this would take? I think we have a pretty good sense of effort to pursue the alternative method - I imagine querying the Database with a file would be fairly straightforward. Please let us know if not!

jrsacher commented 4 years ago

> I imagine that this internal mapping is private/internal to CMAP. Is there any way this resource could be made public on the CLUE website? It would make processing and data provenance easier to track for this repo (and other projects too).

I can add old IDs to the deprecated_broad_id column in a future version. I'll admit that the process I used to pull them was likely not complete. I'm not sure the best way to handle compounds that were removed. Perhaps for the next incarnation, we just leave them in the database and add a display boolean for the site. That way, they'd still be obtainable via the API.

> In what format would you like me to send you a list of Broad IDs? A one column text file? Do you need any other metadata info?

Yup! Just the full Broad IDs (BRD-X12345678-001-01-9) would be perfect if you have them. Otherwise, I can work from plate barcodes or possibly other data. If you want to include any other metadata that would be relevant to you, feel free.

> Would you have an estimated time/effort that this would take? I think we have a pretty good sense of effort to pursue the alternative method - I imagine querying the Database with a file would be fairly straightforward. Please let us know if not!

It seems like this should only take 10-15 minutes, which means it will take about 2-3 hours 😁. I'm happy to do this, though, as it will really help REPO in the future.

If you could provide them, I'd appreciate your thoughts about columns/fields that would be helpful to see in a final file. I'll add as much as I can.

gwaybio commented 4 years ago

> Yup! Just the full Broad IDs (BRD-X12345678-001-01-9) would be perfect if you have them.

Great - here is the full list (added just now, 4d37dd7b7a66ccfe7b8af3336416161c2c2b017c in #10)

> It seems like this should only take 10-15 minutes, which means it will take about 2-3 hours 😁. I'm happy to do this, though, as it will really help REPO in the future.

I know the feeling 😆

> If you could provide them, I'd appreciate your thoughts about columns/fields that would be helpful to see in a final file. I'll add as much as I can.

We'd love to see (in the future and assuming that it's easy to retrieve) a list of all broad samples ever used, a map to their current sample and broad ID, the version where it has been deprecated (if it was), and a reason for its deprecation (can be coded*). I think these details are necessary and sufficient to map between all Repurposing Hub versions and provide a reference to why certain samples are dropped. Having granularity in the reference (i.e. the code) will help researchers determine the extent to which a perturbation can be trusted. (for example, deprecated b/c the compound was insoluble is different than deprecated b/c the compound can no longer be purchased - the insoluble deprecation isn't likely to lead to a trusted profile!)

\* for reasons you mentioned (i.e. insoluble inorganics, polymers, etc.)

Also maybe important to note: the annotated map from broad_sample_info.tsv (see above) to current broad IDs is of much higher priority than the full broad sample map. The former benefits this project (and thus the many, many extensions of these data) while the latter will benefit the larger community.

gwaybio commented 4 years ago

Also, to chat more about these ideas @jrsacher - I think it would be best to open a new issue (can be in this repo, or in a different repo of your choosing). I'm happy to help with this and to bounce ideas off of! Let's keep the discussion in this issue to updating old Broad IDs

gwaybio commented 4 years ago

Getting to the bottom of this. I compared the broad IDs profiled across the Cell Painting plates (see #10 ) to every single broad id in every CLUE resource (four dates: "20170327", "20180516", "20180907", "20200324") including deprecated broad ids (see #13). Pull request incoming to add this code.

There are 18 broad IDs in the Cell Painting plates that are not described in any available resource we've explored in this repo:

['BRD-K04887706',
 'BRD-K41996876',
 'BRD-A62025033',
 'BRD-A20131130',
 'BRD-K03816923',
 'BRD-A67373739',
 'BRD-K73395020',
 'BRD-K01192156',
 'BRD-A84045418',
 'BRD-K60623809',
 'BRD-A77216878',
 'BRD-K36324071',
 'BRD-A58280226',
 'BRD-A69951442',
 'BRD-A43331270',
 'BRD-K41895714',
 'BRD-K23875128',
 'BRD-K03842655']
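The comparison described above boils down to a set difference against the union of every resource version. The sketch below is not the code from the pull request; the version labels are the four dates listed in this comment, and the ID sets and the `missing_everywhere` helper are placeholders.

```python
def missing_everywhere(profile_ids, ids_by_version):
    """Return profile IDs that appear in none of the versioned ID sets."""
    known = set().union(*ids_by_version.values())
    return sorted(set(profile_ids) - known)


# Placeholder ID sets keyed by the CLUE resource dates; each set is assumed to
# already include that version's deprecated IDs.
ids_by_version = {
    "20170327": {"BRD-K00000001", "BRD-K00000002"},
    "20180516": {"BRD-K00000002", "BRD-A00000003"},
    "20180907": {"BRD-K00000002"},
    "20200324": {"BRD-K00000004"},
}
profile_ids = ["BRD-K00000001", "BRD-A00000003", "BRD-K00000004", "BRD-A00000099"]

never_described = missing_everywhere(profile_ids, ids_by_version)
print(never_described)
```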

gwaybio commented 4 years ago

There are also an additional 45 broad_ids that do not align (using InChIKey14 as an intermediate) to clue annotations in the most up to date resources (20200324). CP_InChIKey14_missing_in_20200324.txt. (PR incoming)

These broad_ids do have annotations in the 20170327 CLUE resource.

Recommendation

We tried really hard to get these annotations accurate! Here is a current summary of broad_id annotations:

Description	Count
Total Broad IDs	1514
Broad IDs with complete MOA	1450
Broad IDs with complete target	1225
Broad IDs with complete MOA and target	1224
Broad IDs missing annotations in all CLUE resources	18
Broad IDs missing annotations in `20200324`	45

There is zero overlap between Broad IDs in the two "missing" rows of the table above.

So, we could do two things:

We could try to extend annotations and recover 45 - 18 = 27 additional annotations by using CLUE resources older than 20200324.
@jrsacher could cross-reference internal files to try to recover the 18 missing broad_ids.

I think this is overkill! We might want to leave these exercises for future annotation upgrades.

@shntnu @niranjchandrasekaran - thoughts on next steps are welcome

gwaybio commented 4 years ago

We discussed a solution during profiling checkin today, which I will summarize below:

Given that broad_ids do not easily map across versions with any column indicator, we should use the most complete indicator.
InChIKey14 seems to be the best indicator, but there are still some issues (see #17)
- From @jrsacher "Unfortunately, there's not really one "perfect" unique ID for chemicals. I think InChI14 may come close, but may not distinguish stereoisomers."
To solve the different stereoisomer issues, we will create alternate_moa and alternate_target columns for the cases where the same InChIKey14 maps to two different moa/targets on the basis of different stereochemistry.

In the updated profiles we will be clear about how we're deriving the moa/target annotations (both in the documentation and by providing the processing code).
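One way the alternate_moa idea could look in code, as a sketch only: records sharing an InChIKey14 are grouped, the first moa seen becomes moa, and any further distinct values go into alternate_moa. The pipe-joined format, the `collapse_by_inchikey14` helper, and the placeholder records are assumptions, not the processing code referenced here.

```python
from collections import defaultdict


def collapse_by_inchikey14(records):
    """Group records by InChIKey14 and split distinct moa values into moa/alternate_moa."""
    grouped = defaultdict(list)
    for record in records:
        key14 = record["InChIKey"][:14]
        if record["moa"] not in grouped[key14]:  # keep first-seen order, no duplicates
            grouped[key14].append(record["moa"])
    return {
        key14: {"moa": moas[0], "alternate_moa": "|".join(moas[1:]) or None}
        for key14, moas in grouped.items()
    }


# Placeholder records: the first two differ only in the stereochemistry block.
records = [
    {"InChIKey": "AAAAAAAAAAAAAA-BBBBBBBBSA-N", "moa": "moa one"},
    {"InChIKey": "AAAAAAAAAAAAAA-CCCCCCCCSA-N", "moa": "moa two"},
    {"InChIKey": "DDDDDDDDDDDDDD-BBBBBBBBSA-N", "moa": "moa three"},
]

collapsed = collapse_by_inchikey14(records)
print(collapsed)
```

The same grouping would apply to an alternate_target column.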

Next Steps

I have a PR that is just about ready to go to generate the map. I will file this PR and ask @niranjchandrasekaran to review. From there, I will proceed with the steps outlined here https://github.com/broadinstitute/lincs-cell-painting/issues/4#issue-577535384

shntnu commented 1 year ago

@niranjchandrasekaran Q for you when you get the chance

Does this file https://github.com/broadinstitute/lincs-cell-painting/blob/061870127481dcd73c29df85ebcfddeac2ed0828/metadata/moa/clue/repurposing_drugs_20200324.txt

solve any of these mapping problems fully?

(1) I have the Broad ID of a compound, or (2) I have the InChIKey of a compound ... and I want to know its MOA (as deposited in clue.io/repurposing)

If it doesn't, where does it fail?

niranjchandrasekaran commented 1 year ago

I don't believe that file will solve Yu's mapping issue. The map that Greg and I created helps map compounds across the different versions of the repurposing hub list, but it seems to fail for new CDoT experiments.

IIUC, our problem is the following

When CDoT shares a platemap for a new experiment, the broad_sample names in those files don't match the broad_sample names in the repurposing hub lists. Using the first 13 characters of broad_sample names does help, but even then, not all compounds seem to map (that's what Yu said yesterday). We also have the one to many (broad_sample to pert_iname) mapping issue, but we may be able to alleviate that by combining all the pert_inames and their moa and target maps.

I had similar problems with Target-1 and Target-2, but since we designed the plates ourselves, I knew exactly how the broad_sample names given by CDoT mapped to our list of broad_sample names. Since we didn't design the adipocyte platemaps, that has become an issue.

niranjchandrasekaran commented 1 year ago

> I had similar problems with Target-1 and Target-2, but since we designed the plates ourselves, I knew exactly how the broad_sample names given by CDoT mapped to our list of broad_sample names.

Actually, I was able to solve the mapping problem easily only for Target-1. Target-2 plate map was designed by Anita, so I wasn't able to map it readily to Target-1. Luckily, Anita also shared the pert_inames of compounds in Target-2, along with the platemap.

Based on the experience with Target-2, I guess it would be nice to get the pert_inames of compounds from CDoT instead of broad_sample names.

broadinstitute / lincs-cell-painting

Old/Updated Broad IDs #11
