Closed gwaybio closed 4 years ago
Update: I am currently scanning the `deprecated_broad_id` column more closely. It appears that there are sometimes multiple deprecated IDs separated by pipes. It's still possible this column stores the answer we're looking for.
Either way, this raises the question: which `pert_ids` should we include in the updated Cell Painting profiles? The updated `pert_ids` will be more comparable to future Drug Repurposing Hub data, but old `pert_ids` will be more comparable to previous internal analyses performed using these data 🤔
> It's still possible this column stores the answer we're looking for.

Never mind, this does not resolve the issue. In #12 I add a map of new to deprecated IDs and include all deprecated IDs separated by pipes. The same three profiles (as mentioned above) are still the only three resolved.
There must be a map somewhere!
It looks like the repurposing samples files from 2018 and 2020 have some differences. I haven't compared the files thoroughly, but when I manually looked for the first 3 BRD IDs in your missing compounds list (#11 Comment), I found all three of them in the 2018 file while all three are missing in the 2020 file. I am not sure if they are actually missing or if they have been assigned alternate BRD IDs (I didn't look carefully enough). Do you think this could explain the missing compounds list?
Some more info...
`BRD-A69275535` corresponds to pinitol in the 2018 file; pinitol is present in the 2020 file but with a different ID - `BRD-K87873585`.
`BRD-A69636825` corresponds to diltiazem in the 2018 file; the 2018 file has four diltiazem entries, but the ID of the other three entries is `BRD-K24023109`. In the 2020 file all four entries have the ID `BRD-K24023109`.
`BRD-A69815203` corresponds to cyclosporin-a in the 2018 file; there are 5 entries for cyclosporin-A in the 2020 file but all have the ID `BRD-K13533483`. Curiously, in the 2018 file cyclosporine has the ID `BRD-K13533483`.
@jrsacher has previously said (Aug 13, 2019):
> The more interesting/difficult list is when 1 name has more than 1 Core ID (see Actinomycin-d, for instance). There will be ~400 curations/changes in the next update to begin to address this.

It is possible that the curation resulted in IDs being deprecated but not being added to the `deprecated_broad_id` list. @jrsacher can hopefully confirm.
Regarding which `broad_id` to use once this is resolved: generally my preference would be to use the `broad_id` that was active when this experiment was run. But given that we are using updated repurposing hub metadata here, that would get a bit confusing.
But assuming the missing deprecated ID problem is solved, it shouldn't really matter, e.g. this problem will go away:
> `BRD-A69275535` corresponds to pinitol in the 2018 file; pinitol is present in the 2020 file but with a different ID - `BRD-K87873585`

So the way to address this is to tidy the metadata file, i.e. pull out the deprecated IDs into separate rows, then do an inner join with the old metadata files for LINCS on the `broad_id` column, and we are all set. Does that make sense? LMK if you want to delegate any of this to either of us, @gwaygenomics
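The tidy-then-join idea can be sketched roughly as follows. This is only an illustration using the Python standard library: the column names come from the thread, while the IDs, compound names, and the helper names `tidy_deprecated` and `inner_join` are placeholders invented for the example, not the repo's actual code.

```python
import csv
import io

# Toy stand-in for the updated metadata file. Column names follow the thread;
# the IDs and names are placeholders, not real Broad IDs.
UPDATED_TSV = (
    "broad_id\tdeprecated_broad_id\tpert_iname\n"
    "BRD-NEW1\tBRD-OLD1|BRD-OLD2\tcompound-a\n"
    "BRD-NEW2\t\tcompound-b\n"
)


def tidy_deprecated(rows):
    """Return one record per ID, covering current and deprecated IDs alike."""
    tidy = []
    for row in rows:
        current = row["broad_id"]
        deprecated = [d.strip() for d in row["deprecated_broad_id"].split("|")]
        tidy.append({"broad_id": current, "current_broad_id": current,
                     "pert_iname": row["pert_iname"], "is_deprecated": False})
        # Empty strings (no deprecated IDs for this row) are skipped.
        for old_id in filter(None, deprecated):
            tidy.append({"broad_id": old_id, "current_broad_id": current,
                         "pert_iname": row["pert_iname"], "is_deprecated": True})
    return tidy


def inner_join(platemap_ids, tidy_rows):
    """Keep only platemap IDs that have a match on broad_id (inner join)."""
    lookup = {row["broad_id"]: row for row in tidy_rows}
    return [lookup[i] for i in platemap_ids if i in lookup]


rows = list(csv.DictReader(io.StringIO(UPDATED_TSV), delimiter="\t"))
tidy = tidy_deprecated(rows)
joined = inner_join(["BRD-OLD2", "BRD-NEW2", "BRD-GONE"], tidy)
for record in joined:
    print(record["broad_id"], "->", record["current_broad_id"], record["is_deprecated"])
```

An inner join silently drops IDs with no match (`BRD-GONE` here), so it is probably worth counting the dropped IDs separately before relying on the result.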
It turns out keying a database on something that can change -- like a compound name or a structure ID -- is not a great idea. Unfortunately, there's not really one "perfect" unique ID for chemicals. I think InChI14 may come close, but may not distinguish stereoisomers. Add into all that the fact that we migrated applications/databases that handle our chemical inventory between the two updates and I'm honestly surprised it's not more of a mess.
We may be restructuring the DB in the future, so I'm open to suggestions about how to best handle this.
Some details:
The internal Core ID (`pert_id`, BRD-[AKMU]00000000) is specific to a chemical structure. For some compounds, like cyclosporin, we may have received multiple structures from different vendors, which results in more than one Core ID mapping to a name. I've tried to curate them so that all `pert_iname`s map to one `pert_id`, but I may have missed some.
Alternatively, `pert_inames` can be recorded incorrectly or change. There were some spelling mistakes, some of the same things recorded as separate entries that should be identical (cyclosporin/cyclosporin-a), and some updates (a drug getting an INN when moving from clinical to approved, for example).
Attached is a file with curation information I can pull on those Core IDs listed above. Let me know what other digging I can do! DotmaticsExport.xlsx
> I am not sure if they are actually missing or if they have been assigned alternate BRD IDs (I didn't look carefully enough). Do you think this could explain the missing compounds list?

I do. Especially given the evidence you cite in https://github.com/broadinstitute/lincs-cell-painting/issues/11#issuecomment-611800240. Thanks for digging deeper into this 🕵️ I think the core of the issue is as @shntnu mentioned:
> It is possible that the curation resulted in ids being deprecated but not being added to the deprecated_broad_id list

> generally my preference would be to use the broad_id that was active when this experiment was run. But given that we are using updated repurposing hub metadata here, that would get a bit confusing.

I agree 💯. This is very tricky... the CP platemaps themselves contain deprecated broad ids and it would not make much sense to update these since there are differences (as @jrsacher notes)
> The internal Core ID (pert_id, BRD-[AKMU]00000000) is specific to a chemical structure. For some compounds, like cyclosporin, we may have received multiple structures from different vendors, which results in more than one Core ID mapping to a name.

Therefore, I am in favor of resolving the deprecated IDs in the moa metadata in this repo, adding them to the existing files, and then proceeding as planned. I agree that the best way forward is as you suggest:
> pull out the deprecated ids into separate rows, then do an inner join with the old metadata files for LINCS on the broad_id column

Concretely:
1. […] `deprecated_broad_id` column by pipe (see #12)
2. […] `repurposing_simple.tsv` (and maybe the other two files as well). We can also add a boolean column (`is_deprecated`).
3. […] `repurposing_simple.tsv` ignoring deprecation status. This will resolve some moa/target assignments (e.g. in `SQ00014814` this will resolve 3 of 12 missing metadata).
4. […]

An easy solution for 4 is to use the Repurposing Hub metadata from 2018, which we know would map to the CP platemaps. This doesn't feel like the way to go though for some reason...
> Add into all that the fact that we migrated applications/databases that handle our chemical inventory between the two updates and I'm honestly surprised it's not more of a mess.

Totally understandable - data curation is not easy and it is extremely important. Also important to note that this is only one plate of 137. So it is definitely going to get messier.
Thanks again for your prompt response - it is greatly appreciated!
> so I'm open to suggestions about how to best handle this.

I think your previous solution of adding deprecated IDs to a pipe-separated `deprecated_broad_id` column is the right way to go. If this process is not already automated, then I strongly recommend limiting manual steps - this process is essential for mapping legacy data and I totally get how easy it can be to fall behind.
This solution is also how folks in the genetics community map gene names. They've been thinking about this problem for quite some time now. Gene symbols (e.g. `KRAS`, `TP53`) have so many synonyms. Resources that map gene names across synonyms and identifiers are essential (see https://mygene.info/). This is a pretty heavyweight solution, but it does illustrate how important these things are :)
How do you suggest we proceed with step 4 above?
Is it worth going through the other Repurposing Hub sample/drug versions and pulling out every unique `broad_id`?
Also, @shntnu
> LMK if you want to delegate any of this to either of us, @gwaygenomics

It is my understanding that this needs to happen in the JUMP-CP project as well. Given that we should avoid duplicating efforts, if you and @niranjchandrasekaran want to tag team on the heavy lifting, I'd be happy reviewing pull requests that add these details to the repo.
I am mostly thinking out loud here (@gwaygenomics please correct me if what I say doesn't make sense). The main purpose of this exercise is to map profiles of repurposing drugs to the latest MOAs by taking the following path
Broad ID of profiles -> repurposing_samples 2020 `broad_id` -> repurposing_samples 2020 `pert_iname` -> repurposing_drugs 2020 `pert_iname` -> repurposing_drugs 2020 target
But it looks like the step "Broad ID of profiles -> repurposing_samples 2020 `broad_id`" is probably not possible for all drugs.
Coming to my proposed solution, based on Josh's suggestion in #11 comment, if we used InChI14 (the first 14 characters of InChIKey), the above path will change into the following
Broad ID of profiles -> repurposing_samples 2018 `broad_id` -> repurposing_samples 2018 InChIKey InChI14 -> repurposing_samples 2020 InChIKey InChI14 -> repurposing_samples 2020 `pert_iname` -> repurposing_drugs 2020 `pert_iname` -> repurposing_drugs 2020 target
For the 3 examples in #11 comment, by taking this path I am able to map the 2018 `broad_id` to the 2020 `broad_id`. As Josh noted in #11 comment, InChI14 does not capture stereochemistry, which is a drawback of this approach. But my question is, for the purpose of mapping MOAs, does the stereochemistry matter? Given biology does care about stereochemistry, it probably does matter, but for the 3 examples it doesn't seem to.
2018 pert_iname | 2020 pert_iname | 2018 broad_id | moa | 2018 target | 2020 target |
---|---|---|---|---|---|
pinitol | pinitol | BRD-A69275535 | gamma secretase inhibitor | | |
diltiazem | diltiazem | BRD-A69636825 | calcium channel blocker | CACNA1C CACNA1S CACNA2D1 CACNG1 HTR3A KCNA5 | CACNA1C CACNA1S CACNA2D1 CACNG1 HTR3A KCNA5 |
cyclosporin-a | cyclosporin-A | BRD-A69815203 | calcineurin inhibitor | PPP3CA | ABCB11 CAMLG FPR1 PPIA PPIF PPP3CA PPP3R2 SLC10A1 SLCO1B1 SLCO1B3 |
Note:
2018 cyclosporin-a | 2018 cyclosporine | 2020 cyclosporine-A |
---|---|---|
PPP3CA | ABCB11 CAMLG FPR1 PPIA PPIF PPP3R2 SLC10A1 SLCO1B1 SLCO1B3 | ABCB11 CAMLG FPR1 PPIA PPIF PPP3CA PPP3R2 SLC10A1 SLCO1B1 SLCO1B3 |
There is a good chance that these three compounds are outliers. Perhaps we should compare the entire 2018 and 2020 lists to figure out if using InChI14 for mapping is a viable strategy and if losing stereochemistry information is inconsequential for MOA assignment in this dataset.
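A rough sketch of the InChIKey14 bridge between two hub versions is given here, along with a check for cases where dropping stereochemistry makes the mapping ambiguous. It assumes each samples file can be reduced to a `broad_id -> InChIKey` dictionary; all IDs and InChIKeys are made-up placeholders, and `map_by_inchikey14` is a hypothetical helper name.

```python
from collections import defaultdict


def inchikey14(inchikey):
    """Return the first 14 characters of an InChIKey (the connectivity block)."""
    return inchikey[:14]


def map_by_inchikey14(old_samples, new_samples):
    """Map each old broad_id to every new broad_id sharing its InChIKey14.

    Both arguments are dicts of broad_id -> full InChIKey.
    """
    by_block = defaultdict(set)
    for new_id, key in new_samples.items():
        by_block[inchikey14(key)].add(new_id)
    return {
        old_id: sorted(by_block.get(inchikey14(key), set()))
        for old_id, key in old_samples.items()
    }


# Placeholder IDs and InChIKeys, made up for illustration only.
samples_2018 = {
    "OLD-1": "AAAAAAAAAAAAAA-AAAAAAAASA-N",
    "OLD-2": "BBBBBBBBBBBBBB-AAAAAAAASA-N",
    "OLD-3": "CCCCCCCCCCCCCC-AAAAAAAASA-N",
}
samples_2020 = {
    "NEW-1": "AAAAAAAAAAAAAA-AAAAAAAASA-N",
    "NEW-2a": "BBBBBBBBBBBBBB-DDDDDDDDSA-N",  # two stereoisomers that
    "NEW-2b": "BBBBBBBBBBBBBB-EEEEEEEESA-N",  # share one InChIKey14
}

mapping = map_by_inchikey14(samples_2018, samples_2020)
# Old IDs that hit more than one new ID are where stereochemistry may matter.
ambiguous = {old: new for old, new in mapping.items() if len(new) > 1}
# Old IDs with no InChIKey14 match in the newer version.
unmapped = [old for old, new in mapping.items() if not new]
```

Running this over the full 2018 and 2020 lists would give a count of ambiguous and unmapped compounds, which is one way to judge whether InChIKey14 is a viable bridge for this dataset.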
> Broad ID of profiles -> repurposing_samples 2018 broad_id -> repurposing_samples 2018 InChIKey InChI14 -> repurposing_samples 2020 InChIKey InChI14 -> repurposing_samples 2020 pert_iname-> repurposing_drugs 2020 pert_iname -> repurposing_drugs 2020 target

YES! I love this approach
> for the purpose of mapping MOAs, does the stereochemistry matter? Given biology does care about stereochemistry, it probably does matter but for the 3 examples, it doesn't seem to.

If we're going to lose signal somewhere, I think this solution minimizes it.
> Perhaps we should compare the entire 2018 and 2020 lists to figure out if using InChI14 for mapping is a viable strategy and if losing stereochemistry information is inconsequential for MOA assignment in this dataset.

I am for this exercise. We might as well do it programmatically for every drug. I think it is also worthwhile to quantify in how many instances a stereoisomer difference also corresponds to an MOA difference. It will be tough (maybe impossible?) to tease apart stereoisomer differences from regular updates between 2018 and 2020.
@shntnu - one thing to confirm before @niranjchandrasekaran embarks on this analysis is the Drug Repurposing Hub version used in the Cell Painting platemaps. It might not be the 2018 version given that the data were collected in 2015.
Worth mentioning that I am for this approach, and not just using old Broad IDs, because we should leverage the new MOA and target data that @jrsacher and others worked to update.
> It might not be the 2018 version given that the data were collected in 2015.

The `broad_id` values in the 28 platemap files added here https://github.com/broadinstitute/lincs-cell-painting/pull/10/files are from 2015, if that answers your question.
The drug/sample info from CLUE starts at 2017 - https://clue.io/repurposing#download-data. Do we know which version of these resources we used? Maybe it is not listed here...
Checking manually, all the missing `pert_ids` on the plate `SQ00014814` map to the `broad_id` of at least one entry in the 2018 version, except for `BRD-K81258678`, which maps to a `deprecated_broad_id`. Given the 100% match, shall I assume that all `pert_ids` will map to either a `broad_id` or a `deprecated_broad_id`? Or is there a quick way to confirm this? @gwaygenomics is there a way for me to access the `pert_ids` of compounds on all the plates?
> The drug/sample info from CLUE starts at 2017 - https://clue.io/repurposing#download-data. Do we know which version of these resources we used? Maybe it is not listed here...

Not listed here apparently, but 2017 comes close.
*Note @gwaygenomics edit adding collapsible section (to reduce issue bloat)
Given @shntnu's analysis, I will modify my approach to the following
Profiles `pert_id` -> repurposing_samples 2017 broad_id -> repurposing_samples 2017 InChIKey InChI14 -> repurposing_samples 2020 InChIKey InChI14 -> repurposing_samples 2020 pert_iname -> repurposing_drugs 2020 pert_iname -> repurposing_drugs 2020 target
I will think about this more carefully on Monday to make sure it makes sense.
thinking about this a bit more... what do you think of separating this deprecated ID map from the versioned broad id drugs/samples?
in other words, @niranjchandrasekaran's workflow would create a two/three column dictionary of:
broad_id | deprecated_broad_id | deprecated_version |
---|---|---|
BRD-K87873585 | BRD-A69275535 | 2018 |
BRD-K24023109 | BRD-A69636825 | 2018 |
BRD-K13533483 | BRD-A69815203 | 2018 |
This will enable us to use this file as an intermediate step to map any repurposing hub version to moa/target info. This may solve both goals:
- […] `Metadata_broad_id` in the Repurposing Hub Cell Painting Dataset
- […] `Metadata_broad_id` regardless of deprecation year

Also, maybe worth mentioning that there seem to be two roads to deprecation (and maybe they are actually equivalent):
- […] `deprecated_broad_id` by the CLUE team

Maybe we can document this as well in a fourth column 🤷
@niranjchandrasekaran - does this make sense, would it work?
@gwaygenomics That's a great idea to have intermediate maps instead of a single map from profile Broad ID to targets/moa. I will look into what format would work best.
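As a rough illustration of how such an intermediate map might be used, the sketch below loads the three example rows from the table in this thread and resolves any ID to its current `broad_id`. The function names and the chain-following behaviour (an ID deprecated more than once) are assumptions, not something the thread specifies.

```python
# The three example rows proposed in the thread:
# (broad_id, deprecated_broad_id, deprecated_version)
DEPRECATED_MAP = [
    ("BRD-K87873585", "BRD-A69275535", "2018"),
    ("BRD-K24023109", "BRD-A69636825", "2018"),
    ("BRD-K13533483", "BRD-A69815203", "2018"),
]


def build_lookup(rows):
    """Index the map by deprecated ID so lookups are a single dict access."""
    return {deprecated: current for current, deprecated, _version in rows}


def resolve_current(broad_id, lookup):
    """Follow deprecated -> current links until an ID with no newer form is reached."""
    seen = set()
    while broad_id in lookup and broad_id not in seen:
        seen.add(broad_id)  # guards against accidental cycles in the map
        broad_id = lookup[broad_id]
    return broad_id


lookup = build_lookup(DEPRECATED_MAP)
print(resolve_current("BRD-A69275535", lookup))  # deprecated ID
print(resolve_current("BRD-K24023109", lookup))  # already current
```

Keeping the map separate from any one hub release means the moa/target tables only ever need to be keyed on current IDs.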
One additional thought that might help:
If you have the plate barcodes that you've tested over the years, we could likely run them to pull the current Broad IDs. This could give you a mapping on a per-well level.
Depending on what Compound Management has provided you over the years, there's also an internal Sample ID (starts with `SA`) that is a unique identifier in the CBIP database.
Most (if not all) of the compounds we silently dropped in this version had to be excluded for regulatory reasons. A few were dropped because they aren't compatible with screening (insoluble inorganics, polymers, etc.). If you need that list, I can provide it.
> If you have the plate barcodes that you've tested over the years, we could likely run them to pull the current Broad IDs.

For the specific dataset in this repo (Cell Painting of a subset of Drug Repurposing Hub Compounds) as long as this exercise maps everything to current Broad IDs and the current Broad IDs have moa/target info then, this is the minimum requirement to proceed with metadata annotation.
In other words, if we provide a list of all unique Broad IDs in this project and then a current Broad ID list is retrieved using clue.io-specific resources that provides a 1-1 mapping of moa/target info, we're all set for `lincs_cell_painting` (after a quick fidelity check).
Essentially, this is what @niranjchandrasekaran has proposed to do for all repurposing hub broad IDs by using intermediate files and columns. Maybe (especially for the JUMP-CP project) we want to create the deprecated map in addition to solving this repo's specific issue?
@jrsacher - this is great. A couple of followup questions:
I imagine that this internal mapping is private/internal to CMAP. Is there any way this resource could be made public on the CLUE website? It would make processing and data provenance easier to track for this repo (and other projects too).
I can add old IDs to the `deprecated_broad_id` column in a future version. I'll admit that the process I used to pull them was likely not complete. I'm not sure of the best way to handle compounds that were removed. Perhaps for the next incarnation, we just leave them in the database and add a `display` boolean for the site. That way, they'd still be obtainable via the API.
In what format would you like me to send you a list of Broad IDs? A one column text file? Do you need any other metadata info?
Yup! Just the full Broad IDs (BRD-X12345678-001-01-9) would be perfect if you have them. Otherwise, I can work from plate barcodes or possibly other data. If you want to include any other metadata that would be relevant to you, feel free.
Would you have an estimated time/effort that this would take? I think we have a pretty good sense of effort to pursue the alternative method - I imagine querying the Database with a file would be fairly straightforward. Please let us know if not!
It seems like this should only take 10-15 minutes, which means it will take about 2-3 hours. I'm happy to do this, though, as it will really help REPO in the future.
If you could provide them, I'd appreciate your thoughts about columns/fields that would be helpful to see in a final file. I'll add as much as I can.
> Yup! Just the full Broad IDs (BRD-X12345678-001-01-9) would be perfect if you have them.

Great - here is the full list (added just now 4d37dd7b7a66ccfe7b8af3336416161c2c2b017c in #10)
> It seems like this should only take 10-15 minutes, which means it will take about 2-3 hours. I'm happy to do this, though, as it will really help REPO in the future.

I know the feeling
> If you could provide them, I'd appreciate your thoughts about columns/fields that would be helpful to see in a final file. I'll add as much as I can.

We'd love to see (in the future and assuming that it's easy to retrieve) a list of all broad samples ever used, a map to their current sample and broad ID, the version where it has been deprecated (if it was), and a reason for its deprecation (can be coded\*). I think these details are necessary and sufficient to map between all Repurposing Hub versions and provide a reference to why certain samples are dropped. Having granularity in the reference (i.e. the code) will help researchers determine the extent to which a perturbation can be trusted. (For example, deprecated b/c the compound was insoluble is different from deprecated b/c the compound can no longer be purchased - the insoluble deprecation isn't likely to lead to a trusted profile!)
\* for reasons you mentioned (i.e. insoluble inorganics, polymers, etc.)
Also maybe important to note: the annotated map from `broad_sample_info.tsv` (see above) to current broad IDs is of much higher priority than the full broad sample map. The former benefits this project (and thus the many, many extensions of these data) while the latter will benefit the larger community.
Also, to chat more about these ideas @jrsacher - I think it would be best to open a new issue (can be in this repo, or in a different repo of your choosing). I'm happy to help with this and to bounce ideas off of! Let's keep the discussion in this issue to updating old Broad IDs
Getting to the bottom of this. I compared the broad IDs profiled across the Cell Painting plates (see #10) to every single broad ID in every CLUE resource (four dates: `"20170327", "20180516", "20180907", "20200324"`), including deprecated broad IDs (see #13). Pull request incoming to add this code.
There are 18 broad IDs in the Cell Painting plates that are not described in any available resource we've explored in this repo:
['BRD-K04887706',
'BRD-K41996876',
'BRD-A62025033',
'BRD-A20131130',
'BRD-K03816923',
'BRD-A67373739',
'BRD-K73395020',
'BRD-K01192156',
'BRD-A84045418',
'BRD-K60623809',
'BRD-A77216878',
'BRD-K36324071',
'BRD-A58280226',
'BRD-A69951442',
'BRD-A43331270',
'BRD-K41895714',
'BRD-K23875128',
'BRD-K03842655']
There are also an additional 45 `broad_ids` that do not align (using InChIKey14 as an intermediate) to CLUE annotations in the most up-to-date resource (`20200324`): CP_InChIKey14_missing_in_20200324.txt (PR incoming).
These `broad_ids` do have annotations in the `20170327` CLUE resource.
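The comparison described here amounts to a set difference against the union of every CLUE release. A minimal sketch follows, assuming each release has already been reduced to a set of IDs (current plus deprecated); the version strings are the four dates named in the thread, while the IDs and function names are placeholders, not the code from the pull request.

```python
def unmatched_ids(profiled_ids, resources):
    """Return profiled IDs that appear in none of the resource versions."""
    known = set().union(*resources.values())
    return sorted(set(profiled_ids) - known)


def missing_from_latest(profiled_ids, resources, latest):
    """Return profiled IDs found in some older version but not in `latest`."""
    older = set().union(*(ids for v, ids in resources.items() if v != latest))
    return sorted((set(profiled_ids) & older) - resources[latest])


# Placeholder IDs; each value stands for current + deprecated IDs of a release.
resources = {
    "20170327": {"ID-1", "ID-2"},
    "20180516": {"ID-2"},
    "20180907": {"ID-2"},
    "20200324": {"ID-2", "ID-3"},
}
profiled = ["ID-1", "ID-2", "ID-4"]

not_anywhere = unmatched_ids(profiled, resources)
only_in_older = missing_from_latest(profiled, resources, "20200324")
```

Note that this sketch matches on the ID itself, whereas the 45 IDs mentioned above were identified using InChIKey14 as the intermediate, so the second function only illustrates the shape of that check.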
We tried really hard to get these annotations accurate! Here is a current summary of broad_id annotations:
Description | Count |
---|---|
Total Broad IDs | 1514 |
Broad IDs with complete MOA | 1450 |
Broad IDs with complete target | 1225 |
Broad IDs with complete MOA and target | 1224 |
Broad IDs missing annotations in all CLUE resources | 18 |
Broad IDs missing annotations in 20200324 | 45 |
There is zero overlap between Broad IDs in the two "missing" rows of the table above.
So, we could do two things:
- […] `20200324`.

I think this is overkill! We might want to leave these exercises for future annotation upgrades.
@shntnu @niranjchandrasekaran - thoughts on next steps are welcome
We discussed a solution during profiling checkin today, which I will summarize below:
- […] `alternate_moa` and `alternate_target` column in the cases where the same InChIKey14 maps to two different moa/targets on the basis of different stereochemistry.

In the updated profiles we will be clear about how we're deriving the moa/target annotations (both in the documentation and by providing the processing code).
I have a PR that is just about ready to go to generate the map. I will file this PR and ask @niranjchandrasekaran to review. From there, I will proceed with the steps outlined here https://github.com/broadinstitute/lincs-cell-painting/issues/4#issue-577535384
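One way the `alternate_moa` and `alternate_target` columns could be filled is sketched below. The thread does not say which annotation should count as the primary one, so this example simply takes the first in sorted order; that rule, the pipe separator, the function name, and the example values are all assumptions.

```python
def annotate_with_alternates(candidates):
    """Collapse all (moa, target) pairs found for one InChIKey14 into one record.

    The first pair in sorted order becomes the primary annotation (an arbitrary
    rule chosen for this sketch); the rest go into the alternate_* columns.
    """
    unique = sorted(set(candidates))
    if not unique:
        return {"moa": "", "target": "", "alternate_moa": "", "alternate_target": ""}
    primary, alternates = unique[0], unique[1:]
    return {
        "moa": primary[0],
        "target": primary[1],
        "alternate_moa": "|".join(moa for moa, _ in alternates),
        "alternate_target": "|".join(target for _, target in alternates),
    }


# Hypothetical annotations for two stereoisomers sharing one InChIKey14.
record = annotate_with_alternates([
    ("moa-b", "TARGET2"),
    ("moa-a", "TARGET1"),
    ("moa-a", "TARGET1"),  # duplicates collapse
])
single = annotate_with_alternates([("moa-a", "TARGET1")])
```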
@niranjchandrasekaran Q for you when you get the chance:
[…] solve any of these mapping problems fully?
(1) I have the Broad ID of a compound, or (2) I have the InChIKey of a compound ... and I want to know its MOA (as deposited in clue.io/repurposing)
If it doesn't, where does it fail?
I don't believe that file will solve Yu's mapping issue. The map that Greg and I created helps map compounds across the different versions of the repurposing hub list, but it seems to fail for new CDoT experiments.
IIUC, our problem is the following
When CDoT shares a platemap for a new experiment, the `broad_sample` names in those files don't match the `broad_sample` names in the repurposing hub lists. Using the first 13 characters of `broad_sample` names does help, but even then, not all compounds seem to map (that's what Yu said yesterday). We also have the one-to-many (`broad_sample` to `pert_iname`) mapping issue, but we may be able to alleviate that by combining all the `pert_iname`s and their `moa` and `target` maps.
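A small sketch of the two workarounds mentioned here: truncating `broad_sample` to its first 13 characters, and combining every `pert_iname` and `moa` that share the truncated ID. The sample ID is the format example quoted earlier in the thread; the annotation rows, the pipe separator, and the function names are made up for illustration.

```python
from collections import defaultdict


def to_pert_id(broad_sample):
    """Keep the first 13 characters, e.g. BRD-X12345678 from BRD-X12345678-001-01-9."""
    return broad_sample[:13]


def combine_annotations(records):
    """Merge (broad_sample, pert_iname, moa) rows that share a 13-character ID."""
    grouped = defaultdict(lambda: {"pert_iname": set(), "moa": set()})
    for broad_sample, pert_iname, moa in records:
        entry = grouped[to_pert_id(broad_sample)]
        entry["pert_iname"].add(pert_iname)
        entry["moa"].add(moa)
    # Join each set into a pipe-separated string, sorted for reproducibility.
    return {
        pert_id: {field: "|".join(sorted(values)) for field, values in entry.items()}
        for pert_id, entry in grouped.items()
    }


# Hypothetical rows: two batches of one compound recorded under two names.
combined = combine_annotations([
    ("BRD-X12345678-001-01-9", "name-a", "moa-a"),
    ("BRD-X12345678-001-02-7", "name-b", "moa-a"),
])
```

As noted above, truncation alone does not map every compound, so any sample whose 13-character ID is still absent from the hub lists would need to be reported separately.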
I had similar problems with Target-1 and Target-2, but since we designed the plates ourselves, I knew exactly how the `broad_sample` names given by CDoT mapped to our list of `broad_sample` names. Since we didn't design the adipocyte platemaps, that has become an issue.
> I had similar problems with Target-1 and Target-2, but since we designed the plates ourselves, I knew exactly how the broad_sample names given by CDoT mapped to our list of broad_sample names.

Actually, I was able to solve the mapping problem easily only for Target-1. The Target-2 plate map was designed by Anita, so I wasn't able to map it readily to Target-1. Luckily, Anita also shared the `pert_iname`s of compounds in Target-2, along with the platemap.
Based on the experience with Target-2, I guess it would be nice to get the `pert_iname`s of compounds from CDoT instead of `broad_sample` names.
I have encountered perhaps a significant hurdle in adding Cell Painting Repurposing Hub profiles to this repo.
There are broad ids (`pert_id`) in the profile data that are absent from the updated moa information.
For example, in one plate (`SQ00014814`) the following `pert_ids` are present (with annotations) in the profile data, but are absent in the repurposing moa files in this repo:
Given that these `pert_ids` have annotations in cytominer-derived profiles, this indicates that the `pert_ids` have changed somewhere.
Before I pursue this issue, I was wondering if there are any known solutions or datasets that map old to updated `pert_ids`. cc @shntnu @niranjchandrasekaran
Perhaps also @jrsacher has insight here. Josh, I scanned the CLUE and DepMap resources and was not able to find a map. I also checked the column `deprecated_broad_id` and I was able to recover 3 of the profiles (['BRD-K50691590', 'BRD-K50691590', 'BRD-K81258678']).
Any insights or pointers here would be greatly appreciated!