broadinstitute / lincs-cell-painting

Processed Cell Painting Data for the LINCS Drug Repurposing Project
BSD 3-Clause "New" or "Revised" License

InChIKey14s can contain duplicate MOA/Target Info #17

gwaybio closed 4 years ago

gwaybio commented 4 years ago

In #12 we used InChIKey14 to map broad_ids and in #11 we discussed why this is important.

While processing some data, I noticed that InChIKey14s do not map uniquely to MOAs and targets. I guess this is not surprising given that drugs are often used for different indications in various clinical phases, but it is worth documenting here! It is dangerous to use InChIKey14s to map directly to MOA/Targets.

For example, InChIKey14 KTEIFNKAUNYNJU maps to two MOA/Targets. However, it looks like the full InChIKey does map uniquely. I didn't comprehensively explore this.
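
A minimal sketch of how such a check could be made comprehensive, assuming the annotations are available as (full InChIKey, MOA, target) rows. The row layout, the `ambiguous_keys` helper, and the second and third blocks of the keys below are all placeholders for illustration, not values from the repurposing hub files.

```python
from collections import defaultdict

# Hypothetical rows: (full InChIKey, MOA, target).
# Only the first block of the first two keys comes from this issue; the
# remaining blocks and the whole third row are made-up placeholders.
ROWS = [
    ("KTEIFNKAUNYNJU-AAAAAAAAAA-N", "ALK tyrosine kinase receptor inhibitor", "ALK,MET"),
    ("KTEIFNKAUNYNJU-BBBBBBBBBB-N", "MTH1 inhibitor", "NUDT1"),
    ("ZZZZZZZZZZZZZZ-CCCCCCCCCC-N", "placeholder moa", "PLACEHOLDER"),
]


def ambiguous_keys(rows, key_length=None):
    """Return keys that map to more than one distinct (MOA, target) pair."""
    annotations = defaultdict(set)
    for inchikey, moa, target in rows:
        key = inchikey[:key_length]  # key_length=None keeps the full key
        annotations[key].add((moa, target))
    return sorted(key for key, pairs in annotations.items() if len(pairs) > 1)


# Truncating to 14 characters gives the InChIKey14.
inchikey14_clashes = ambiguous_keys(ROWS, key_length=14)
full_key_clashes = ambiguous_keys(ROWS)

print("Ambiguous InChIKey14s:", inchikey14_clashes)
print("Ambiguous full InChIKeys:", full_key_clashes)
```

On this toy input, only the truncated key is flagged. Running the same check over the real annotation file would show whether the full InChIKey is unique everywhere.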

[Image: table of the rows that share InChIKey14 KTEIFNKAUNYNJU]

@niranjchandrasekaran - maybe I missed this, but was there a reason to use InChIKey14 instead of the full InChIKey?

niranjchandrasekaran commented 4 years ago

@gwaygenomics I used InChIKey14 instead of InChIKey because the latter suffers from the same problem as broad_id: both account for a compound's stereochemistry. If a compound's stereochemistry differs across repurposing hub versions, we wouldn't be able to map it across versions. In your example above (https://github.com/broadinstitute/lincs-cell-painting/issues/17#issue-605122285) the first four rows represent one isomer while the last three represent another. The broad_id and InChIKey are the same within the first four rows and within the last three rows, while InChIKey14 is the same across all of them.
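
To illustrate the trade-off, here is a rough sketch of mapping broad_ids between two hub versions by full InChIKey versus InChIKey14. The broad_ids, the stereo blocks of the keys, and the `map_versions` helper are hypothetical and only meant to show the idea.

```python
# Hypothetical records from two repurposing hub versions, keyed by broad_id.
# The broad_ids and the second/third InChIKey blocks are placeholders.
HUB_OLD = {"BRD-OLD-0001": "KTEIFNKAUNYNJU-AAAAAAAAAA-N"}
HUB_NEW = {"BRD-NEW-0001": "KTEIFNKAUNYNJU-BBBBBBBBBB-N"}


def map_versions(old, new, key_length=None):
    """Pair each old broad_id with the new broad_ids sharing its key.

    key_length=14 compares InChIKey14s (connectivity only);
    key_length=None compares full InChIKeys (stereochemistry included).
    """
    new_by_key = {}
    for broad_id, inchikey in new.items():
        new_by_key.setdefault(inchikey[:key_length], []).append(broad_id)
    return {
        broad_id: new_by_key.get(inchikey[:key_length], [])
        for broad_id, inchikey in old.items()
    }


full_key_map = map_versions(HUB_OLD, HUB_NEW)
inchikey14_map = map_versions(HUB_OLD, HUB_NEW, key_length=14)

print("Full InChIKey:", full_key_map)
print("InChIKey14:", inchikey14_map)
```

With the full key, the old record finds no partner because the stereo block differs. With InChIKey14 it does, at the cost of merging stereoisomers.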

As we briefly discussed in https://github.com/broadinstitute/lincs-cell-painting/issues/11#issuecomment-612176739, ignoring stereochemistry may not be ideal. If different stereoisomers have significantly different MOA annotations, perhaps the strategy of using InChIKey14 as the common field for mapping across the different repurposing hub versions is inadequate.

gwaybio commented 4 years ago

We discussed this issue in the profiling checkin - the full summary is here https://github.com/broadinstitute/lincs-cell-painting/issues/11#issuecomment-618480910

The pertinent info for this issue is:

To solve the different stereoisomer issues, we will create an alternate_moa and alternate_target column in the cases where the same InChIKey14 maps to two different MOAs/targets on the basis of different stereochemistry.

gwaybio commented 4 years ago

Concretely, the profiles for the compound above would look like this:

| Metadata_broad_id | Metadata_moa | Metadata_target | Metadata_alternative_moa | Metadata_alternative_target |
| --- | --- | --- | --- | --- |
| BRD-K78431006 (or whichever 2016 Broad ID matches to InChIKey14 KTEIFNKAUNYNJU) | ALK tyrosine kinase receptor inhibitor | ALK,MET | MTH1 inhibitor | NUDT1 |

We will also have to make some manual ordering decisions (i.e., which MOA is primary and which is alternative).
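
As a rough sketch of how the collapse could work, assuming one input row per (InChIKey14, MOA, target) annotation: the `collapse` helper, the input layout, and the `PRIMARY_MOA` lookup for the manual ordering are assumptions for illustration, not the code used in this repo.

```python
# Hypothetical input rows, one per annotation. The order is deliberately
# reversed to show that the manual choice, not row order, picks the primary.
ROWS = [
    {"InChIKey14": "KTEIFNKAUNYNJU", "moa": "MTH1 inhibitor", "target": "NUDT1"},
    {
        "InChIKey14": "KTEIFNKAUNYNJU",
        "moa": "ALK tyrosine kinase receptor inhibitor",
        "target": "ALK,MET",
    },
]

# Manual ordering decisions: which MOA counts as primary for an InChIKey14.
PRIMARY_MOA = {"KTEIFNKAUNYNJU": "ALK tyrosine kinase receptor inhibitor"}


def collapse(rows, primary_moa):
    """Collapse rows into one profile per InChIKey14 with alternative columns."""
    grouped = {}
    for row in rows:
        pairs = grouped.setdefault(row["InChIKey14"], [])
        pair = (row["moa"], row["target"])
        if pair not in pairs:  # drop exact duplicates, keep first-seen order
            pairs.append(pair)

    profiles = {}
    for key, pairs in grouped.items():
        if len(pairs) > 2:
            raise ValueError(f"{key} has more than two MOA/target pairs")
        # Stable sort: the manually chosen primary MOA (if any) goes first.
        ordered = sorted(pairs, key=lambda pair: pair[0] != primary_moa.get(key))
        primary = ordered[0]
        alternative = ordered[1] if len(ordered) == 2 else ("", "")
        profiles[key] = {
            "Metadata_moa": primary[0],
            "Metadata_target": primary[1],
            "Metadata_alternative_moa": alternative[0],
            "Metadata_alternative_target": alternative[1],
        }
    return profiles


profiles = collapse(ROWS, PRIMARY_MOA)
print(profiles["KTEIFNKAUNYNJU"])
```

Keys without an entry in `PRIMARY_MOA` keep their first-seen annotation as primary, so only the ambiguous compounds need a manual decision.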

niranjchandrasekaran commented 4 years ago

@gwaygenomics I believe the markdown renderer mistook the pipe between ALK and MET to indicate column separation in the markdown table. Just wanted to bring that to your attention.

gwaybio commented 4 years ago

thanks - updated