broadinstitute / lincs-cell-painting

Processed Cell Painting Data for the LINCS Drug Repurposing Project
BSD 3-Clause "New" or "Revised" License

Using InChIKey as the common field for mapping #19

Closed niranjchandrasekaran closed 4 years ago

niranjchandrasekaran commented 4 years ago

I had previously settled on using InChIKey14 as the common field for mapping across different repurposing hub versions (#13) partly due to the success in manually mapping three compounds across all the versions (https://github.com/broadinstitute/lincs-cell-painting/issues/11#issuecomment-612176739). Also, since there are only 45/1514 compounds (https://github.com/broadinstitute/lincs-cell-painting/issues/11#issuecomment-618077262) from the repurposing profiles dataset that do not map to any broad_ids in the most recent repurposing hub version (20200324), this approach may be the most effective.

But given #17, it may be worth repeating this pipeline with the full InChIKey as the common field for merging, since the full InChIKey does uniquely identify stereoisomers. My current assumption is that there will be many more than 45 compounds from the repurposing profiles dataset that do not map to the most recent broad_ids, but I believe it will be useful to know the actual number.
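To make the comparison concrete, here is a minimal sketch of how the two counts could be computed. It is not the code from 2.map-broad_id.ipynb: the function names and the InChIKeys below are made up for illustration, and it only assumes that InChIKey14 is the first hyphen-separated block of a full InChIKey (the 14-character connectivity hash, which ignores stereochemistry).

```python
def inchikey14(inchikey: str) -> str:
    """Return the first block of an InChIKey (14 characters, connectivity only)."""
    return inchikey.split("-")[0]


def count_unmapped(profile_keys, hub_keys, key_func=lambda key: key):
    """Count profile compounds whose (transformed) key is absent from the hub."""
    hub_set = {key_func(key) for key in hub_keys}
    return sum(1 for key in profile_keys if key_func(key) not in hub_set)


# Hypothetical keys, NOT real compounds: 14-char block, 10-char block, 1 char.
profile_keys = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",  # exact match in the hub
    "CCCCCCCCCCCCCC-DDDDDDDDSA-N",  # hub has the same skeleton, different stereo block
    "EEEEEEEEEEEEEE-FFFFFFFFSA-N",  # not in the hub at all
]
hub_keys = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "CCCCCCCCCCCCCC-GGGGGGGGSA-N",
]

unmapped_full = count_unmapped(profile_keys, hub_keys)
unmapped_14 = count_unmapped(profile_keys, hub_keys, key_func=inchikey14)

print(unmapped_full)  # 2: the stereo variant fails to match on the full key
print(unmapped_14)    # 1: the stereo variant matches on the first block
```

The gap between the two numbers is the cost of keeping stereochemistry information in the mapping key.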

@gwaygenomics I can begin by creating a new PR by modifying the mapping code (2.map-broad_id.ipynb) and perhaps you could re-run the rest of the pipeline to generate a table similar to https://github.com/broadinstitute/lincs-cell-painting/issues/11#issuecomment-618077262?

gwaybio commented 4 years ago

Given what we discussed and concluded at the profiling check-in (https://github.com/broadinstitute/lincs-cell-painting/issues/11#issuecomment-618480910), I will proceed with using InChIKey14.

Concretely, this means merging #18 and then I will file a PR to generate the map.

> @gwaygenomics I can begin by creating a new PR by modifying the mapping code (2.map-broad_id.ipynb) and perhaps you could re-run the rest of the pipeline to generate a table similar to #11 (comment)?

Right now the table is built manually, but I should try to automate it.

> My current assumption is that there will be many more than 45 compounds from the repurposing profiles dataset that do not map to the most recent broad_ids, but I believe it will be useful to know the actual number.

I agree with the assumption, but I don't know how the actual number will help us. It might though and I'm just not thinking of it!

So, I think the path should be to proceed with how we discussed and then revisit if indeed the full InChIKey helps us in some way.

niranjchandrasekaran commented 4 years ago

> I agree with the assumption, but I don't know how the actual number will help us. It might though and I'm just not thinking of it!
>
> So, I think the path should be to proceed with how we discussed and then revisit if indeed the full InChIKey helps us in some way.

I was interested in comparing the numbers just to be sure that using InChIKey14 was a real advantage over using the full InChIKey. If the gains were marginal, then using the full InChIKey wouldn't be a bad idea, as we would still retain stereochemistry information. But given the new solution (https://github.com/broadinstitute/lincs-cell-painting/issues/11#issuecomment-618480910), I agree with you that we should proceed as planned and worry about the full InChIKey later.

gwaybio commented 4 years ago

I am going to close this issue for now - if we decide to revisit this question later, let's reopen it.