niranjchandrasekaran closed this issue 4 years ago
Given what we discussed and concluded at the profiling check-in (https://github.com/broadinstitute/lincs-cell-painting/issues/11#issuecomment-618480910), I will proceed with using InChIKey14.
Concretely, this means merging #18 and then I will file a PR to generate the map.
@gwaygenomics I can begin by creating a new PR by modifying the mapping code (2.map-broad_id.ipynb) and perhaps you could re-run the rest of the pipeline to generate a table similar to #11 (comment)?
Right now the table is manual, but I should try to automate it.
My current assumption is that there will be many more than 45 compounds from the repurposing profiles dataset that do not map to the most recent broad_ids, but I believe it will be useful to know the actual number.
I agree with the assumption, but I don't know how the actual number will help us. It might though and I'm just not thinking of it!
So, I think the path should be to proceed as we discussed and then revisit if the full InChIKey does indeed help us in some way.
I was interested in comparing the numbers just to be sure that using InChIKey14 was a real advantage over using InChIKey. If the gains were marginal, then using InChIKey wouldn't be a bad idea, as we would still retain stereochemistry information. But given the new solution (https://github.com/broadinstitute/lincs-cell-painting/issues/11#issuecomment-618480910), I agree with you that we should proceed as planned and worry about InChIKey later.
I am going to close this issue for now. If we decide to revisit this question later, let's reopen it.
I had previously settled on using InChIKey14 as the common field for mapping across different repurposing hub versions (#13), partly due to the success in manually mapping three compounds across all the versions (https://github.com/broadinstitute/lincs-cell-painting/issues/11#issuecomment-612176739). Also, since only 45/1514 compounds (https://github.com/broadinstitute/lincs-cell-painting/issues/11#issuecomment-618077262) from the repurposing profiles dataset do not map to any broad_ids in the most recent repurposing hub version (20200324), this approach may be the most effective.
But given #17, it may be worth repeating this pipeline with InChIKey as the common field for merging, as InChIKey does uniquely identify stereoisomers. My current assumption is that there will be many more than 45 compounds from the repurposing profiles dataset that do not map to the most recent broad_ids, but I believe it will be useful to know the actual number.
@gwaygenomics I can begin by creating a new PR by modifying the mapping code (2.map-broad_id.ipynb) and perhaps you could re-run the rest of the pipeline to generate a table similar to https://github.com/broadinstitute/lincs-cell-painting/issues/11#issuecomment-618077262?
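For reference, here is a rough sketch of the comparison being proposed: count how many profile compounds fail to map when matching on the full InChIKey versus on InChIKey14 (the first, 14-character connectivity block of the key). The helper names `inchikey14` and `count_unmapped` are hypothetical, and the keys are made-up placeholders, not real compounds or values from the repurposing hub. This is not the code in 2.map-broad_id.ipynb, just an illustration of the idea.

```python
def inchikey14(inchikey: str) -> str:
    """Return the first block of a full InChIKey (14 characters, connectivity only)."""
    return inchikey.split("-")[0]


def count_unmapped(profile_keys, hub_keys, key_func=lambda key: key):
    """Count profile keys with no match in the hub after applying key_func to both sides."""
    hub = {key_func(key) for key in hub_keys}
    return sum(1 for key in profile_keys if key_func(key) not in hub)


# Made-up placeholder keys (assumption: not real compounds).
profile_keys = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",  # same first block as a hub entry, different second block
    "CCCCCCCCCCCCCC-DDDDDDDDSA-N",  # identical to a hub entry
    "EEEEEEEEEEEEEE-FFFFFFFFSA-N",  # absent from the hub
]
hub_keys = [
    "AAAAAAAAAAAAAA-GGGGGGGGSA-N",
    "CCCCCCCCCCCCCC-DDDDDDDDSA-N",
]

unmapped_full = count_unmapped(profile_keys, hub_keys)
unmapped_14 = count_unmapped(profile_keys, hub_keys, key_func=inchikey14)
print(f"unmapped with full InChIKey: {unmapped_full}")
print(f"unmapped with InChIKey14: {unmapped_14}")
```

In this toy case the full key leaves more compounds unmapped than InChIKey14, because entries that differ only after the first block stop matching. The same effect is what would push the real number above 45, at the benefit of keeping stereoisomers distinct.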