RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Should KEGG:C00161 be treated as a synonym of CHEBI:35910? #1327

Closed chunyuma closed 3 months ago

chunyuma commented 3 years ago

Hi @edeutsch,

This might be another issue for the NodeSynonymizer. I checked the node source distribution of biolink:Metabolite in today's kg2.5.2c. As you can see below, most of them are from KEGG:

Screen Shot 2021-03-24 at 5 41 33 PM

If you check some of the KEGG curies in kg2.5.2c, their equivalent_curies are just themselves.

n.id n.equivalent_curies
"KEGG:C21457" ["KEGG:C21457"]
"KEGG:C04167" ["KEGG:C04167"]
"KEGG:C21458" ["KEGG:C21458"]
"KEGG:C04170" ["KEGG:C04170"]
"KEGG:C21459" ["KEGG:C21459"]
"KEGG:C21460" ["KEGG:C21460"]
"KEGG:C21461" ["KEGG:C21461"]
"KEGG:C21462" ["KEGG:C21462"]
"KEGG:C04180" ["KEGG:C04180"]
"KEGG:C21463" ["KEGG:C21463"]

If you take a look at KEGG:C00161, its equivalent_curies are just itself and PathWhiz.ElementCollection:552.

n.id n.equivalent_curies
"KEGG:C00161" ["KEGG:C00161", "PathWhiz.ElementCollection:552"]

However, if you check the KEGG database for KEGG:C00161: Screen Shot 2021-03-24 at 5 55 16 PM

It should have a synonym CHEBI:35910: Screen Shot 2021-03-24 at 5 56 24 PM

But right now, in kg2.5.2c, they are not clustered together and CHEBI:35910 is biolink:MolecularEntity. Screen Shot 2021-03-24 at 6 02 53 PM

There may be other KEGG CURIEs like this one.

edeutsch commented 3 years ago

Hi @chunyuma, thanks for the report. I see the issue, but I don't think the NodeSynonymizer can help here. As far as I can tell, KEGG:C00161 and CHEBI:35910 are not linked to each other via registered equivalency in KG2.5.2 or in the SRI Node Normalizer, and their names are different. So there is currently no way for the NodeSynonymizer to link them unless I put in a manual exception, which I could.

It seems to me that we should ask the KG2 team whether the equivalencies to CHEBI:35910 and PubChem can be made during ETL of KEGG. Tagging @ericawood and @saramsey for an assessment. Let me know if you think I've misunderstood.

ecwood commented 3 years ago

As far as I can tell, KEGG is "pay to use" (see https://www.pathway.jp/en/academic.html). Thus, we can't ETL KEGG directly without paying for a license. The parts of KEGG that are currently in KG2 are remnants of KG1 (see code below): https://github.com/RTXteam/RTX/blob/884a253d33cb644ec3a20fa60fcc005b111037ef/code/kg2/rtx_kg1_neo4j_to_kg_json.py#L33-L62

saramsey commented 3 years ago

I am not a chemist (IANAC), but I am not sure that CHEBI:35910 (2-oxo monocarboxylic acid, for which residue R1 by definition cannot contain a carboxyl group) and KEGG:C00161 (2-oxocarboxylic acid, for which residue R could in principle contain a carboxyl group), are the same thing.

Consider CHEBI:72700 (2-oxopimelic acid):

chebi-72700

It is a dicarboxylic acid, thus cannot subclass CHEBI:35910. But it is clearly a subclass of 2-oxocarboxylic acid. In any ontology, if you have classes A, B, and C, and C subclasses B and C does not subclass A, then A and B cannot be equivalent. Thus it seems to me that there should not be a biolink:same_as relationship between KEGG:C00161 and CHEBI:35910.
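
A minimal sketch of the subclass argument above, in Python. It assumes a closed-world reading (a subclass edge that is not listed is treated as absent), and the single edge in `PARENTS` is taken from the claim in this comment, not loaded from ChEBI or KEGG. The function names are hypothetical.

```python
def ancestors(node, parents):
    """Return every class reachable from `node` via subclass-of edges."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def equivalence_counterexamples(a, b, parents):
    """List classes that subclass exactly one of `a` and `b`.

    If `a` and `b` were equivalent, they would have the same subclasses,
    so any class returned here is a witness against equivalence.
    """
    witnesses = []
    for node in sorted(parents):
        if node in (a, b):
            continue
        found = ancestors(node, parents)
        if (a in found) != (b in found):
            witnesses.append(node)
    return witnesses


# Assumption from the comment above: 2-oxopimelic acid (CHEBI:72700) is a
# 2-oxocarboxylic acid (KEGG:C00161) but not a 2-oxo monocarboxylic acid
# (CHEBI:35910), so only one subclass edge is listed.
PARENTS = {"CHEBI:72700": ["KEGG:C00161"]}

witnesses = equivalence_counterexamples("CHEBI:35910", "KEGG:C00161", PARENTS)
print(witnesses)
```

Any non-empty result means the two classes cannot be merged under this reading.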

saramsey commented 3 years ago

Nevertheless, this issue raises a broader point. It would be nice if we could get cross-references between KEGG and CHEBI. Hi @ericawood and @kvarforl, I wonder if Unichem can help here? I note that Unichem lists some KEGG fields:

Screen Shot 2021-04-02 at 11 25 26 AM

See #1354

saramsey commented 3 years ago

I note that the SRI node normalizer does not have an equivalence relationship between KEGG:C00161 and CHEBI:35910:

Screen Shot 2021-04-07 at 5 23 56 PM Screen Shot 2021-04-07 at 5 24 19 PM

chunyuma commented 3 years ago

Hi @saramsey, @kvarforl and @ericawood, could we use Biopython to call the KEGG API (e.g., Bio.KEGG.REST.kegg_get) to extract the cross-references between KEGG and other DBs? This is what I'm doing now to get the SMILES for KEGG CURIEs.
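
A minimal sketch of that idea, assuming the KEGG flat-file layout in which field names occupy the first 12 columns and cross-references are listed under `DBLINKS`. The sample record, the compound ID `C99999`, its cross-reference values, and the prefix map are made up for illustration. `fetch_kegg_record` wraps `Bio.KEGG.REST.kegg_get`, needs Biopython and network access, and is not called here.

```python
def fetch_kegg_record(kegg_id):
    """Fetch one KEGG entry as text (requires Biopython and network access)."""
    from Bio.KEGG import REST  # imported lazily so the parser works without it

    handle = REST.kegg_get(kegg_id)
    try:
        return handle.read()
    finally:
        handle.close()


def parse_dblinks(record_text):
    """Return {database name: [ids]} from the DBLINKS field of one record."""
    xrefs = {}
    field = None
    for line in record_text.splitlines():
        if line.startswith("///"):  # end-of-record marker
            break
        key = line[:12].strip()
        if key:  # a new field starts; blank keys are continuation lines
            field = key
        if field != "DBLINKS":
            continue
        db, sep, ids = line[12:].partition(":")
        if sep:
            xrefs.setdefault(db.strip(), []).extend(ids.split())
    return xrefs


def to_curies(xrefs, prefix_map):
    """Turn parsed cross-references into CURIEs, skipping unmapped databases."""
    curies = []
    for db, ids in xrefs.items():
        prefix = prefix_map.get(db)
        if prefix is None:
            continue
        curies.extend(f"{prefix}:{local_id}" for local_id in ids)
    return sorted(curies)


# Made-up record for illustration only; not real KEGG data.
SAMPLE_RECORD = "\n".join([
    "ENTRY".ljust(12) + "C99999  Compound",
    "NAME".ljust(12) + "made-up compound",
    "DBLINKS".ljust(12) + "CAS: 000-00-0",
    " " * 12 + "ChEBI: 11111 22222",
    "ATOM".ljust(12) + "0",
    "///",
])

# Hypothetical mapping from KEGG database labels to CURIE prefixes.
PREFIX_MAP = {"CAS": "CAS", "ChEBI": "CHEBI"}

xrefs = parse_dblinks(SAMPLE_RECORD)
curies = to_curies(xrefs, PREFIX_MAP)
print(curies)
```

With a live connection, `parse_dblinks(fetch_kegg_record("C00161"))` would be the corresponding call for the compound discussed in this issue.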

saramsey commented 3 years ago

Hi @chunyuma: what KEGG-to-other-DB cross-references do you specifically need? We already get a lot of them via CHEBI, Uniprot, and other knowledge sources.

I wonder if we might be able to get more information about KEGG via CTD. Adding a dependency on the KEGG API in the KG2 build process would be my least-favored option. The thing about KG2 is that it is a deterministic build: starting from the same upstream knowledge-source flat files and rerunning the KG2 build, we would get the exact same result. As soon as we start introducing live API dependencies, that can change. Using flat files as inputs for building KG2 makes it easier to debug, fairly fast to build (considering the massive size of KG2 and the many gigabytes of upstream sources that it depends on), and robust against API changes or downtime. Hey @zheng-liu, do you know if we can get information about KEGG compounds from CTD? Can we get SMILES information?
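
On the deterministic-build point above, one way to check that two builds start from the same inputs is to hash every upstream flat file into a manifest and compare manifests. This is a sketch under that assumption, not how the KG2 build works; the function names and the file `a.tsv` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large flat files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(input_dir):
    """Map each file under `input_dir` (by relative path) to its SHA-256."""
    root = Path(input_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


# Demonstration on a throwaway directory: identical inputs give identical
# manifests, and changing one file changes the manifest.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.tsv").write_text("x\t1\n")
    first = build_manifest(tmp)
    second = build_manifest(tmp)
    Path(tmp, "a.tsv").write_text("x\t2\n")
    third = build_manifest(tmp)

print(first == second, first == third)
```

A response fetched from a live API could be saved as a flat file and covered by the same manifest, which keeps the build itself free of network calls.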

ecwood commented 3 years ago

@chunyuma Are you looking for KEGG pathways or KEGG compounds? If it is KEGG compounds, the new Reactome ETL should provide some cross references to them. However with #1125, the KEGG nodes might be gone...

chunyuma commented 3 years ago

Hi @saramsey and @ericawood, I'm currently collecting SMILES information for KEGG compounds by extracting the CAS, ChEMBL, ChEBI, and PubChem IDs from KEGG via the KEGG API. Since @saramsey raised this question earlier:

> Nevertheless, this issue raises a broader point. It would be nice if we could get cross-references between KEGG and CHEBI. Hi @ericawood and @kvarforl, I wonder if Unichem can help here? I note that Unichem lists some KEGG fields:

I just want to provide one possible way to make cross-references between KEGG and other DBs. But perhaps you will have a better way.

> what KEGG-to-other-DB cross-references do you specifically need?

I don't strictly need them in KG2; I can do this on my own KG for drug repurposing. But if we could add these cross-references to KG2, it might help KG2c cluster concepts that have the same semantic meaning.

> Are you looking for KEGG pathways or KEGG compounds?

Hi @ericawood, I'm interested in KEGG compounds, in order to collect their corresponding SMILES strings. But I think I've found a way to do this via the KEGG API, so no worries about this! Thanks so much for your help!

edeutsch commented 3 months ago

closing ancient history.