Closed chunyuma closed 3 months ago
Hi @chunyuma, thanks for the report. I see the issue, but I don't think the NodeSynonymizer can help here. As far as I can tell, KEGG:C00161 and CHEBI:35910 are not linked to each other via registered equivalency in KG2.5.2 or in the SRI Node Normalizer, and their names are different. So there is currently no way for the NodeSynonymizer to link them unless I put in a manual exception, which I could.
It seems to me that we should ask the KG2 team whether the equivalencies to CHEBI:35910 and PubChem can be made during ETL of KEGG. Tagging @ericawood and @saramsey for an assessment. Let me know if you think I've misunderstood.
As far as I can tell, KEGG is "pay to use" (see https://www.pathway.jp/en/academic.html). Thus, we can't ETL KEGG directly without paying for a license. The parts of KEGG that are currently in KG2 are remnants of KG1 (see code below): https://github.com/RTXteam/RTX/blob/884a253d33cb644ec3a20fa60fcc005b111037ef/code/kg2/rtx_kg1_neo4j_to_kg_json.py#L33-L62
I am not a chemist (IANAC), but I am not sure that CHEBI:35910 (2-oxo monocarboxylic acid, for which residue R1 by definition cannot contain a carboxyl group) and KEGG:C00161 (2-oxocarboxylic acid, for which residue R could in principle contain a carboxyl group) are the same thing.
Consider CHEBI:727000 (2-oxopimelic acid)
It is a dicarboxylic acid, thus cannot subclass CHEBI:35910. But it is clearly a subclass of 2-oxocarboxylic acid. In any ontology, if you have classes A, B, and C, and C subclasses B and C does not subclass A, then A and B cannot be equivalent. Thus it seems to me that there should not be a biolink:same_as relationship between KEGG:C00161 and CHEBI:35910.
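The A/B/C argument above can be written down as a small check. This is only a sketch: the subclass sets are toy data that encode the reasoning in this comment (the EX: identifiers are made up), not something read from ChEBI or KEGG, and the function name is hypothetical.

```python
def distinguishing_subclasses(subclasses_of_a, subclasses_of_b):
    """Return classes that are a subclass of exactly one of A and B.

    If A and B were equivalent, every subclass of one would also be a
    subclass of the other, so any class returned here is a witness
    that A and B are not equivalent.
    """
    return set(subclasses_of_a) ^ set(subclasses_of_b)


# Toy data (assumption): A = CHEBI:35910, B = KEGG:C00161.
# The dicarboxylic acid is placed under B only, as argued above.
subclasses = {
    "CHEBI:35910": {"EX:some_2-oxo_monocarboxylic_acid"},
    "KEGG:C00161": {
        "EX:some_2-oxo_monocarboxylic_acid",
        "EX:2-oxopimelic_acid",
    },
}

witnesses = distinguishing_subclasses(
    subclasses["CHEBI:35910"], subclasses["KEGG:C00161"]
)
print(witnesses)
```

A non-empty result means at least one class separates the two, so a biolink:same_as edge between them would be unsafe under these assumptions.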
Nevertheless, this issue raises a broader point. It would be nice if we could get cross-references between KEGG and CHEBI. Hi @ericawood and @kvarforl, I wonder if Unichem can help here? I note that Unichem lists some KEGG fields:
See #1354
I note that the SRI node normalizer does not have an equivalence relationship between KEGG:C00161 and CHEBI:35910:
Hi @saramsey, @kvarforl and @ericawood, would it be possible to use Biopython to call the KEGG API (e.g. Bio.KEGG.REST.kegg_get) to extract the cross-references between KEGG and other DBs? This is what I'm now doing to get the SMILES for KEGG CURIEs.
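For reference, here is a rough sketch of what that could look like. It assumes the entry text returned by Bio.KEGG.REST.kegg_get is in KEGG's flat-file layout with a DBLINKS section. The helper names and the sample record are made up for illustration (only the ChEBI: 35910 link comes from this thread; ExampleDB is a placeholder), and the live call is kept in its own function because it needs Biopython and network access.

```python
def parse_dblinks(entry_text):
    """Return {database: [ids]} from the DBLINKS section of a KEGG entry."""
    links = {}
    in_dblinks = False
    for line in entry_text.splitlines():
        if line[:1].isspace():
            # Continuation line: belongs to the most recent section keyword.
            value = line.strip()
        else:
            # New section keyword (ENTRY, NAME, DBLINKS, ///, ...).
            key, _, value = line.partition(" ")
            in_dblinks = key == "DBLINKS"
            value = value.strip()
        if in_dblinks and ":" in value:
            db, ids = value.split(":", 1)
            links[db.strip()] = ids.split()
    return links


def fetch_kegg_entry(kegg_id):
    """Fetch one entry as text (assumes Biopython is installed; hits the live API)."""
    from Bio.KEGG import REST

    return REST.kegg_get(kegg_id).read()


# Hand-written sample in the assumed layout, not a real KEGG download.
SAMPLE_ENTRY = """\
ENTRY       C00161                      Compound
NAME        2-Oxocarboxylic acid
DBLINKS     ChEBI: 35910
            ExampleDB: 111 222
///
"""

dblinks = parse_dblinks(SAMPLE_ENTRY)
print(dblinks)
```

With network access, something like parse_dblinks(fetch_kegg_entry("cpd:C00161")) would be the live equivalent.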
Hi @chunyuma: what KEGG-to-other-DB cross-references do you specifically need? We already get a lot of them via CHEBI, Uniprot, and other knowledge sources.
I wonder if we might be able to get more information about KEGG via CTD. Adding a dependency on the KEGG API in the KG2 build process would be my least-favored option. The thing about KG2 is that it is a deterministic build: starting from the same upstream knowledge source flat files and rerunning the KG2 build, we would get the exact same result. As soon as we start introducing live API dependencies, that can change. Using flat files as inputs for building KG2 makes it easier to debug, fairly fast to build (considering the massive size of KG2 and the many gigabytes of upstream sources that it depends on), and robust against API changes or downtime. Hey @zheng-liu, do you know if we can get information about KEGG compounds from CTD? Can we get SMILES information?
@chunyuma Are you looking for KEGG pathways or KEGG compounds? If it is KEGG compounds, the new Reactome ETL should provide some cross references to them. However with #1125, the KEGG nodes might be gone...
Hi @saramsey and @ericawood, actually I'm now collecting KEGG SMILES information by extracting the CAS ID, ChEMBL ID, ChEBI ID and PubChem ID from the KEGG website via the KEGG API. Since @saramsey asked this earlier:
Nevertheless, this issue raises a broader point. It would be nice if we could get cross-references between KEGG and CHEBI. Hi @ericawood and @kvarforl, I wonder if Unichem can help here? I note that Unichem lists some KEGG fields:
I just want to provide one possible way to make cross-references between KEGG and other DBs. But perhaps you will have a better way.
what KEGG-to-other-DB cross-references do you specifically need?
It is not necessary. I can do this on my own KG for drug repurposing. But if we can add these cross-references to KG2, it could help KG2c cluster the concepts that have the same semantic meaning.
Are you looking for KEGG pathways or KEGG compounds?
Hi @ericawood, I'm interested in KEGG compounds in order to collect their corresponding SMILES strings. But I think I've found a way to do this via the KEGG API. So no worries about this! Thanks so much for your help!
closing ancient history.
Hi @edeutsch,
Perhaps this might be another issue for NodeSynonymizer. I tried to check the node source distribution of biolink:Metabolite in today's kg2.5.2c. As you can see below, most of them are from KEGG:
If you check some of the KEGG curies in kg2.5.2c, their equivalent_curies are just themselves.
If you take a look at KEGG:C00161, its equivalent_curies are just itself and PathWhiz.ElementCollection:552.
However, if you check the KEGG database for KEGG:C00161, it should have a synonym CHEBI:35910:
But right now, in kg2.5.2c, they are not clustered together and CHEBI:35910 is biolink:MolecularEntity.
There might be other KEGG curies that are like this case.
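For anyone who wants to repeat the source-distribution check, a minimal sketch is to count CURIE prefixes over the node IDs of one category. How the IDs are pulled out of kg2.5.2c is not shown; the list below is made up for illustration and the function name is hypothetical.

```python
from collections import Counter


def curie_prefix_counts(curies):
    """Count nodes per source, taking the text before the first ':' as the source."""
    return Counter(curie.split(":", 1)[0] for curie in curies)


# Made-up IDs for illustration, not the real kg2.5.2c contents.
example_ids = ["KEGG:C00161", "KEGG:C99999", "CHEBI:35910"]

counts = curie_prefix_counts(example_ids)
for prefix, n in counts.most_common():
    print(prefix, n)
```

Splitting on the first colon only keeps prefixes that contain dots, such as PathWhiz.ElementCollection, intact.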