RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

Protein returned as a chemical #256

Open cbizon opened 1 year ago

cbizon commented 1 year ago

This query asks what chemical entity interacts with a particular gene:

{
    "query_graph": {
        "nodes": {
            "$source": {
                "categories": [
                    "biolink:ChemicalEntity"
                ]
            },
            "$target": {
                "ids": [
                    "NCBIGene:23162"
                ],
                "categories": [
                    "biolink:Gene"
                ]
            }
        },
        "edges": {
            "edge_1": {
                "subject": "$source",
                "object": "$target",
                "predicates": [
                    "biolink:interacts_with"
                ]
            }
        }
    }
}

It returns 3 identifiers for $source: {'id': 'UniProtKB:P45983'} {'id': 'MESH:D000888'} {'id': 'UMLS:C0444626'}

The first is a protein. Of course, we might consider a protein a chemical, but I'm pretty sure that the Biolink Model does not, precisely so that we can talk about non-protein things and not get our genes all mixed up with other chemicals.

amykglen commented 1 year ago

it looks like, during canonicalization of KG2, one of the concepts equivalent to UniProtKB:P45983 has a category of ChemicalEntity, which explains why that concept cluster is being returned.

maybe it's an issue in the KG2pre LOINC ingest?

https://arax.ncats.io/?term=UniProtKB:P45983

[Screenshot attached: Screen Shot 2023-02-01 at 4.45.02 PM]

saramsey commented 1 year ago

I note that in KG2.8.0pre, LOINC:LP36223-3 is marked as deprecated. One possibility that we might want to consider is having the KG2c build process intentionally not canonicalize (or rather, intentionally not include in canonicalization) any KG2pre node that is marked as deprecated=true.

amykglen commented 1 year ago

noted - definitely worth considering excluding deprecated nodes from canonicalization. although apparently 68% of the nodes in KG2pre are marked as deprecated=true, so if we fully excluded those nodes, I think our synonymization would recognize far fewer CURIEs.

one possible alternative: rather than fully excluding deprecated nodes from canonicalization, just don't automatically give the cluster a deprecated node's category... i.e., in this case, LOINC:LP36223-3 would still belong to the cluster, but that cluster wouldn't be returned as a result when someone asks for ChemicalEntity.
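A minimal sketch of that idea, to make the behavior concrete. This is not the KG2c build code: the `Node` class, the `cluster_categories` helper, and the categories assigned to the two example nodes are hypothetical and chosen only for illustration. The fallback for clusters made up entirely of deprecated nodes is also an assumption, not something proposed in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a KG2pre node; not the real schema."""
    curie: str
    category: str
    deprecated: bool = False


def cluster_categories(members):
    """Return the categories of a cluster, ignoring deprecated members.

    Deprecated nodes stay in the cluster (their CURIEs still resolve),
    but they do not contribute a category to it.
    """
    live = {n.category for n in members if not n.deprecated}
    # Assumption: if every member is deprecated, fall back to all
    # categories so the cluster is never left without one.
    return live or {n.category for n in members}


# Illustrative cluster; the categories here are assumed, not looked up.
cluster = [
    Node("UniProtKB:P45983", "biolink:Protein"),
    Node("LOINC:LP36223-3", "biolink:ChemicalEntity", deprecated=True),
]

print(cluster_categories(cluster))  # {'biolink:Protein'}
```

With this rule the LOINC CURIE would still map to the cluster, but a query for `biolink:ChemicalEntity` would no longer match it.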

another alternative is some sort of category voting where there must be some threshold fraction of member nodes that have a given category before it will be assigned to the overall cluster.
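A rough sketch of what category voting could look like. The function name `vote_categories`, the default threshold of 0.5, and the example member categories are all hypothetical; the thread does not specify a threshold or how ties should be handled.

```python
from collections import Counter


def vote_categories(member_categories, threshold=0.5):
    """Keep a category only if at least `threshold` of members carry it.

    `member_categories` holds one category string per member node.
    """
    total = len(member_categories)
    if total == 0:
        return set()
    counts = Counter(member_categories)
    return {cat for cat, n in counts.items() if n / total >= threshold}


# Illustrative cluster: three members agree, one disagrees.
members = [
    "biolink:Protein",
    "biolink:Protein",
    "biolink:Protein",
    "biolink:ChemicalEntity",
]

print(vote_categories(members))  # {'biolink:Protein'}
```

Here `biolink:ChemicalEntity` is held by only 1 of 4 members (0.25), so it falls below the assumed threshold and is not assigned to the cluster. Lowering the threshold to 0.25 would let it back in, so the choice of threshold decides how tolerant the cluster is of minority categories.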