RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

Single Exon node with the name `Exon` #367

Open dkoslicki opened 7 months ago

dkoslicki commented 7 months ago

I might have mentioned it before, but there is only a single node with the category biolink:Exon: a node with the name Exon. I think either the ETL-ing of whatever KP has exon info is borked, or something else fishy might be going on. Otherwise, should this node (and the category) just be removed?
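
One way to spot categories like this is to count nodes per category and list the ones that occur exactly once. The following is a minimal sketch, not part of the KG2 build system: the `singleton_categories` helper is hypothetical, and the three toy nodes simply reuse IDs, names, and categories reported later in this thread.

```python
from collections import Counter


def singleton_categories(nodes):
    """Return the categories that are assigned to exactly one node."""
    counts = Counter(node["category"] for node in nodes)
    return sorted(cat for cat, n in counts.items() if n == 1)


# Toy input for illustration only (assumption: nodes are dicts with a
# "category" key, as in the JSON node dumps shown in this thread).
nodes = [
    {"id": "EFO:0004423", "name": "exon", "category": "biolink:Exon"},
    {"id": "EFO:0004422", "name": "exome", "category": "biolink:PhysicalEntity"},
    {"id": "EFO:0004420", "name": "genome", "category": "biolink:PhysicalEntity"},
]

print(singleton_categories(nodes))
```

On the full graph, the same tally would more likely be done in the database itself; this version only illustrates the check.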

ecwood commented 3 months ago

This is the single biolink:Exon node in KG2 (checked in RTX-KG2.9.0pre):

{
  "iri": "http://www.ebi.ac.uk/efo/EFO_0004423",
  "synonym": [
    "exonic region"
  ],
  "category_label": "exon",
  "deprecated": "False",
  "name": "exon",
  "description": "An exon is a nucleic acid sequence that is represented in the mature form of an RNA molecule either after portions of a precursor RNA (introns) have been removed by cis-splicing or when two or more precursor RNA molecules have been ligated by trans-splicing.",
  "provided_by": "['infores:efo']",
  "id": "EFO:0004423",
  "category": "biolink:Exon",
  "update_date": "3630"
}

This node comes from EFO, which is in the multi ont load process. I would not be surprised if that ETL is "borked". I will take a look to see where this is coming from.

ecwood commented 3 months ago

Here is the term in efo.owl:

    <!-- http://www.ebi.ac.uk/efo/EFO_0004423 -->

    <owl:Class rdf:about="http://www.ebi.ac.uk/efo/EFO_0004423">
        <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/BFO_0000040"/>
        <rdfs:subClassOf>
            <owl:Restriction>
                <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/BFO_0000050"/>
                <owl:someValuesFrom rdf:resource="http://www.ebi.ac.uk/efo/EFO_0004422"/>
            </owl:Restriction>
        </rdfs:subClassOf>
        <obo:IAO_0000115>An exon is a nucleic acid sequence that is represented in the mature form of an RNA molecule either after portions of a precursor RNA (introns) have been removed by cis-splicing or when two or more precursor RNA molecules have been ligated by trans-splicing.</obo:IAO_0000115>
        <oboInOwl:hasDbXref>NCIt:C13231</oboInOwl:hasDbXref>
        <oboInOwl:hasDbXref>SNOMEDCT:33091005</oboInOwl:hasDbXref>
        <oboInOwl:hasExactSynonym>exonic region</oboInOwl:hasExactSynonym>
        <rdfs:label>exon</rdfs:label>
    </owl:Class>

ecwood commented 3 months ago

EFO:0004423 is a subclass of material entity (BFO:0000040), along with several other similar terms. The same issue appears to affect other subclasses of material entity, such as enzyme:

{
  "iri": "http://purl.obolibrary.org/obo/OBI_0000427",
  "category_label": "protein",
  "deprecated": "False",
  "name": "enzyme",
  "description": "(protein or rna) or has_part (protein or rna) and has_function some GO:0003824 (catalytic activity); (protein or rna) or has_part (protein or rna) and has_function some GO:0003824 (catalytic activity)",
  "provided_by": "['infores:efo', 'infores:genepio']",
  "id": "OBI:0000427",
  "category": "biolink:Protein",
  "update_date": "2024-02-21 01:39:56 GMT"
}
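
The direct subclasses of material entity can also be listed straight from the OWL file. The sketch below is an illustration under stated assumptions, not the KG2 ETL code: `direct_subclasses` is a hypothetical helper, the `SAMPLE` document is an abbreviated copy of the EFO:0004423 class shown earlier wrapped in a standard `rdf:RDF` element, and only asserted named superclasses are considered (no reasoning over restrictions or transitive ancestors).

```python
import xml.etree.ElementTree as ET

NS = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}
RDF = "{%s}" % NS["rdf"]
MATERIAL_ENTITY = "http://purl.obolibrary.org/obo/BFO_0000040"

# Abbreviated copy of the efo.owl term quoted in this thread.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
    <owl:Class rdf:about="http://www.ebi.ac.uk/efo/EFO_0004423">
        <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/BFO_0000040"/>
        <rdfs:subClassOf>
            <owl:Restriction>
                <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/BFO_0000050"/>
                <owl:someValuesFrom rdf:resource="http://www.ebi.ac.uk/efo/EFO_0004422"/>
            </owl:Restriction>
        </rdfs:subClassOf>
        <rdfs:label>exon</rdfs:label>
    </owl:Class>
</rdf:RDF>"""


def direct_subclasses(owl_xml, parent_iri):
    """Return (iri, label) for each named class asserted as a direct subclass."""
    root = ET.fromstring(owl_xml)
    found = []
    # Only top-level owl:Class elements; anonymous nested classes are skipped.
    for cls in root.findall("owl:Class", NS):
        iri = cls.get(RDF + "about")
        if iri is None:
            continue
        for sup in cls.findall("rdfs:subClassOf", NS):
            # Restrictions have no rdf:resource attribute, so they never match.
            if sup.get(RDF + "resource") == parent_iri:
                label = cls.findtext("rdfs:label", default="", namespaces=NS)
                found.append((iri, label))
                break
    return found


print(direct_subclasses(SAMPLE, MATERIAL_ENTITY))
```

For the real efo.owl, the file contents would be passed in place of `SAMPLE`; a file that size may be better handled with a streaming parser.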

These are all of the subclasses of material entity: [image]

Running

MATCH (n)
WHERE n.iri IN [
  "http://purl.obolibrary.org/obo/BTO_0002690",
  "http://www.ebi.ac.uk/efo/EFO_0004446",
  "http://purl.obolibrary.org/obo/BTO_0000214",
  "http://www.ebi.ac.uk/efo/EFO_0000324",
  "http://purl.obolibrary.org/obo/GO_0005575",
  "http://www.ebi.ac.uk/efo/EFO_0006794",
  "http://purl.obolibrary.org/obo/CHEBI_24431",
  "http://www.ebi.ac.uk/efo/EFO_0005066",
  "http://www.ebi.ac.uk/efo/EFO_0000469",
  "http://purl.obolibrary.org/obo/OBI_0000427",
  "http://www.ebi.ac.uk/efo/EFO_0004422",
  "http://www.ebi.ac.uk/efo/EFO_0004423",
  "http://purl.obolibrary.org/obo/SO_0000704",
  "http://www.ebi.ac.uk/efo/EFO_0004420",
  "http://www.ebi.ac.uk/efo/EFO_0000548",
  "http://www.ebi.ac.uk/efo/EFO_0005060",
  "http://purl.obolibrary.org/obo/OBI_0100026",
  "http://www.ebi.ac.uk/efo/EFO_0000635",
  "http://purl.obolibrary.org/obo/OBI_0000245",
  "http://purl.obolibrary.org/obo/MPATH_0",
  "http://www.ebi.ac.uk/efo/EFO_0000663",
  "http://purl.obolibrary.org/obo/OBI_0000181",
  "http://www.ebi.ac.uk/efo/EFO_0010579",
  "http://purl.obolibrary.org/obo/OBI_0100051",
  "http://www.ebi.ac.uk/efo/EFO_0004359",
  "http://purl.obolibrary.org/obo/BTO_0001384"
]
RETURN n.id, n.name, n.category, n.provided_by

on kg2endpoint-kg2-9-0.rtx.ai we get:

n.id n.name n.category n.provided_by
"GO:0005575" "cellular_component" "biolink:CellularComponent" "['infores:efo', 'infores:cl', 'infores:go-plus', 'infores:hpo', 'infores:mondo', 'infores:nbo', 'infores:pato', 'infores:pr', 'infores:uberon', 'infores:go']"
"CHEBI:24431" "chemical entity" "biolink:MolecularEntity" "['infores:efo', 'infores:chebi', 'infores:cl', 'infores:disease-ontology', 'infores:foodon', 'infores:genepio', 'infores:go-plus', 'infores:hpo', 'infores:mondo', 'infores:nbo', 'infores:pato', 'infores:pr', 'infores:uberon']"
"OBI:0100026" "organism" "biolink:PhysicalEntity" "['infores:efo', 'infores:foodon', 'infores:genepio', 'infores:go-plus', 'infores:pato', 'infores:pr', 'infores:ro']"
"SO:0000704" "gene" "biolink:Gene" "['infores:efo', 'infores:disease-ontology', 'infores:go-plus', 'infores:mondo', 'infores:pr', 'infores:uberon']"
"OBI:0100051" "specimen" "biolink:PhysicalEntity" "['infores:efo', 'infores:genepio']"
"EFO:0006794" "cerebrospinal fluid biomarker measurement" "biolink:InformationContentEntity" "['infores:efo']"
"EFO:0000635" "organism part" "biolink:AnatomicalEntity" "['infores:efo']"
"EFO:0000663" "pool" "biolink:PhysicalEntity" "['infores:efo']"
"EFO:0005060" "instrument part" "biolink:PhysicalEntity" "['infores:efo']"
"EFO:0005066" "collection of material" "biolink:MaterialSample" "['infores:efo']"
"BTO:0000214" "cell culture" "biolink:PhysicalEntity" "['infores:efo']"
"EFO:0004423" "exon" "biolink:Exon" "['infores:efo']"
"EFO:0004422" "exome" "biolink:PhysicalEntity" "['infores:efo']"
"EFO:0004420" "genome" "biolink:PhysicalEntity" "['infores:efo']"
"EFO:0004446" "biological macromolecule" "biolink:MolecularEntity" "['infores:efo']"
"EFO:0000324" "cell type" "biolink:Cell" "['infores:efo']"
"EFO:0000548" "instrument" "biolink:PhysicalEntity" "['infores:efo']"
"EFO:0000469" "environmental factor" "biolink:PhysicalEntity" "['infores:efo']"
"EFO:0010579" "proteome" "biolink:PhysicalEntity" "['infores:efo']"
"OBI:0000245" "organization" "biolink:PhysicalEntity" "['infores:efo', 'infores:foodon', 'infores:genepio']"
"MPATH:0" "pathological entity" "biolink:BiologicalEntity" "['infores:efo', 'infores:genepio', 'infores:hpo']"
"OBI:0000427" "enzyme" "biolink:Protein" "['infores:efo', 'infores:genepio']"
"BTO:0001384" "tissue culture" "biolink:PhysicalEntity" "['infores:efo']"
"EFO:0004359" "telomere" "biolink:PhysicalEntity" "['infores:efo']"
"OBI:0000181" "population" "biolink:PhysicalEntity" "['infores:efo', 'infores:genepio']"
"BTO:0002690" "biofilm" "biolink:PhysicalEntity" "['infores:efo']"

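Many of these category assignments seem to be problematic. For example, exon is the only one of these nodes assigned biolink:Exon, while closely related terms such as exome, genome, and telomere are assigned biolink:PhysicalEntity.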
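
To make the pattern in the query output easier to see, the rows can be grouped by category. This is a rough sketch only: `names_by_category` and `sources` are hypothetical helpers, and `rows` is a subset of the results copied from the output above. It also assumes that `provided_by` is stored as a stringified Python list, which is how it appears in the output.

```python
import ast
from collections import defaultdict

# (id, name, category, provided_by) rows copied from the query output (subset).
rows = [
    ("EFO:0004423", "exon", "biolink:Exon", "['infores:efo']"),
    ("EFO:0004422", "exome", "biolink:PhysicalEntity", "['infores:efo']"),
    ("EFO:0004420", "genome", "biolink:PhysicalEntity", "['infores:efo']"),
    ("EFO:0006794", "cerebrospinal fluid biomarker measurement",
     "biolink:InformationContentEntity", "['infores:efo']"),
    ("OBI:0000427", "enzyme", "biolink:Protein",
     "['infores:efo', 'infores:genepio']"),
    ("OBI:0000245", "organization", "biolink:PhysicalEntity",
     "['infores:efo', 'infores:foodon', 'infores:genepio']"),
]


def names_by_category(rows):
    """Group node names by their assigned category, preserving row order."""
    grouped = defaultdict(list)
    for _id, name, category, _provided_by in rows:
        grouped[category].append(name)
    return dict(grouped)


def sources(provided_by):
    """Parse the stringified provided_by list into a real list of strings."""
    return ast.literal_eval(provided_by)


for category, names in names_by_category(rows).items():
    print(category, names)
```

Grouping this way puts unrelated terms such as exome and organization side by side under biolink:PhysicalEntity, which may help when deciding which assignments to review first.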