RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

Concerned with the Number of Deprecated Nodes in RTX-KG2 #315

Closed ecwood closed 11 months ago

ecwood commented 12 months ago

As mentioned in an issue I cannot now find (I thought it was #129; I discovered it due to #212), about 66.8% of the nodes in KG2.8.3 are listed as deprecated (7589264 of the total 11367640). Almost all of these seem to come from UMLS and/or the ontologies (i.e., sources brought in using multi_ont_to_json_kg.py). This is a problem because it makes it hard to distinguish genuinely deprecated nodes from other UMLS/ontology nodes. I suspect there is a significant bug in the UMLS/ontology import. Below is the breakdown of deprecated nodes by source, which I put together from the results of two Cypher queries:

match (n) where n.deprecated="True" return n.provided_by, count(n) order by n.provided_by

and

match (n) return n.provided_by, count(n) order by n.provided_by

I suspect the one node (for many of these sources) that is not deprecated is a symptom of

| Source | Deprecated Nodes | Total Nodes |
|---|---|---|
| ['infores:atc-codes-umls'] | 6443 | 6444 |
| ['infores:bfo'] | 50 | 51 |
| ['infores:biolink-ontology'] | 916 | 922 |
| ['infores:bspo'] | 177 | 178 |
| ['infores:chebi'] | 182279 | 182283 |
| ['infores:cl'] | 1765 | 1767 |
| ['infores:dda'] | 69 | 70 |
| ['infores:disease-ontology'] | 14931 | 14932 |
| ['infores:drugbank'] | 8369 | 14600 |
| ['infores:efo'] | 34334 | 34338 |
| ['infores:ehdaa2'] | 2651 | 2652 |
| ['infores:fma-obo'] | 659 | 660 |
| ['infores:fma-umls'] | 104527 | 104528 |
| ['infores:foodon'] | 27019 | 27031 |
| ['infores:genepio'] | 2877 | 3022 |
| ['infores:go'] | 44012 | 44013 |
| ['infores:go-plus'] | 14674 | 14695 |
| ['infores:hcp-codes-umls'] | 7154 | 7155 |
| ['infores:hgnc'] | 42541 | 42542 |
| ['infores:hl7-umls'] | 6251 | 6301 |
| ['infores:hpo'] | 18371 | 18375 |
| ['infores:icd10pcs-umls'] | 190989 | 190990 |
| ['infores:icd9cm-umls'] | 22412 | 22413 |
| ['infores:ino'] | 302 | 303 |
| ['infores:loinc-umls'] | 281890 | 281891 |
| ['infores:medlineplus'] | 303 | 347 |
| ['infores:medrt-umls'] | 36 | 39 |
| ['infores:mesh'] | 348769 | 348770 |
| ['infores:mi'] | 1530 | 1534 |
| ['infores:mondo'] | 17287 | 17320 |
| ['infores:nbo'] | 804 | 805 |
| ['infores:ncbi-gene', 'infores:pr'] | 18 | 18 |
| ['infores:ncbi-taxon'] | 1987092 | 1987093 |
| ['infores:ncit'] | 163637 | 163638 |
| ['infores:nddf-umls'] | 30904 | 30905 |
| ['infores:omim'] | 101104 | 101105 |
| ['infores:ordo'] | 8108 | 8109 |
| ['infores:pato'] | 2172 | 2183 |
| ['infores:pdq-umls'] | 13341 | 13342 |
| ['infores:pr'] | 307078 | 307079 |
| ['infores:psy-umls'] | 7968 | 7969 |
| ['infores:ro'] | 494 | 496 |
| ['infores:rxnorm'] | 106885 | 106886 |
| ['infores:uberon'] | 11952 | 11981 |
| ['infores:umls'] | 3361193 | 3361195 |
| ['infores:umls-metathesaurus'] | 73121 | 135507 |
| ['infores:vandf-umls'] | 29806 | 29807 |

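
For anyone who wants to redo the arithmetic, here is a minimal sketch (not part of the KG2 build code) that computes the deprecated share for a handful of rows copied from the table above and recomputes the overall share from the two totals quoted in this comment. The 99% cutoff and the helper name deprecated_share are assumptions made for illustration only.

```python
# Sketch: flag sources whose nodes are (almost) all marked deprecated.
# The (deprecated, total) pairs are copied from a few rows of the table above.
counts = {
    "infores:chebi": (182279, 182283),
    "infores:drugbank": (8369, 14600),
    "infores:go": (44012, 44013),
    "infores:umls-metathesaurus": (73121, 135507),
}


def deprecated_share(deprecated, total):
    """Return the fraction of nodes that are marked deprecated."""
    return deprecated / total


# The 0.99 cutoff is an arbitrary illustrative threshold, not a project rule.
suspicious = sorted(
    source
    for source, (deprecated, total) in counts.items()
    if deprecated_share(deprecated, total) > 0.99
)

# Overall share, from the graph-wide totals quoted in this comment.
overall = deprecated_share(7589264, 11367640)

print("Sources that look almost entirely deprecated:", suspicious)
print(f"Overall deprecated share: {overall:.1%}")
```

Sources such as DrugBank, where a real fraction of nodes is not deprecated, fall below the cutoff, which is consistent with the problem being specific to the UMLS/ontology import.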
ecwood commented 12 months ago

There were 36894 cases of

Node has IAO:0100001 attribute but not owl:deprecated; setting deprecated=True

in KG2.8.3's build-multi-ont-kg.log.

After running

grep "IAO:0100001" build-multi-ont-kg.log | wc -l

and

grep "IAO:0100001" build-multi-ont-kg.log > iao_lines.txt

I ran

import json
from collections import Counter

# Boilerplate stripped from each log line so that only "<source> <CURIE>" remains.
STRIP_STRINGS = [
    'Node has IAO:0100001 attribute but not owl:deprecated; setting deprecated=True: ',
    '[http://purl.obolibrary.org/obo/',
    ']',
    '[http://www.ebi.ac.uk/efo/',
]

if __name__ == '__main__':
    sources = Counter()
    curie_prefixes = Counter()
    with open('iao_lines.txt') as file:
        for line in file:
            for text in STRIP_STRINGS:
                line = line.replace(text, '')
            # What remains is the ontology file name followed by the node CURIE.
            fields = line.split(' ')
            sources[fields[0]] += 1
            curie_prefixes[fields[1].split(':')[0]] += 1

    print("Sources:")
    print(json.dumps(sources, indent=4, sort_keys=True))

    print("Curie Prefixes:")
    print(json.dumps(curie_prefixes, indent=4, sort_keys=True))

to break these cases down by source and CURIE prefix. This was the output:

Sources:
{
    "bspo.owl": 3,
    "chebi.owl": 18530,
    "cl.owl": 182,
    "ddanat.owl": 4,
    "doid.owl": 2,
    "efo.owl": 5621,
    "foodon.owl": 2017,
    "genepio.owl": 117,
    "go/extensions/go-plus.owl": 4377,
    "hp.owl": 276,
    "mi.owl": 2,
    "mondo.owl": 2084,
    "ncbitaxon/subsets/taxslim.owl": 76,
    "pato.owl": 102,
    "pr.owl": 2971,
    "ro.owl": 4,
    "uberon/ext.owl": 526
}
Curie Prefixes:
{
    "BSPO": 3,
    "BTO": 6,
    "CHEBI": 18548,
    "CL": 179,
    "CP": 9,
    "DDANAT": 4,
    "DOID": 2,
    "ECTO": 2,
    "EFO": 1318,
    "FBbt": 1,
    "FOODON": 2020,
    "GENEPIO": 116,
    "GO": 4437,
    "GOREL": 4,
    "HANCESTRO": 1,
    "HP": 300,
    "MI": 2,
    "MONDO": 2079,
    "MP": 1,
    "NCBITaxon": 86,
    "OBA": 2,
    "OBO": 8,
    "ORPHANET": 4159,
    "PATO": 102,
    "PR": 2971,
    "RO": 5,
    "SO": 1,
    "UBERON": 528
}
ecwood commented 12 months ago

I added a bunch of print statements to multi_ont_to_json_kg.py to see what's going on and ran it on umls-omim. This is what seems to be the problem:

[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:609449.0001
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:609449.0002
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:609449.0002
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:609449.0003
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:609449.0003
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:609449.0004
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:609449.0004
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:609449.0005
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:609449.0005
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:MTHU032579
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:MTHU032579
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:MTHU032575
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:MTHU032575
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:MTHU032576
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:MTHU032576
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:MTHU032577
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:MTHU032577
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:MTHU032578
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:MTHU032578
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600538
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600538
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600539
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600539
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:MTHU032573
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:MTHU032573
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:MTHU032574
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:MTHU032574
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600567
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600567
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600568
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600568
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600563
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600563
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600564
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600564
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600565
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600565
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600566
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600566
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:613350.0004
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:613350.0004
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600560
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600560
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:613350.0005
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:613350.0005

There were 105,670 instances of "obsolete regex" and 105,584 instances of "obsolete meta" in the log file.
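
A tally like this can be reproduced with a short script. The following is only a sketch: it assumes every relevant log line has the shape "[ontology IRI] message: CURIE", as in the excerpt above, and the names LINE_RE and tally_messages are made up for this example rather than taken from the KG2 code.

```python
import re
from collections import Counter

# Assumed line shape, based on the excerpt above:
#   [<ontology IRI>] <message>: <CURIE>
LINE_RE = re.compile(r"^\[(?P<iri>[^\]]+)\] (?P<message>.+): (?P<curie>\S+)$")


def tally_messages(lines):
    """Count log lines per message text, skipping lines that do not fit the shape."""
    counts = Counter()
    for line in lines:
        parsed = LINE_RE.match(line.strip())
        if parsed:
            counts[parsed.group("message")] += 1
    return counts


# Two lines copied from the excerpt above, plus one unrelated line.
sample = [
    "[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] "
    "Node has obsolete meta; setting deprecated=True: OMIM:609449.0002",
    "[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] "
    "Node has obsolete regex in name but not owl:deprecated; "
    "setting deprecated=True: OMIM:609449.0002",
    "some unrelated log line",
]

tally = tally_messages(sample)
for message, count in sorted(tally.items()):
    print(count, message)
```

To run it on a real log, the sample list could be replaced with an open file handle, since the function only iterates over lines.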

Based on these added lines:

                node_deprecated = node_meta.get('deprecated', False)
                kg2_util.log_message(message="Node has obsolete meta; setting deprecated=True",
                                     ontology_name=iri_of_ontology,
                                     node_curie_id=node_curie_id,
                                     output_stream=sys.stderr)

and

                if REGEX_OBSOLETE.match(node_name) is not None:
                    node_deprecated = True
                    kg2_util.log_message(message="Node has obsolete regex in name but not owl:deprecated; setting deprecated=True",
                                         ontology_name=iri_of_ontology,
                                         node_curie_id=node_curie_id,
                                         output_stream=sys.stderr)

Therefore, I suspect there is a parsing issue in how node_meta is created.

saramsey commented 12 months ago

I agree, 66% doesn't seem reasonable, intuitively. I think it may be time to audit the code in multi_ont_to_json_kg.py (or kg2_util.py?) that determines whether a node is deprecated.

ecwood commented 12 months ago

I made a mistake when deciding what to log; the code should have been:

                node_deprecated = node_meta.get('deprecated', False)
                # Log only when the metadata actually marks the node as deprecated.
                if node_deprecated:
                    kg2_util.log_message(message="Node has obsolete meta; setting deprecated=True",
                                         ontology_name=iri_of_ontology,
                                         node_curie_id=node_curie_id,
                                         output_stream=sys.stderr)

for that assignment of node_deprecated.

After fixing that, there are no instances of "obsolete meta" in the output log.

ecwood commented 12 months ago

This seems to have started with #7 and https://github.com/RTXteam/RTX/issues/995.

Notably, the regex is matching far more than we want:

REGEX_OBSOLETE = re.compile("^obsolete|\(obsolete||obsolete$", re.IGNORECASE)

since the "||" leaves an empty alternative, which matches at the start of any node name.

Adjusting it to:

REGEX_OBSOLETE = re.compile("^obsolete|\(obsolete|obsolete$", re.IGNORECASE)

removes the empty alternative, so it no longer produces empty matches. This seems to fix the problem. I will test it on GO, PR, DOID, and MONDO, since those prompted the original issue.
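
To make the difference concrete, here is a small standalone sketch comparing the two patterns on two node names that appear elsewhere in this thread. The patterns are the ones quoted above, written here as raw strings so that "\(" is not treated as a string escape. The helper name flagged is an assumption for this example.

```python
import re

# The two patterns quoted above, written as raw strings.
BUGGY = re.compile(r"^obsolete|\(obsolete||obsolete$", re.IGNORECASE)
FIXED = re.compile(r"^obsolete|\(obsolete|obsolete$", re.IGNORECASE)

names = [
    "NAT14 (human)",
    "obsolete sperm protein associated with the nucleus on the X chromosome C (human)",
]


def flagged(pattern, name):
    """Mirror the check in the quoted code: a node is flagged if match() succeeds."""
    return pattern.match(name) is not None


for name in names:
    print(f"{name[:30]!r}: buggy={flagged(BUGGY, name)} fixed={flagged(FIXED, name)}")

# With the buggy pattern, the match on a non-obsolete name is the empty string.
print(repr(BUGGY.match("NAT14 (human)").group(0)))
```

Because re.match only tries the pattern at the start of the string, this mirrors the REGEX_OBSOLETE.match(node_name) call quoted earlier in the thread: the empty alternative always succeeds at position 0, so the buggy pattern flags every name.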

ecwood commented 12 months ago

That seemed to work. Here's a snippet of the output with those sources. Most nodes are not deprecated but the obsolete ones are:

        {
            "category": "biolink:Protein",
            "category_label": "protein",
            "creation_date": null,
            "deprecated": true,
            "description": null,
            "full_name": null,
            "has_biological_sequence": null,
            "id": "PR:Q8TAD1",
            "iri": "http://purl.obolibrary.org/obo/PR_Q8TAD1",
            "name": "obsolete sperm protein associated with the nucleus on the X chromosome C (human)",
            "provided_by": [
                "OBO:pr.owl"
            ],
            "publications": [],
            "replaced_by": null,
            "synonym": [],
            "update_date": "2023-06-28 23:45:25 GMT"
        },
        {
            "category": "biolink:Gene",
            "category_label": "gene",
            "creation_date": null,
            "deprecated": false,
            "description": "A protein coding gene NAT14 in human. // COMMENTS: Category=external.",
            "full_name": null,
            "has_biological_sequence": null,
            "id": "HGNC:28918",
            "iri": "https://identifiers.org/hgnc:28918",
            "name": "NAT14 (human)",
            "provided_by": [
                "OBO:pr.owl"
            ],
            "publications": [],
            "replaced_by": null,
            "synonym": [],
            "update_date": "2023-06-28 23:45:25 GMT"
        },
        {
            "category": "biolink:Gene",
            "category_label": "gene",
            "creation_date": null,
            "deprecated": false,
            "description": "A protein coding gene JAZF1 in human. // COMMENTS: Category=external.",
            "full_name": null,
            "has_biological_sequence": null,
            "id": "HGNC:28917",
            "iri": "https://identifiers.org/hgnc:28917",
            "name": "JAZF1 (human)",
            "provided_by": [
                "OBO:pr.owl"
            ],
            "publications": [],
            "replaced_by": null,
            "synonym": [],
            "update_date": "2023-06-28 23:45:25 GMT"
        },

I'm going to commit the change and mark it for verification.

ecwood commented 12 months ago

To confirm further, I ran the report script on that limited test:

    "sources": [
        "Biolink meta-model version downloaded:2023-07-05 20:34:45 GMT",
        "Online Mendelian Inheritance in Man version 2023_02_05",
        "Gene Ontology version http://purl.obolibrary.org/obo/go/releases/2023-06-11/extensions/go-plus.owl",
        "MONDO Disease Ontology version http://purl.obolibrary.org/obo/mondo/releases/2023-06-01/mondo.owl",
        "Disease Ontology version http://purl.obolibrary.org/obo/doid/releases/2023-05-31/doid.owl",
        "Protein Ontology version http://purl.obolibrary.org/obo/pr/68.0/pr.owl"
    ],

These were the deprecated node counts:

    "number_of_deprecated_nodes": {
        "OBO:go/extensions/go-plus.owl": 8274,
        "OBO:mondo.owl": 3294,
        "OBO:doid.owl": 2477,
        "OBO:pr.owl": 5406
    },

| Source | New Test | Old Results | Improved? |
|---|---|---|---|
| Biolink Ontology | 0 | 916 | Yes |
| OMIM | 0 | 101104 | Yes |
| Gene Ontology | 8274 | 14674 | Yes |
| MONDO | 3294 | 17287 | Yes |
| Disease Ontology | 2477 | 14931 | Yes |
| Protein Ontology | 5406 | 307078 | Yes |

This is a big improvement.
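
To put a number on the improvement, here is a quick sketch that computes the relative drop in deprecated nodes for each source in the table above. The counts are copied from that table; the helper name reduction is an assumption for this example.

```python
# (new, old) deprecated-node counts copied from the table above.
results = {
    "Biolink Ontology": (0, 916),
    "OMIM": (0, 101104),
    "Gene Ontology": (8274, 14674),
    "MONDO": (3294, 17287),
    "Disease Ontology": (2477, 14931),
    "Protein Ontology": (5406, 307078),
}


def reduction(new, old):
    """Return the fraction by which the deprecated-node count dropped."""
    return (old - new) / old


reductions = {source: reduction(new, old) for source, (new, old) in results.items()}

for source, value in reductions.items():
    print(f"{source}: {value:.1%} fewer deprecated nodes")
```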

ecwood commented 11 months ago

I reran the original commands:

match (n) where n.deprecated="True" return n.provided_by, count(n) order by n.provided_by

and

match (n) return n.provided_by, count(n) order by n.provided_by

| n.provided_by | Deprecated Nodes | Total Nodes |
|---|---|---|
| ['infores:bspo'] | 4 | 177 |
| ['infores:chebi', 'infores:foodon'] | 2 | 422 |
| ['infores:chebi', 'infores:genepio'] | 1 | 131 |
| ['infores:chebi', 'infores:go-plus'] | 8 | 20941 |
| ['infores:chebi', 'infores:hpo'] | 1 | 20 |
| ['infores:chebi', 'infores:mondo'] | 1 | 24 |
| ['infores:chebi'] | 8590 | 157310 |
| ['infores:cl', 'infores:ehdaa2', 'infores:mondo'] | 1 | 1 |
| ['infores:cl', 'infores:ehdaa2'] | 5 | 16 |
| ['infores:cl', 'infores:go-plus'] | 2 | 132 |
| ['infores:cl', 'infores:mondo'] | 7 | 11 |
| ['infores:cl'] | 236 | 1070 |
| ['infores:dda'] | 4 | 70 |
| ['infores:disease-ontology', 'infores:mondo', 'infores:ncbi-taxon'] | 14 | 587 |
| ['infores:disease-ontology', 'infores:ncbi-taxon'] | 2 | 83 |
| ['infores:disease-ontology'] | 2476 | 14758 |
| ['infores:efo', 'infores:chebi'] | 9 | 662 |
| ['infores:efo', 'infores:cl', 'infores:hpo'] | 1 | 5 |
| ['infores:efo', 'infores:cl'] | 6 | 154 |
| ['infores:efo', 'infores:disease-ontology', 'infores:mondo', 'infores:ncbi-taxon'] | 5 | 47 |
| ['infores:efo', 'infores:genepio', 'infores:hpo', 'infores:mondo', 'infores:hpo'] | 1 | 64 |
| ['infores:efo', 'infores:go-plus', 'infores:go'] | 1 | 128 |
| ['infores:efo', 'infores:go-plus', 'infores:hpo', 'infores:go'] | 1 | 8 |
| ['infores:efo', 'infores:go-plus', 'infores:mondo', 'infores:go'] | 2 | 20 |
| ['infores:efo', 'infores:go-plus', 'infores:mondo'] | 19 | 20 |
| ['infores:efo', 'infores:go-plus'] | 14 | 40 |
| ['infores:efo', 'infores:hpo', 'infores:hpo'] | 1 | 924 |
| ['infores:efo', 'infores:hpo', 'infores:mondo'] | 1 | 1 |
| ['infores:efo', 'infores:hpo'] | 12 | 14 |
| ['infores:efo', 'infores:mondo', 'infores:ncbi-taxon'] | 3 | 18 |
| ['infores:efo', 'infores:mondo'] | 259 | 9378 |
| ['infores:efo', 'infores:ncbi-taxon'] | 9 | 914 |
| ['infores:efo', 'infores:ordo'] | 4269 | 6175 |
| ['infores:efo', 'infores:uberon'] | 5 | 219 |
| ['infores:efo'] | 1450 | 23250 |
| ['infores:ehdaa2'] | 5 | 2651 |
| ['infores:foodon', 'infores:genepio', 'infores:mondo'] | 3 | 112 |
| ['infores:foodon', 'infores:genepio'] | 6 | 1772 |
| ['infores:foodon', 'infores:mondo'] | 12 | 134 |
| ['infores:foodon', 'infores:ncbi-taxon'] | 1 | 4544 |
| ['infores:foodon'] | 2068 | 24692 |
| ['infores:genepio', 'infores:mondo'] | 3 | 110 |
| ['infores:genepio', 'infores:pato'] | 1 | 8 |
| ['infores:genepio'] | 139 | 3020 |
| ['infores:go-plus', 'infores:go'] | 971 | 33967 |
| ['infores:go-plus', 'infores:hpo', 'infores:go'] | 1 | 132 |
| ['infores:go-plus', 'infores:hpo', 'infores:mondo', 'infores:go'] | 2 | 210 |
| ['infores:go-plus', 'infores:hpo', 'infores:mondo'] | 1 | 1 |
| ['infores:go-plus', 'infores:ino'] | 1 | 1 |
| ['infores:go-plus', 'infores:mondo', 'infores:go'] | 63 | 824 |
| ['infores:go-plus', 'infores:mondo', 'infores:ro'] | 1 | 1 |
| ['infores:go-plus', 'infores:mondo'] | 142 | 154 |
| ['infores:go-plus', 'infores:nbo'] | 2 | 2 |
| ['infores:go-plus', 'infores:pr', 'infores:go'] | 1 | 129 |
| ['infores:go-plus'] | 7043 | 7871 |
| ['infores:hl7-umls', 'infores:umls-metathesaurus'] | 2 | 2174 |
| ['infores:hl7-umls'] | 1 | 3945 |
| ['infores:hpo', 'infores:hpo'] | 10 | 14377 |
| ['infores:hpo', 'infores:mondo'] | 13 | 28 |
| ['infores:hpo'] | 345 | 1019 |
| ['infores:ino', 'infores:mi'] | 3 | 107 |
| ['infores:ino', 'infores:ro'] | 2 | 13 |
| ['infores:ino'] | 1 | 195 |
| ['infores:mi'] | 177 | 1531 |
| ['infores:mondo', 'infores:ncbi-taxon', 'infores:pr'] | 1 | 2 |
| ['infores:mondo', 'infores:ncbi-taxon'] | 11 | 316 |
| ['infores:mondo', 'infores:pato'] | 6 | 40 |
| ['infores:mondo', 'infores:ro'] | 1 | 55 |
| ['infores:mondo', 'infores:uberon'] | 1 | 160 |
| ['infores:mondo'] | 3012 | 17685 |
| ['infores:nbo'] | 2 | 781 |
| ['infores:ncbi-taxon'] | 134 | 695391 |
| ['infores:ncit'] | 2 | 170725 |
| ['infores:ordo'] | 426 | 8109 |
| ['infores:pato'] | 996 | 2096 |
| ['infores:pr'] | 5406 | 302401 |
| ['infores:ro'] | 11 | 367 |
| ['infores:uberon'] | 1366 | 8663 |
| ['infores:umls', 'infores:umls-metathesaurus'] | 2 | 194227 |
| ['infores:umls'] | 1 | 1661228 |
| ['infores:umls-metathesaurus'] | 5 | 158763 |

In total, only 49,825 of the 8,436,874 nodes in KG2.8.4 are now marked as deprecated, which is just 0.59% of the entire graph. Since the problem is resolved, I am going to close this issue.