Closed ecwood closed 11 months ago
There were 36894 cases of `Node has IAO:0100001 attribute but not owl:deprecated; setting deprecated=True` in KG2.8.3's `build-multi-ont-kg.log`.

After running

    grep "IAO:0100001" build-multi-ont-kg.log | wc -l

and

    grep "IAO:0100001" build-multi-ont-kg.log > iao_lines.txt

I ran the following script to break down the makeup of these cases:

    import json

    if __name__ == '__main__':
        sources = dict()
        curie_prefixes = dict()

        with open('iao_lines.txt') as file:
            for line in file:
                # Strip the log message and the ontology base IRIs, leaving "<source> <CURIE>"
                line = line.replace('Node has IAO:0100001 attribute but not owl:deprecated; setting deprecated=True: ', '').replace('[http://purl.obolibrary.org/obo/', '').replace(']', '').replace('[http://www.ebi.ac.uk/efo/', '')
                line = line.split(' ')
                source = line[0]
                curie_prefix = ((line[1]).split(':'))[0]

                # Tally one occurrence per source file and per CURIE prefix
                if source not in sources:
                    sources[source] = 0
                if curie_prefix not in curie_prefixes:
                    curie_prefixes[curie_prefix] = 0
                sources[source] += 1
                curie_prefixes[curie_prefix] += 1

        print("Sources:")
        print(json.dumps(sources, indent=4, sort_keys=True))
        print("Curie Prefixes:")
        print(json.dumps(curie_prefixes, indent=4, sort_keys=True))

This was the output:

    Sources:
    {
        "bspo.owl": 3,
        "chebi.owl": 18530,
        "cl.owl": 182,
        "ddanat.owl": 4,
        "doid.owl": 2,
        "efo.owl": 5621,
        "foodon.owl": 2017,
        "genepio.owl": 117,
        "go/extensions/go-plus.owl": 4377,
        "hp.owl": 276,
        "mi.owl": 2,
        "mondo.owl": 2084,
        "ncbitaxon/subsets/taxslim.owl": 76,
        "pato.owl": 102,
        "pr.owl": 2971,
        "ro.owl": 4,
        "uberon/ext.owl": 526
    }
    Curie Prefixes:
    {
        "BSPO": 3,
        "BTO": 6,
        "CHEBI": 18548,
        "CL": 179,
        "CP": 9,
        "DDANAT": 4,
        "DOID": 2,
        "ECTO": 2,
        "EFO": 1318,
        "FBbt": 1,
        "FOODON": 2020,
        "GENEPIO": 116,
        "GO": 4437,
        "GOREL": 4,
        "HANCESTRO": 1,
        "HP": 300,
        "MI": 2,
        "MONDO": 2079,
        "MP": 1,
        "NCBITaxon": 86,
        "OBA": 2,
        "OBO": 8,
        "ORPHANET": 4159,
        "PATO": 102,
        "PR": 2971,
        "RO": 5,
        "SO": 1,
        "UBERON": 528
    }

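The same tally can be written more compactly with `collections.Counter` and a regular expression. This is only a sketch, not the script that produced the numbers above: the log-line layout (`[ontology IRI] message: CURIE`) is assumed from the log excerpts in this thread, the `tally` helper is hypothetical, and the two sample lines are invented.

```python
import re
from collections import Counter

# Assumed layout of a log line: "[<ontology IRI>] <message>: <CURIE>"
LINE_RE = re.compile(r"^\[(?P<iri>[^\]]+)\] .*: (?P<curie>\S+)\s*$")

# Base IRIs that the original script strips from the source name.
BASE_IRIS = ("http://purl.obolibrary.org/obo/", "http://www.ebi.ac.uk/efo/")


def tally(lines):
    """Count log lines per source file and per CURIE prefix (hypothetical helper)."""
    sources, prefixes = Counter(), Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that do not follow the assumed layout
        source = m.group("iri")
        for base in BASE_IRIS:
            if source.startswith(base):
                source = source[len(base):]
                break
        sources[source] += 1
        prefixes[m.group("curie").split(":")[0]] += 1
    return sources, prefixes


# Invented sample lines; real input would come from open('iao_lines.txt').
MESSAGE = "Node has IAO:0100001 attribute but not owl:deprecated; setting deprecated=True"
sample = [
    f"[http://purl.obolibrary.org/obo/chebi.owl] {MESSAGE}: CHEBI:1\n",
    f"[http://www.ebi.ac.uk/efo/efo.owl] {MESSAGE}: ORPHANET:2\n",
]
sources, prefixes = tally(sample)
print(dict(sources))
print(dict(prefixes))
```

Anchoring the parse on the bracketed IRI and the trailing CURIE avoids depending on the exact message text, so the same helper could be pointed at the other "setting deprecated=True" messages as well.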
I added a bunch of print statements to `multi_ont_to_json_kg.py` to see what's going on and ran it on `umls-omim`. This is what seems to be the problem:
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:609449.0001
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:609449.0002
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:609449.0002
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:609449.0003
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:609449.0003
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:609449.0004
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:609449.0004
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:609449.0005
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:609449.0005
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:MTHU032579
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:MTHU032579
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:MTHU032575
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:MTHU032575
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:MTHU032576
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:MTHU032576
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:MTHU032577
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:MTHU032577
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:MTHU032578
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:MTHU032578
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600538
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600538
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600539
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600539
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:MTHU032573
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:MTHU032573
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:MTHU032574
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:MTHU032574
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600567
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600567
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600568
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600568
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600563
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600563
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600564
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600564
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600565
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600565
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600566
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600566
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:613350.0004
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:613350.0004
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:600560
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:600560
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete meta; setting deprecated=True: OMIM:613350.0005
[https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/OMIM] Node has obsolete regex in name but not owl:deprecated; setting deprecated=True: OMIM:613350.0005
There were 105,670 instances of "obsolete regex" and 105,584 instances of "obsolete meta" in the log file.
Based on these added lines:

    node_deprecated = node_meta.get('deprecated', False)
    kg2_util.log_message(message="Node has obsolete meta; setting deprecated=True",
                         ontology_name=iri_of_ontology,
                         node_curie_id=node_curie_id,
                         output_stream=sys.stderr)

and

    if REGEX_OBSOLETE.match(node_name) is not None:
        node_deprecated = True
        kg2_util.log_message(message="Node has obsolete regex in name but not owl:deprecated; setting deprecated=True",
                             ontology_name=iri_of_ontology,
                             node_curie_id=node_curie_id,
                             output_stream=sys.stderr)

Therefore, I suspect there is a parsing issue in creating `node_meta`.
I agree, 66% doesn't seem reasonable, intuitively. I think it may be time to audit the code in `multi_ont_to_json_kg.py` (or `kg2_util.py`?) that determines whether a node is deprecated.
I made a mistake when deciding what to log; the code should have been:

    node_deprecated = node_meta.get('deprecated', False)
    if node_meta.get('deprecated', False):
        kg2_util.log_message(message="Node has obsolete meta; setting deprecated=True",
                             ontology_name=iri_of_ontology,
                             node_curie_id=node_curie_id,
                             output_stream=sys.stderr)

for that assignment of `node_deprecated`.
After fixing that, there are no instances of "obsolete meta" in the output log.
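The difference between the two logging versions can be reproduced in isolation. This is a minimal sketch: `log_message` here is a stand-in that just records messages (it is not the real `kg2_util.log_message`), the two wrapper functions are hypothetical, and the `node_meta` dictionaries are invented examples.

```python
logged = []


def log_message(message, node_curie_id):
    """Stand-in logger (hypothetical); records instead of writing to stderr."""
    logged.append((message, node_curie_id))


def deprecated_unconditional_log(node_meta, node_curie_id):
    # First version: the message is logged for every node, whatever the flag says.
    node_deprecated = node_meta.get('deprecated', False)
    log_message("Node has obsolete meta; setting deprecated=True", node_curie_id)
    return node_deprecated


def deprecated_conditional_log(node_meta, node_curie_id):
    # Corrected version: log only when the metadata marks the node deprecated.
    node_deprecated = node_meta.get('deprecated', False)
    if node_deprecated:
        log_message("Node has obsolete meta; setting deprecated=True", node_curie_id)
    return node_deprecated


# Invented metadata: one live node and one deprecated node.
nodes = {"EX:1": {}, "EX:2": {"deprecated": True}}

for curie, meta in nodes.items():
    deprecated_unconditional_log(meta, curie)
unconditional_count = len(logged)

logged.clear()
for curie, meta in nodes.items():
    deprecated_conditional_log(meta, curie)
conditional_count = len(logged)

print(unconditional_count, conditional_count)
```

In both versions the returned `node_deprecated` value is the same; only the number of log lines changes, which is why the "obsolete meta" count in the log was misleading rather than evidence of a parsing problem.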
This seems to have started with #7 and https://github.com/RTXteam/RTX/issues/995.
Notably, the regex is not catching what we want:

    REGEX_OBSOLETE = re.compile("^obsolete|\(obsolete||obsolete$", re.IGNORECASE)

The `||` leaves an empty alternative, so the pattern matches the empty string and `REGEX_OBSOLETE.match(node_name)` succeeds for every node name. Adjusting it to:

    REGEX_OBSOLETE = re.compile("^obsolete|\(obsolete|obsolete$", re.IGNORECASE)

makes it so it no longer catches empty matches. This seems to fix the problem. I will test it on `GO`, `PR`, `DOID`, and `MONDO`, since those posed the original issue.
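To see the effect of the doubled `|` directly, here is a minimal sketch using only Python's standard `re` module. The two patterns are copied from above (written as raw strings); the helper `is_flagged` and the sample names are made up for illustration.

```python
import re

# Pattern before the fix: the "||" introduces an empty alternative,
# which matches the empty string at position 0 of any input.
BUGGY = re.compile(r"^obsolete|\(obsolete||obsolete$", re.IGNORECASE)

# Pattern after the fix: no empty alternative.
FIXED = re.compile(r"^obsolete|\(obsolete|obsolete$", re.IGNORECASE)


def is_flagged(pattern, node_name):
    """Mirror the check used above: pattern.match(node_name) is not None."""
    return pattern.match(node_name) is not None


# Hypothetical node names, chosen only for illustration.
names = ["NAT14 (human)", "obsolete sperm protein", "Obsolete term"]

for name in names:
    print(f"{name!r}: buggy={is_flagged(BUGGY, name)} fixed={is_flagged(FIXED, name)}")
```

With the buggy pattern every name is flagged, because the empty alternative always succeeds. One thing that may be worth keeping in mind when auditing this code: `match` only tests at the start of the string, so under `match` the `\(obsolete` and `obsolete$` alternatives can only fire when the name begins with them; `search` would be needed to find them later in the name.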
That seemed to work. Here's a snippet of the output with those sources. Most nodes are not deprecated but the obsolete ones are:

    {
        "category": "biolink:Protein",
        "category_label": "protein",
        "creation_date": null,
        "deprecated": true,
        "description": null,
        "full_name": null,
        "has_biological_sequence": null,
        "id": "PR:Q8TAD1",
        "iri": "http://purl.obolibrary.org/obo/PR_Q8TAD1",
        "name": "obsolete sperm protein associated with the nucleus on the X chromosome C (human)",
        "provided_by": [
            "OBO:pr.owl"
        ],
        "publications": [],
        "replaced_by": null,
        "synonym": [],
        "update_date": "2023-06-28 23:45:25 GMT"
    },
    {
        "category": "biolink:Gene",
        "category_label": "gene",
        "creation_date": null,
        "deprecated": false,
        "description": "A protein coding gene NAT14 in human. // COMMENTS: Category=external.",
        "full_name": null,
        "has_biological_sequence": null,
        "id": "HGNC:28918",
        "iri": "https://identifiers.org/hgnc:28918",
        "name": "NAT14 (human)",
        "provided_by": [
            "OBO:pr.owl"
        ],
        "publications": [],
        "replaced_by": null,
        "synonym": [],
        "update_date": "2023-06-28 23:45:25 GMT"
    },
    {
        "category": "biolink:Gene",
        "category_label": "gene",
        "creation_date": null,
        "deprecated": false,
        "description": "A protein coding gene JAZF1 in human. // COMMENTS: Category=external.",
        "full_name": null,
        "has_biological_sequence": null,
        "id": "HGNC:28917",
        "iri": "https://identifiers.org/hgnc:28917",
        "name": "JAZF1 (human)",
        "provided_by": [
            "OBO:pr.owl"
        ],
        "publications": [],
        "replaced_by": null,
        "synonym": [],
        "update_date": "2023-06-28 23:45:25 GMT"
    },

I'm going to commit the change and mark it for verification.
To confirm further, I ran the report script on that limited test:

    "sources": [
        "Biolink meta-model version downloaded:2023-07-05 20:34:45 GMT",
        "Online Mendelian Inheritance in Man version 2023_02_05",
        "Gene Ontology version http://purl.obolibrary.org/obo/go/releases/2023-06-11/extensions/go-plus.owl",
        "MONDO Disease Ontology version http://purl.obolibrary.org/obo/mondo/releases/2023-06-01/mondo.owl",
        "Disease Ontology version http://purl.obolibrary.org/obo/doid/releases/2023-05-31/doid.owl",
        "Protein Ontology version http://purl.obolibrary.org/obo/pr/68.0/pr.owl"
    ],

These were the deprecated node counts:

    "number_of_deprecated_nodes": {
        "OBO:go/extensions/go-plus.owl": 8274,
        "OBO:mondo.owl": 3294,
        "OBO:doid.owl": 2477,
        "OBO:pr.owl": 5406
    },

Source | New Test | Old Results | Improved? |
---|---|---|---|
Biolink Ontology | 0 | 916 | Yes |
OMIM | 0 | 101104 | Yes |
Gene Ontology | 8274 | 14674 | Yes |
MONDO | 3294 | 17287 | Yes |
Disease Ontology | 2477 | 14931 | Yes |
Protein Ontology | 5406 | 307078 | Yes |
This is a big improvement.
I reran the original commands:

    match (n) where n.deprecated="True" return n.provided_by, count(n) order by n.provided_by

and

    match (n) return n.provided_by, count(n) order by n.provided_by

n.provided_by | Deprecated Nodes | Total Nodes |
---|---|---|
['infores:bspo'] | 4 | 177 |
"['infores:chebi', 'infores:foodon']" | 2 | 422 |
"['infores:chebi', 'infores:genepio']" | 1 | 131 |
"['infores:chebi', 'infores:go-plus']" | 8 | 20941 |
"['infores:chebi', 'infores:hpo']" | 1 | 20 |
"['infores:chebi', 'infores:mondo']" | 1 | 24 |
['infores:chebi'] | 8590 | 157310 |
"['infores:cl', 'infores:ehdaa2', 'infores:mondo']" | 1 | 1 |
"['infores:cl', 'infores:ehdaa2']" | 5 | 16 |
"['infores:cl', 'infores:go-plus']" | 2 | 132 |
"['infores:cl', 'infores:mondo']" | 7 | 11 |
['infores:cl'] | 236 | 1070 |
['infores:dda'] | 4 | 70 |
"['infores:disease-ontology', 'infores:mondo', 'infores:ncbi-taxon']" | 14 | 587 |
"['infores:disease-ontology', 'infores:ncbi-taxon']" | 2 | 83 |
['infores:disease-ontology'] | 2476 | 14758 |
"['infores:efo', 'infores:chebi']" | 9 | 662 |
"['infores:efo', 'infores:cl', 'infores:hpo']" | 1 | 5 |
"['infores:efo', 'infores:cl']" | 6 | 154 |
"['infores:efo', 'infores:disease-ontology', 'infores:mondo', 'infores:ncbi-taxon']" | 5 | 47 |
"['infores:efo', 'infores:genepio', 'infores:hpo', 'infores:mondo', 'infores:hpo']" | 1 | 64 |
"['infores:efo', 'infores:go-plus', 'infores:go']" | 1 | 128 |
"['infores:efo', 'infores:go-plus', 'infores:hpo', 'infores:go']" | 1 | 8 |
"['infores:efo', 'infores:go-plus', 'infores:mondo', 'infores:go']" | 2 | 20 |
"['infores:efo', 'infores:go-plus', 'infores:mondo']" | 19 | 20 |
"['infores:efo', 'infores:go-plus']" | 14 | 40 |
"['infores:efo', 'infores:hpo', 'infores:hpo']" | 1 | 924 |
"['infores:efo', 'infores:hpo', 'infores:mondo']" | 1 | 1 |
"['infores:efo', 'infores:hpo']" | 12 | 14 |
"['infores:efo', 'infores:mondo', 'infores:ncbi-taxon']" | 3 | 18 |
"['infores:efo', 'infores:mondo']" | 259 | 9378 |
"['infores:efo', 'infores:ncbi-taxon']" | 9 | 914 |
"['infores:efo', 'infores:ordo']" | 4269 | 6175 |
"['infores:efo', 'infores:uberon']" | 5 | 219 |
['infores:efo'] | 1450 | 23250 |
['infores:ehdaa2'] | 5 | 2651 |
"['infores:foodon', 'infores:genepio', 'infores:mondo']" | 3 | 112 |
"['infores:foodon', 'infores:genepio']" | 6 | 1772 |
"['infores:foodon', 'infores:mondo']" | 12 | 134 |
"['infores:foodon', 'infores:ncbi-taxon']" | 1 | 4544 |
['infores:foodon'] | 2068 | 24692 |
"['infores:genepio', 'infores:mondo']" | 3 | 110 |
"['infores:genepio', 'infores:pato']" | 1 | 8 |
['infores:genepio'] | 139 | 3020 |
"['infores:go-plus', 'infores:go']" | 971 | 33967 |
"['infores:go-plus', 'infores:hpo', 'infores:go']" | 1 | 132 |
"['infores:go-plus', 'infores:hpo', 'infores:mondo', 'infores:go']" | 2 | 210 |
"['infores:go-plus', 'infores:hpo', 'infores:mondo']" | 1 | 1 |
"['infores:go-plus', 'infores:ino']" | 1 | 1 |
"['infores:go-plus', 'infores:mondo', 'infores:go']" | 63 | 824 |
"['infores:go-plus', 'infores:mondo', 'infores:ro']" | 1 | 1 |
"['infores:go-plus', 'infores:mondo']" | 142 | 154 |
"['infores:go-plus', 'infores:nbo']" | 2 | 2 |
"['infores:go-plus', 'infores:pr', 'infores:go']" | 1 | 129 |
['infores:go-plus'] | 7043 | 7871 |
"['infores:hl7-umls', 'infores:umls-metathesaurus']" | 2 | 2174 |
['infores:hl7-umls'] | 1 | 3945 |
"['infores:hpo', 'infores:hpo']" | 10 | 14377 |
"['infores:hpo', 'infores:mondo']" | 13 | 28 |
['infores:hpo'] | 345 | 1019 |
"['infores:ino', 'infores:mi']" | 3 | 107 |
"['infores:ino', 'infores:ro']" | 2 | 13 |
['infores:ino'] | 1 | 195 |
['infores:mi'] | 177 | 1531 |
"['infores:mondo', 'infores:ncbi-taxon', 'infores:pr']" | 1 | 2 |
"['infores:mondo', 'infores:ncbi-taxon']" | 11 | 316 |
"['infores:mondo', 'infores:pato']" | 6 | 40 |
"['infores:mondo', 'infores:ro']" | 1 | 55 |
"['infores:mondo', 'infores:uberon']" | 1 | 160 |
['infores:mondo'] | 3012 | 17685 |
['infores:nbo'] | 2 | 781 |
['infores:ncbi-taxon'] | 134 | 695391 |
['infores:ncit'] | 2 | 170725 |
['infores:ordo'] | 426 | 8109 |
['infores:pato'] | 996 | 2096 |
['infores:pr'] | 5406 | 302401 |
['infores:ro'] | 11 | 367 |
['infores:uberon'] | 1366 | 8663 |
"['infores:umls', 'infores:umls-metathesaurus']" | 2 | 194227 |
['infores:umls'] | 1 | 1661228 |
['infores:umls-metathesaurus'] | 5 | 158763 |
In total, only 49,825 of the 8,436,874 nodes in `KG2.8.4` are now marked as `deprecated`. This is only 0.59% of the entire graph. Since this issue is resolved, I am going to close it.
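As a quick sanity check on the percentage above, the fraction can be recomputed from the two node counts quoted in this comment. Nothing here goes beyond those figures; the variable names are just for illustration.

```python
# Node counts quoted above for KG2.8.4.
deprecated_nodes = 49_825
total_nodes = 8_436_874

# Share of the graph that is marked deprecated, as a percentage.
percent_deprecated = 100 * deprecated_nodes / total_nodes
print(f"{percent_deprecated:.2f}% of nodes are marked deprecated")
```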
As mentioned in an issue I cannot now find (but I thought was #129; discovered due to #212), 66.7% of the nodes in `KG2.8.3` are listed as deprecated (7589264 of the total 11367640). Almost all of these seem to be coming from UMLS and/or ontologies (i.e., sources brought in using `multi_ont_to_json_kg.py`). This is not good, because it makes it hard to differentiate actually deprecated nodes from other UMLS/ontology nodes. I suspect there is a significant bug in the UMLS/ontology import. Here is the breakdown by source with deprecated nodes. I put together the results of two Cypher queries:
and
I suspect the one node (for many of these sources) that is not deprecated is a symptom of