Open chunyuma opened 3 years ago
This is certainly related to the ontobio issue. The owl:DatatypeProperty's are being stored as nodes. From umls-hgnc.ttl, there is this line, matching your example above:
umls-hgnc.ttl-<http://purl.bioontology.org/ontology/HGNC/PMID> a owl:DatatypeProperty ;
umls-hgnc.ttl: rdfs:label """Pubmed ID""";
umls-hgnc.ttl: rdfs:comment """Pubmed ID""" .
Here's this from umls-ncbi.ttl:
<http://purl.bioontology.org/ontology/NCBITAXON/RANK> a owl:DatatypeProperty ;
rdfs:label """RANK""";
rdfs:comment """NCBI Rank (e.g. RANK[NCBI]species)""" .
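One quick way to see which IRIs a Turtle file declares as owl:DatatypeProperty is to scan for the `<IRI> a owl:DatatypeProperty` pattern shown in the two snippets above. The following is a minimal sketch, not the KG2 build code: the function name is hypothetical, and the regular expression assumes the declaration sits on one line with the full IRI in angle brackets, as in these UMLS-derived files. A real RDF parser would be more robust.

```python
import re

# Matches lines such as:
#   <http://purl.bioontology.org/ontology/HGNC/PMID> a owl:DatatypeProperty ;
# Assumption: full IRI in angle brackets, declaration on a single line.
DATATYPE_PROPERTY_RE = re.compile(r"<([^>]+)>\s+a\s+owl:DatatypeProperty\b")


def find_datatype_property_iris(turtle_text: str) -> list:
    """Return the IRIs declared as owl:DatatypeProperty, in order of appearance."""
    return DATATYPE_PROPERTY_RE.findall(turtle_text)


# The two declarations quoted above, used as sample input.
sample = '''
<http://purl.bioontology.org/ontology/HGNC/PMID> a owl:DatatypeProperty ;
    rdfs:label """Pubmed ID""";
    rdfs:comment """Pubmed ID""" .
<http://purl.bioontology.org/ontology/NCBITAXON/RANK> a owl:DatatypeProperty ;
    rdfs:label """RANK""";
    rdfs:comment """NCBI Rank (e.g. RANK[NCBI]species)""" .
'''

for iri in find_datatype_property_iris(sample):
    print(iri)
```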
However, with some of these, it seems to be an issue with the data itself:
taxslim.owl- <!-- http://purl.obolibrary.org/obo/ncbitaxon#genbank_common_name -->
taxslim.owl-
taxslim.owl- <owl:AnnotationProperty rdf:about="http://purl.obolibrary.org/obo/ncbitaxon#genbank_common_name">
taxslim.owl: <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">genbank common name</rdfs:label>
taxslim.owl- <rdfs:subPropertyOf rdf:resource="http://www.geneontology.org/formats/oboInOwl#SynonymTypeProperty"/>
taxslim.owl- </owl:AnnotationProperty>
taxslim.owl: <!-- http://purl.obolibrary.org/obo/NCBITaxon_superkingdom -->
taxslim.owl-
taxslim.owl: <owl:Class rdf:about="http://purl.obolibrary.org/obo/NCBITaxon_superkingdom">
taxslim.owl- <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/NCBITaxon#_taxonomic_rank"/>
taxslim.owl- <oboInOwl:hasOBONamespace rdf:datatype="http://www.w3.org/2001/XMLSchema#string">ncbi_taxonomy</oboInOwl:hasOBONamespace>
taxslim.owl: <oboInOwl:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">NCBITaxon:superkingdom</oboInOwl:id>
taxslim.owl: <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">superkingdom</rdfs:label>
taxslim.owl- </owl:Class>
This is also from taxslim.owl:
<!-- http://purl.obolibrary.org/obo/NCBITaxon_subfamily -->
<owl:Class rdf:about="http://purl.obolibrary.org/obo/NCBITaxon_subfamily">
<rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/NCBITaxon#_taxonomic_rank"/>
<oboInOwl:hasOBONamespace rdf:datatype="http://www.w3.org/2001/XMLSchema#string">ncbi_taxonomy</oboInOwl:hasOBONamespace>
<oboInOwl:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">NCBITaxon:subfamily</oboInOwl:id>
<rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">subfamily</rdfs:label>
</owl:Class>
This is from umls-nci.ttl:
<http://purl.bioontology.org/ontology/NCI/TARGET> a owl:Class ;
skos:prefLabel """t(8;21)"""@en ;
skos:notation """TARGET"""^^xsd:string ;
skos:altLabel """Hyperdiploid; Status of 4 and 10 Unknown"""@en , """iAMP21"""@en , """inv(16)"""@en ;
UMLS:has_cui """C3897139"""^^xsd:string ;
UMLS:has_cui """C3897144"""^^xsd:string ;
UMLS:has_cui """C4086503"""^^xsd:string ;
UMLS:has_cui """C4086524"""^^xsd:string ;
UMLS:has_tui """T049"""^^xsd:string ;
UMLS:has_sty <http://purl.bioontology.org/ontology/STY/T049> ;
@saramsey Do you draw the same conclusion?
It certainly looks like owl:DatatypeProperty's are being added as nodes (from umls-nci.ttl):
<http://purl.bioontology.org/ontology/NCI/GENE_ENCODES_PRODUCT> a owl:DatatypeProperty ;
rdfs:label """GENE ENCODES PRODUCT""";
rdfs:comment """Gene Encodes Product""" .
From Neo4j:
{
"iri": "https://identifiers.org/ncit:GENE_ENCODES_PRODUCT",
"category_label": "information_content_entity",
"deprecated": "False",
"name": "Gene encodes product",
"description": "COMMENTS: Gene Encodes Product",
"provided_by": "UMLS_STY:",
"id": "NCIT:GENE_ENCODES_PRODUCT",
"category": "biolink:InformationContentEntity",
"update_date": "2019"
}
The IRI doesn't resolve:
INVALID resolution request for 'ncit:GENE_ENCODES_PRODUCT', due to 'Resolution request 'ncit:GENE_ENCODES_PRODUCT' is NOT ABOUT A NAMESPACE; For namespace 'ncit', provided local ID 'GENE_ENCODES_PRODUCT' DOES NOT MATCH local IDs definition pattern '^C\d+$''
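The resolver error comes down to a regular-expression check on the local ID. Here is a minimal sketch of that check, using only the `^C\d+$` pattern that the error message quotes for the ncit namespace. The dictionary and function names are hypothetical, and `C12345` in the usage lines is a made-up local ID used only to show a passing case.

```python
import re

# Local-ID pattern per namespace. Only the ncit entry comes from the error
# message above; the dictionary itself is a hypothetical illustration.
LOCAL_ID_PATTERNS = {"ncit": re.compile(r"^C\d+$")}


def local_id_is_valid(curie: str) -> bool:
    """Check a CURIE's local ID against its namespace pattern, if one is known."""
    prefix, _, local_id = curie.partition(":")
    pattern = LOCAL_ID_PATTERNS.get(prefix.lower())
    if pattern is None:
        raise KeyError(f"no local-ID pattern known for prefix {prefix!r}")
    return pattern.match(local_id) is not None


print(local_id_is_valid("ncit:GENE_ENCODES_PRODUCT"))  # False
print(local_id_is_valid("NCIT:C12345"))  # True (C12345 is a made-up ID)
```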
In KG2.5.2C, it is clustered with "OBI:0001617" which is classified as 'biolink:Gene' but has name 'Pubmed ID'
This is because of https://github.com/RTXteam/RTX/blob/adb30783fbd7ae09d86b01c5031aa52bb113b1a1/code/kg2/curies-to-categories.yaml#L14 which classifies everything in HGNC as a gene.
Thanks @ericawood!
Besides the node type I mentioned above, the node types below might also have this problem:
(I found them with the following Cypher query: match (n) where not (split(n.id,":")[1] contains "0" or split(n.id,":")[1] contains "1" or split(n.id,":")[1] contains "2" or split(n.id,":")[1] contains "3" or split(n.id,":")[1] contains "4" or split(n.id,":")[1] contains "5" or split(n.id,":")[1] contains "6" or split(n.id,":")[1] contains "7" or split(n.id,":")[1] contains "8" or split(n.id,":")[1] contains "9" ) return distinct n.category
. It is based on my suspicion that CURIE IDs whose local part contains no digits, i.e. a source prefix not followed by a numeric value, might be invalid.)
"biolink:InformationContentEntity"
"biolink:OntologyClass"
"biolink:NamedThing"
"biolink:ChemicalSubstance"
"biolink:Procedure"
"biolink:AnatomicalEntity"
"biolink:BiologicalEntity"
"biolink:Device"
"biolink:PhenotypicFeature"
"biolink:PhysicalEntity"
For example:
n.id | n.name | n.category |
---|---|---|
"OMIM:has_phenotype" | "Has phenotype" | "biolink:PhenotypicFeature" |
"OMIM:has_allelic_variant" | "Has allelic variant" | "biolink:PhenotypicFeature" |
"OMIM:has_manifestation" | "Has manifestation" | "biolink:PhenotypicFeature" |
"OMIM:MIMTYPE" | "OMIM Entry Type" | "biolink:PhenotypicFeature" |
"OMIM:MIMTYPEMEANING" | "Mimtypemeaning" | "biolink:PhenotypicFeature" |
"OMIM:GENELOCUS" | "Gene Locus" | "biolink:PhenotypicFeature" |
"OMIM:manifestation_of" | "Manifestation of" | "biolink:PhenotypicFeature" |
"OMIM:MIMTYPEVALUE" | "OMIM MimType Value" | "biolink:PhenotypicFeature" |
"OMIM:MOVED_FROM" | "Moved from" | "biolink:PhenotypicFeature" |
"OMIM:has_inheritance_type" | "Has inheritance type" | "biolink:PhenotypicFeature" |
"OMIM:allelic_variant_of" | "Allelic variant of" | "biolink:PhenotypicFeature" |
"OMIM:GENESYMBOL" | "Gene Symbol" | "biolink:PhenotypicFeature" |
"OMIM:phenotype_of" | "Phenotype of" | "biolink:PhenotypicFeature" |
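The digit-free heuristic behind the query above can also be applied outside Neo4j, for example to a list of node IDs exported from a build. This is a minimal sketch with a hypothetical function name. It only mirrors the heuristic; a flagged ID is a candidate for review, not proof that the CURIE is invalid.

```python
def local_id_has_no_digit(curie: str) -> bool:
    """True if the part after the first ':' contains no digit 0-9."""
    parts = curie.split(":")
    # Mirrors split(n.id, ":")[1] in the Cypher query; IDs without ':' are skipped.
    if len(parts) < 2:
        return False
    return not any(ch in "0123456789" for ch in parts[1])


# IDs taken from this thread.
ids = ["OMIM:MIMTYPE", "OMIM:has_phenotype", "OBI:0001617", "HGNC:PMID"]
suspicious = [i for i in ids if local_id_has_no_digit(i)]
print(suspicious)
```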
I'm not sure whether the following CURIE IDs are normal for biolink:AnatomicalEntity, but something like channel for or site_of looks like a predicate type.
n.id | n.name | n.category |
---|---|---|
"NCIT:SENTINEL" | "Nose Swab" | "biolink:AnatomicalEntity" |
"UBERON:channel_for" | "channel for" | "biolink:AnatomicalEntity" |
"UBERON:transitively_anteriorly_connected_to" | "transitively anteriorly connected to" | "biolink:AnatomicalEntity" |
"UBERON:conduit_for" | "conduit for" | "biolink:AnatomicalEntity" |
"UBERON:filtered_through" | "filtered through" | "biolink:AnatomicalEntity" |
"UBERON:trunk_part_of" | "trunk_part_of" | "biolink:AnatomicalEntity" |
"CL:LATIN" | "latin term" | "biolink:AnatomicalEntity" |
"UBERON:indirectly_supplies" | "indirectly_supplies" | "biolink:AnatomicalEntity" |
"UBERON:protects" | "protects" | "biolink:AnatomicalEntity" |
"UBERON:transitively_distally_connected_to" | "transitively distally connected to" | "biolink:AnatomicalEntity" |
"UBERON:synapsed_by" | "synapsed by" | "biolink:AnatomicalEntity" |
"UBERON:site_of" | "site_of" | "biolink:AnatomicalEntity" |
I concur with @ericawood that the node ID HGNC:PMID is probably an owl:DatatypeProperty that got turned into a node.
Now, in the case of UBERON:site_of, the issue there is just that the node has the wrong category. It should be biolink:RelationshipType. I am not sure if UBERON:site_of is occurring somewhere as an owl:DatatypeProperty (I don't think it should be a datatype property); I would need to do some checking to be sure.
I concur with @ericawood; categorizing everything with the CURIE prefix HGNC as biolink:Gene is problematic; see RTXteam/RTX#1170. I believe that @ericawood is working on a fix in which owl:DatatypeProperty associations can be read and understood. But in the case of HGNC:PMID, the issue is simply that it should not be a node in the first place, because it should probably be handled via the publications slot of the subject node for the owl:DatatypeProperty.
I think I've found a way (without having to rely on any fix to ontobio) to filter out these owl:DatatypeProperty nodes. You will see in the examples below that they are all categorized as "type": "PROPERTY". I will investigate more, but I wanted to post these findings. One potential problem with addressing this is that relation nodes (e.g. RO:0000053) will be filtered out. Is this a problem?
From OMIM:
{
"id" : "http://purl.bioontology.org/ontology/OMIM/MIMTYPE",
"meta" : {
"comments" : [ "OMIM Entry Type" ]
},
"type" : "PROPERTY",
"lbl" : "OMIM Entry Type"
}
{
"id" : "http://purl.bioontology.org/ontology/OMIM/MIMTYPEMEANING",
"meta" : {
"comments" : [ "OMIM MimType Meaning" ]
},
"type" : "PROPERTY",
"lbl" : "MIMTYPEMEANING"
}
From HGNC:
{
"id" : "http://purl.bioontology.org/ontology/HGNC/PMID",
"meta" : {
"comments" : [ "Pubmed ID" ]
},
"type" : "PROPERTY",
"lbl" : "Pubmed ID"
}
{
"id" : "https://identifiers.org/umls:has_sty",
"meta" : {
"comments" : [ "Semantic type UMLS property" ]
},
"type" : "PROPERTY",
"lbl" : "Semantic type UMLS property"
}
{
"id" : "http://purl.bioontology.org/ontology/HGNC/ENSEMBLGENE_ID",
"meta" : {
"comments" : [ "Ensembl gene ID" ]
},
"type" : "PROPERTY",
"lbl" : "Ensembl gene ID"
}
{
"id" : "http://purl.bioontology.org/ontology/HGNC/LOCUS_GROUP",
"meta" : {
"comments" : [ "Locus group" ]
},
"type" : "PROPERTY",
"lbl" : "Locus group"
}
From Uberon:
{
"id" : "http://purl.obolibrary.org/obo/uberon/core#indirectly_supplies",
"meta" : {
"definition" : {
"val" : "a indirectly_supplies s iff a has a branch and the branch supplies or indirectly supplies s",
"xrefs" : [ ]
},
"basicPropertyValues" : [ {
"pred" : "http://purl.obolibrary.org/obo/IAO_0000116",
"val" : "add to RO"
}, {
"pred" : "http://www.geneontology.org/formats/oboInOwl#hasOBONamespace",
"val" : "uberon"
} ]
},
"type" : "PROPERTY",
"lbl" : "indirectly_supplies"
}
{
"id" : "http://purl.obolibrary.org/obo/uberon/core#transitively_anteriorly_connected_to",
"meta" : {
"definition" : {
"val" : ".",
"xrefs" : [ "http://purl.obolibrary.org/obo/uberon/docs/Connectivity-Design-Pattern" ]
},
"basicPropertyValues" : [ {
"pred" : "http://www.geneontology.org/formats/oboInOwl#hasOBONamespace",
"val" : "uberon"
} ]
},
"type" : "PROPERTY",
"lbl" : "transitively anteriorly connected to"
}
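The filtering idea can be sketched as a split on the "type" field of each node record. This is a minimal illustration with a hypothetical function name, not the code in multi_ont_to_json_kg.py. The two PROPERTY records are abbreviated from the examples above; the CLASS record is an assumption about how an owl:Class such as NCBITaxon_subfamily would appear in the same JSON. As noted above, anything typed PROPERTY is dropped, which would include relation nodes.

```python
def split_property_nodes(nodes: list) -> tuple:
    """Separate nodes typed "PROPERTY" from everything else.

    Returns (kept, dropped), each a list of node dicts in input order.
    """
    kept, dropped = [], []
    for node in nodes:
        if node.get("type") == "PROPERTY":
            dropped.append(node)
        else:
            kept.append(node)
    return kept, dropped


nodes = [
    {"id": "http://purl.bioontology.org/ontology/OMIM/MIMTYPE",
     "type": "PROPERTY", "lbl": "OMIM Entry Type"},
    {"id": "http://purl.bioontology.org/ontology/HGNC/PMID",
     "type": "PROPERTY", "lbl": "Pubmed ID"},
    # Assumed shape for a class node; not copied from the thread.
    {"id": "http://purl.obolibrary.org/obo/NCBITaxon_subfamily",
     "type": "CLASS", "lbl": "subfamily"},
]

kept, dropped = split_property_nodes(nodes)
print([n["lbl"] for n in kept])
print([n["lbl"] for n in dropped])
```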
Regarding my previous comment, I noticed that this code already exists in multi_ont_to_json_kg.py:
https://github.com/RTXteam/RTX/blob/78b8565ed70de882796f25d948bf18524728bf7b/code/kg2/multi_ont_to_json_kg.py#L725-L728
The problem is that only nodes without a category label are handled by that code block. Per this code:
https://github.com/RTXteam/RTX/blob/33b50ae7ce9c42b8dbc19fc5be86990a4b38cfbc/code/kg2/curies-to-categories.yaml#L2-L35
nodes from many of the sources listed above (including OMIM, HGNC, and UBERON) never reach that code block. I am thinking of removing the if node_category_label is None: requirement. @saramsey does that seem reasonable?
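To make the effect of that guard concrete, here is a hypothetical reconstruction of the control flow. It is not the actual code at the linked lines: the function, the `guard` flag, and the one-entry prefix map are all illustrative assumptions. Only the HGNC-to-gene rule and the "information content entity" label are taken from this thread.

```python
from typing import Optional

# Hypothetical stand-in for the prefix rules in curies-to-categories.yaml;
# only the HGNC -> gene rule is taken from this thread.
PREFIX_TO_CATEGORY = {"HGNC": "gene"}
PROPERTY_CATEGORY = "information content entity"


def categorize(curie: str, node_type: str, guard: bool) -> Optional[str]:
    """Return a category label; `guard` toggles the `is None` requirement."""
    prefix = curie.split(":")[0]
    label = PREFIX_TO_CATEGORY.get(prefix)
    # With the guard on, a node already labeled via its prefix is never
    # re-examined, so a PROPERTY node under HGNC stays "gene".
    if node_type == "PROPERTY" and (label is None or not guard):
        label = PROPERTY_CATEGORY
    return label


print(categorize("HGNC:PMID", "PROPERTY", guard=True))   # gene
print(categorize("HGNC:PMID", "PROPERTY", guard=False))  # information content entity
```

In this sketch, class-typed nodes keep their prefix-based label either way, so dropping the guard only changes nodes typed PROPERTY.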
Seems reasonable. I think this is a good example of where a test build (to sanity check) would be helpful. One build with the change, and one without. Can then compare.
Outstanding sleuthing, @ericawood !
I only tested it on biolink-model.owl.ttl, umls-hgnc.ttl, and umls-omim.ttl, but it does appear that the fix introduced an unintended bug due to the following line. Essentially, the source for any of these PROPERTY nodes is now UMLS_STY.
https://github.com/RTXteam/RTX/blob/36699fb0285c261ec2adfac6436dabddfaccc9e2/code/kg2/multi_ont_to_json_kg.py#L805-L806
(when viewing that line, please remember that BIOLINK_CATEGORY_ATTRIBUTE is now "information content entity" per 2f48bb6)
Here is what the old HGNC:PMID node looked like:
{
"category": "biolink:Gene",
"category_label": "gene",
"creation_date": null,
"deprecated": false,
"description": "COMMENTS: Pubmed ID",
"full_name": null,
"id": "HGNC:PMID",
"iri": "https://identifiers.org/hgnc:PMID",
"name": "Pubmed ID",
"provided_by": "umls_source:HGNC",
"publications": [],
"replaced_by": null,
"synonym": [],
"update_date": "2019"
},
Here is what the new HGNC:PMID node looks like. Note that its provided_by field is UMLS_STY: rather than umls_source:HGNC as it was before.
{
"category": "biolink:InformationContentEntity",
"category_label": "information_content_entity",
"creation_date": null,
"deprecated": false,
"description": "COMMENTS: Pubmed ID",
"full_name": null,
"id": "HGNC:PMID",
"iri": "https://identifiers.org/hgnc:PMID",
"name": "Pubmed ID",
"provided_by": "UMLS_STY:",
"publications": [],
"replaced_by": null,
"synonym": [],
"update_date": "2019"
},
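When comparing builds, differences like the one between the two records above can be found mechanically rather than by eye. This is a minimal sketch with a hypothetical helper name; the two dictionaries are abbreviated copies of the old and new HGNC:PMID nodes shown above.

```python
def diff_node(old: dict, new: dict) -> dict:
    """Map each differing field to its (old, new) pair of values."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


old_node = {
    "id": "HGNC:PMID",
    "category": "biolink:Gene",
    "category_label": "gene",
    "name": "Pubmed ID",
    "provided_by": "umls_source:HGNC",
}
new_node = {
    "id": "HGNC:PMID",
    "category": "biolink:InformationContentEntity",
    "category_label": "information_content_entity",
    "name": "Pubmed ID",
    "provided_by": "UMLS_STY:",
}

for field, (before, after) in diff_node(old_node, new_node).items():
    print(f"{field}: {before!r} -> {after!r}")
```

Run over every node ID shared by two builds, a helper like this would surface both the intended category change and the unintended provided_by change.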
In addition, nodes from the biolink-model.owl.ttl files that were previously biolink:OntologyClass's are now biolink:InformationContentEntity's.
@saramsey What are your thoughts on this?
What happens if you comment out L805-806?
In addition, nodes from the biolink-model.owl.ttl files that were previously biolink:OntologyClass's are now biolink:InformationContentEntity's.
Actually I think this is a good thing. I just checked and biolink:OntologyClass is actually a mixin
https://github.com/biolink/biolink-model/blob/bd3607404bae3677bc8fa6de16067c8abfab56b6/biolink-model.yaml#L4869
so it is best if we do not use it. I think biolink:InformationContentEntity is a good substitute to use. Nice work!
What happens if you comment out L805-806?
This appeared to fix that issue:
{
"category": "biolink:InformationContentEntity",
"category_label": "information_content_entity",
"creation_date": null,
"deprecated": false,
"description": "COMMENTS: Pubmed ID",
"full_name": null,
"id": "HGNC:PMID",
"iri": "https://identifiers.org/hgnc:PMID",
"name": "Pubmed ID",
"provided_by": "umls_source:HGNC",
"publications": [],
"replaced_by": null,
"synonym": [],
"update_date": "2019"
},
I'll commit the change shortly.
This looks mostly, but not entirely, fixed in KG2.6.0:
match (n) where n.id in ["HGNC:PMID", "NCBITaxon:subclass", "NCBITaxon:has_rank", "NCBITaxon:in_part", "NCBITaxon:infraorder", "NCBITaxon:subfamily", "NCBITaxon:genbank_common_name", "NCBITaxon:misnomer", "NCBITaxon:superkingdom", "NCBITaxon:RANK", "NCBITaxon:DIV", "NCIT:MSTS", "NCIT:CDNH", "NCIT:HNH", "NCIT:SPAAT", "NCIT:ePRO", "NCIT:BBPS", "NCIT:CPTAC", "NCIT:MPSImP", "NCIT:BIRADS", "RXNORM:contained_in", "RXNORM:RXN_BN_CARDINALITY", "RXNORM:RXN_STRENGTH", "RXNORM:ingredient_of", "RXNORM:RXN_QUALITATIVE_DISTINCTION", "RXNORM:precise_ingredient_of", "RXNORM:RXN_BOSS_AM", "RXNORM:RXN_BOSS_AI", "RXNORM:RXN_BOSS_FROM", "NDDF:FL", "NCIT:TARGET", "NCIT:Alliance", "ICD10:CODE_ALSO", "ICD10:ORDER_NO", "ICD10:NOTE", "ICD10:CODE_FIRST", "ICD10:SIB", "ICD10:USE_ADDITIONAL", "PR:PRO-common-name", "PR:PRO-proteoform-ftid", "PR:PRO-proteoform-std", "PR:lacks_part", "PR:has_gene_template", "PR:PSI-MOD-label", "HGNC:GENESYMBOL", "HGNC:MAPPED_UCSC_ID", "HGNC:LOCUS_GROUP", "HGNC:EZ", "HGNC:ENTREZGENE_ID", "HGNC:PREV_SYMBOL", "HGNC:OMIM_ID", "HGNC:DATE_NAME_CHANGED"] return n.id, n.category_label
n.id | n.category_label |
---|---|
"HGNC:PREV_SYMBOL" | "information_content_entity" |
"HGNC:DATE_NAME_CHANGED" | "information_content_entity" |
"HGNC:EZ" | "information_content_entity" |
"HGNC:LOCUS_GROUP" | "information_content_entity" |
"HGNC:OMIM_ID" | "information_content_entity" |
"HGNC:MAPPED_UCSC_ID" | "information_content_entity" |
"HGNC:ENTREZGENE_ID" | "information_content_entity" |
"HGNC:GENESYMBOL" | "information_content_entity" |
"HGNC:PMID" | "information_content_entity" |
"ICD10:USE_ADDITIONAL" | "information_content_entity" |
"ICD10:SIB" | "information_content_entity" |
"ICD10:CODE_FIRST" | "information_content_entity" |
"ICD10:NOTE" | "information_content_entity" |
"ICD10:ORDER_NO" | "information_content_entity" |
"ICD10:CODE_ALSO" | "information_content_entity" |
"NCBITaxon:DIV" | "information_content_entity" |
"NCBITaxon:RANK" | "information_content_entity" |
"NCIT:BIRADS" | "disease_or_phenotypic_feature" |
"NCIT:Alliance" | "disease" |
"NCIT:MPSImP" | "disease_or_phenotypic_feature" |
"NCIT:CPTAC" | "disease_or_phenotypic_feature" |
"NCIT:BBPS" | "disease_or_phenotypic_feature" |
"NCIT:ePRO" | "disease_or_phenotypic_feature" |
"NCIT:SPAAT" | "disease_or_phenotypic_feature" |
"NCIT:CDNH" | "disease_or_phenotypic_feature" |
"NCIT:HNH" | "disease_or_phenotypic_feature" |
"NCIT:MSTS" | "disease_or_phenotypic_feature" |
"NCIT:TARGET" | "disease" |
"NDDF:FL" | "drug" |
"RXNORM:RXN_BOSS_FROM" | "information_content_entity" |
"RXNORM:RXN_BOSS_AI" | "information_content_entity" |
"RXNORM:RXN_BOSS_AM" | "information_content_entity" |
"RXNORM:precise_ingredient_of" | "information_content_entity" |
"RXNORM:RXN_QUALITATIVE_DISTINCTION" | "information_content_entity" |
"RXNORM:ingredient_of" | "information_content_entity" |
"RXNORM:RXN_STRENGTH" | "information_content_entity" |
"RXNORM:RXN_BN_CARDINALITY" | "information_content_entity" |
"RXNORM:contained_in" | "information_content_entity" |
"PR:lacks_part" | "information_content_entity" |
"PR:has_gene_template" | "information_content_entity" |
"NCBITaxon:subclass" | "organism_taxon" |
"NCBITaxon:superkingdom" | "organism_taxon" |
"NCBITaxon:infraorder" | "organism_taxon" |
"NCBITaxon:in_part" | "information_content_entity" |
"NCBITaxon:misnomer" | "information_content_entity" |
"NCBITaxon:genbank_common_name" | "information_content_entity" |
"NCBITaxon:subfamily" | "organism_taxon" |
"PR:PRO-proteoform-ftid" | "information_content_entity" |
"PR:PRO-proteoform-std" | "information_content_entity" |
"PR:PRO-common-name" | "information_content_entity" |
"PR:PSI-MOD-label" | "information_content_entity" |
"NCBITaxon:has_rank" | "information_content_entity" |
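Grouping the query results by category_label makes the IDs that did not become information_content_entity stand out. This is a minimal sketch with a hypothetical function name; the rows are a small subset of the table above. Whether each remaining ID is actually miscategorized still needs checking against the source, since, for example, NCBITaxon:subfamily is declared as an owl:Class in taxslim.owl.

```python
from collections import defaultdict


def group_by_label(rows: list) -> dict:
    """Group node IDs by category_label, preserving input order within groups."""
    groups = defaultdict(list)
    for node_id, label in rows:
        groups[label].append(node_id)
    return dict(groups)


# A subset of the (n.id, n.category_label) rows from the table above.
rows = [
    ("HGNC:PMID", "information_content_entity"),
    ("NCBITaxon:RANK", "information_content_entity"),
    ("NCIT:BIRADS", "disease_or_phenotypic_feature"),
    ("NCIT:TARGET", "disease"),
    ("NDDF:FL", "drug"),
    ("NCBITaxon:subfamily", "organism_taxon"),
]

groups = group_by_label(rows)
others = {label: ids for label, ids in groups.items()
          if label != "information_content_entity"}
for label, ids in others.items():
    print(label, ids)
```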
I just found some CURIEs in KG2.5.2 that might be invalid. Most of them are isolated, but some of them are clustered with other CURIEs in KG2.5.2C and have links to other CURIEs.
Here is one example:
This curie is "HGNC:PMID"
In KG2.5.2C, it is clustered with "OBI:0001617" which is classified as 'biolink:Gene' but has name 'Pubmed ID'
In KG2.5.2, I found lots of curies like this case: