Closed dkoslicki closed 1 year ago
Thanks for the issue report. Classifying this as a bug. That node is a total mess.
Proposal: invent a category "gene set" and use it in NCIT and other ontologies where we have terms that aggregate multiple genes.
look at DOID:225; this is a "disease category"
Also, see related
See also http://disease-ontology.org/term/DOID%3A225/
match (n:disease{id:"DOID:225"}) return n
Steve to study the technical feasibility of this
See also nodes like: match (n:gene) where n.name="protein_coding_gene" return n
i.e. a node with the name protein_coding_gene and ID SO:0001217
@saramsey close due to RTXteam/RTX#788?
not ready to close this yet
@dkoslicki is this still an issue? Do you have any Cypher examples?
I just ran a test (on kg2endpoint.rtx.ai) which shows that the SO node with name gene is now showing up as gene grouping.
Looks like SO:0001217 is coming back as having category label of named thing, which I am not sure is a great improvement?
Would appreciate some guidance here on the extent to which the above address the issue. Is this still a pain point for your reasoning code?
So, UMLS:C0017337 is coming back as having a category label of genomic entity. So, uh, do I get points for variety?
BioLink says that the parent of gene is gene_or_gene_product: https://biolink.github.io/biolink-model/docs/Gene
I've updated curies-to-categories.yaml to map SO:0001217 to the category label of gene.
@edeutsch thank you for pointing that out; that may be useful in cases where we want to represent a gene and its product(s) using a single concept.
I note that UMLS:C0017337 (at least in the HL7 source dataset) is annotated as having the UMLS semantic type code T028 (Gene or Genome):
<http://purl.bioontology.org/ontology/HL7/C0017337> a owl:Class ;
skos:prefLabel """gene"""@en ;
skos:notation """C0017337"""^^xsd:string ;
skos:definition """<p><b>Description:</b>A DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence, transcription or homology</p>"""@en ;
rdfs:subClassOf <http://purl.bioontology.org/ontology/HL7/C3243737> ;
<http://purl.bioontology.org/ontology/HL7/HL7CS> """active"""^^xsd:string ;
<http://purl.bioontology.org/ontology/HL7/HL7ID> """22651"""^^xsd:string ;
<http://purl.bioontology.org/ontology/HL7/HL7PL> """true"""^^xsd:string ;
UMLS:has_cui """C0017337"""^^xsd:string ;
UMLS:has_tui """T028"""^^xsd:string ;
UMLS:has_sty <http://purl.bioontology.org/ontology/STY/T028> ;
So since T028 is Gene or Genome, KG2 maps it to the Biolink category genomic entity, which encompasses both. I'm not sure what I should do differently, vis-a-vis the category label mapping for C0017337.
I note that the name attribute of UMLS:C0017337 is gene (singular).
I think it refers to an abstract gene (singular), not a family of genes per se.
closing this issue as it doesn't appear there is anything else actionable at this point; happy to re-open it if there is more that I should do
@chunyuma inquired (on 4/7):
... In KG2.5.2, there is a node with curie id “SO:0001217”, which I think is hard-coded as “biolink:Gene” based on the RTX/code/kg2/curies-to-categories.yaml. But the name of this node is “protein_coding_gene”, which seems like a generic concept rather than a specific gene. So I’m curious why this node is hard-coded as “biolink:Gene” rather than a more generic category like “biolink:GenomicEntity” or even “biolink:NamedThing”. Actually, I was confused about this question for a long time because I did find some “generic” concepts in some specific “biolink” categories. Sorry, I can’t find another example to show you here now. But I’m just curious about how the hard-coded category is determined.
Chunyu brings up a good point (that has been previously raised by @dkoslicki). Let's revisit the question: in the Biolink metamodel, what exactly are the semantics of the relationship between a node and its category? To respond to this, I will lean on empirical evidence since that is what I have available. In the KGX Format Page, we see that MONDO:0005002 (chronic obstructive pulmonary disease) is a node with category biolink:Disease.
And yet, MONDO:0005002 is not a leaf node; it has subclasses, as can be seen from EBI OLS:
https://www.ebi.ac.uk/ols/ontologies/mondo/terms?short_form=MONDO_0005002
So MONDO:0005002 is a concept representing a collection of more specific disease types, and yet it has Biolink category biolink:Disease, just like leaf disease types such as MONDO:0011751 (COPD, severe early onset), as reported by the SRI Node Normalization Service.
The same node-to-category semantics can be seen in the Gene Ontology. Consider the non-leaf concept GO:0048514 (blood vessel morphogenesis) and the leaf concept GO:0001525 (angiogenesis).
Both of those GO terms have category biolink:BiologicalProcess.
If we accept the above as valid in the Biolink metamodel, on what basis would we object to having both a non-leaf and a leaf concept for gene carry the category annotation biolink:Gene? I guess, what in the Biolink metamodel encodes that the node-to-category semantics are different for genes than for biological processes or diseases?
Let's look at the Biolink metamodel's definition of the category slot:
So the domain is entity, which certainly seems like it could encompass both a basket-type concept like SO:0000704 as well as a leaf-type concept like HGNC:12345. I see no reason why those are required to be mutually exclusive, and indeed if they were required to be so, the requirement is clearly only for certain categories, per the above examples.
To my mind, the strongest empirical evidence that biolink:Gene is the correct category for a basket-type node like SO:0000704 comes from the SRI Reference KG itself. I queried the SRI Reference KG for SO:0000704 to see which Biolink category the reference KG assigns to it, and lo and behold, it is biolink:Gene.
[
{
"iri": "http://purl.obolibrary.org/obo/SO_0000704",
"synonym": [
"INSDC_feature:gene"
],
"xref": "http://en.wikipedia.org/wiki/Gene",
"name": "gene",
"description": "A region (or regions) that includes all of the sequence elements necessary to encode a functional transcript. A gene may include regulatory regions, transcribed regions and/or other functional sequence regions.",
"provided_by": [
"monarch-ontologies",
"panther",
"animalqtldb",
"bgee",
"flybase",
"go",
"impc",
"kegg",
"mgi",
"mgislim",
"mmrrc",
"omim",
"string",
"wormbase",
"zfin"
],
"id": "SO:0000704",
"category": [
"biolink:SequenceFeature",
"biolink:GenomicEntity",
"biolink:NamedThing",
"biolink:Gene"
],
"subsets": "SOFA"
}
]
The same goes for SO:0001217 (protein-coding gene):
[
{
"iri": "http://purl.obolibrary.org/obo/SO_0001217",
"synonym": [
"protein coding gene"
],
"name": "protein_coding_gene",
"provided_by": [
"ncbigene",
"monarch-ontologies",
"ensembl",
"hgnc",
"mgi",
"omia",
"wormbase",
"zfin"
],
"description": "A gene that codes for an RNA that can be translated into a protein.",
"id": "SO:0001217",
"category": [
"biolink:NamedThing",
"biolink:GenomicEntity",
"biolink:Gene",
"biolink:SequenceFeature"
],
"subsets": "Alliance_of_Genome_Resources"
}
]
Thanks @saramsey for revisiting this issue and looking into it in more detail. The reason I raised it actually comes from the explainable DTD model. We hope to integrate more biological feature info (e.g. gene sequence for biolink:Gene, protein sequence for biolink:Protein, and SMILES sequence for biolink:Drug or biolink:ChemicalSubstance) into the model. For some non-leaf nodes like SO:0000704, if we consider them to be the same node type as the leaf-type concepts, it is not easy to assign the biological feature info to them. For example, SO:0000704 has no gene sequence. So that's why I think it might be better to put these generic concepts to some Entity node types.
I think it might be better to put these generic concepts to some Entity node types.
Understood, but in light of the above evidence, I don't think we can do that in KG2/KG2C and still be Biolink standard-compliant (and conforming to the Biolink standard is one of the things we agreed to in our contract with NIH). In DTD, is it not possible to just ignore any node from the SO, or any node whose category is biolink:Protein or biolink:Gene and that has biolink:subclass_of descendants?
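A minimal sketch of the filter suggested above, in Python, assuming the KG has been loaded into memory as a node-to-categories mapping plus (subject, predicate, object) triples. The function name `find_basket_nodes`, the data layout, and the toy subclass_of edges are all hypothetical illustrations, not part of KG2 or DTD. A node with any biolink:subclass_of descendant necessarily has a direct child, so checking for direct children is enough.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical edge layout: (subject, predicate, object) triples.
Edge = Tuple[str, str, str]


def find_basket_nodes(
    node_categories: Dict[str, List[str]],
    edges: Iterable[Edge],
    categories_of_interest: frozenset = frozenset({"biolink:Gene", "biolink:Protein"}),
) -> Set[str]:
    """Return nodes in the given categories that have at least one subclass."""
    # A node is a parent if it appears as the object of a subclass_of edge.
    parents = {obj for _, pred, obj in edges if pred == "biolink:subclass_of"}
    return {
        node_id
        for node_id in parents
        if categories_of_interest.intersection(node_categories.get(node_id, []))
    }


# Toy data for illustration only; these edges are assumed, not taken from KG2.
toy_categories = {
    "SO:0000704": ["biolink:Gene"],
    "SO:0001217": ["biolink:Gene"],
    "HGNC:12345": ["biolink:Gene"],
}
toy_edges = [
    ("SO:0001217", "biolink:subclass_of", "SO:0000704"),
    ("HGNC:12345", "biolink:subclass_of", "SO:0001217"),
]
basket_nodes = find_basket_nodes(toy_categories, toy_edges)
```

In this toy graph the two SO terms are flagged and the leaf HGNC:12345 is kept, so DTD could drop `basket_nodes` before assigning sequence features.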
The other option might be to build a modified KG (presumably derived from KG2C) just for DTD. It would not have to be Biolink standard-compliant because it would not be exposed as a KP.
@dkoslicki, what are your thoughts?
The other option might be to build a modified KG (presumably derived from KG2C) just for DTD. It would not have to be Biolink standard-compliant because it would not be exposed as a KP.
Thanks for this suggestion @saramsey. This is actually what I'm doing right now. It makes sense to keep the original node type for these generic concepts in KG2/KG2c in order to follow the Biolink standard. But can you give me some idea of an easy way to identify these nodes in KG2? Can I assume that all curies that are hard-coded in this yaml file have a high probability of being these generic concepts?
@chunyuma @saramsey I think the option that Chunyu is pursuing, removing those nodes for the DTD-specific KG2/C, makes sense, as that KG does not need to be standards compliant.
Interestingly though, leaving the node "gene" as labeled with "gene" will result in the following: if you take a single (actual) gene and ask for all genes connected in two hops, you will get all genes back (since each connects to the node named "gene"). I wonder if it's worth bringing this up to the Biolink people that root nodes for a specific category might require a different category. E.g., the root node of the "gene" category gets the category "gene category" instead of the current "gene".
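To make the two-hop effect concrete, here is a toy sketch in Python (not ARAX code). The graph, the node names `gene_A` through `gene_D` and `pathway_X`, and the helper `two_hop_neighbors` are made up for illustration; only SO:0000704 comes from this thread, and the assumption that every gene has an edge to it follows the description above.

```python
from collections import defaultdict


def two_hop_neighbors(edges, start):
    """Return every node reachable from `start` in exactly two undirected hops."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    reached = set()
    for middle in adjacency[start]:
        reached |= adjacency[middle]
    reached.discard(start)  # do not report the start node itself
    return reached


HUB = "SO:0000704"  # the concept node named "gene"
genes = ["gene_A", "gene_B", "gene_C", "gene_D"]

# Assumed toy topology: every gene links to the hub; A and B also share a pathway.
hub_edges = [(g, HUB) for g in genes]
real_edges = [("gene_A", "pathway_X"), ("gene_B", "pathway_X")]

with_hub = two_hop_neighbors(hub_edges + real_edges, "gene_A")
without_hub = two_hop_neighbors(real_edges, "gene_A")
```

With the hub present, `with_hub` contains every other gene; with it removed, only `gene_B` (the one sharing a pathway) remains.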
if you take a single (actual) gene and ask for all genes connected in two hops, you will get all genes back (since each connects to the node named "gene").
Good point. That is... not ideal. I can see a couple of possible countermeasures:
I kinda suspect that option (2) may have other uses, outside of just the "Gene" situation.
I like option (2) as well. That’s one of the benefits of using the Fisher exact test (avoid hubs). But making this “avoid hubs” option explicit would be great! Would this be an expand thing, or a filter thing?
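As a rough illustration of why a Fisher-style test down-weights hubs, here is a sketch of the one-sided Fisher exact p-value, computed as a hypergeometric tail. The function name, the parameter names, and the numbers are assumptions for illustration; this is not necessarily how ARAX computes it.

```python
from math import comb


def fisher_exact_enrichment(k, n, K, N):
    """One-sided p-value P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: number of nodes in the universe (e.g. all genes)
    K: how many of those the candidate node is connected to (its degree)
    n: size of the query set
    k: how many query nodes the candidate node is connected to
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical numbers: a universe of 1000 genes and a query set of 5 genes.
# A hub connected to all 1000 genes trivially touches all 5 query genes.
p_hub = fisher_exact_enrichment(k=5, n=5, K=1000, N=1000)
# A specific node connected to exactly those 5 genes and nothing else.
p_specific = fisher_exact_enrichment(k=5, n=5, K=5, N=1000)
```

The hub gets a p-value of 1 (no evidence of enrichment) even though it touches every query gene, while the specific node gets a very small one.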
that's a cool idea about avoiding 'super-hubs'! I could see that going in expand... recently I've been wondering about incorporating something similar directly into plover itself, due to how aggressive the combinatorial explosion is with KG2c... would be pretty cool to avoid having to return nodes/edges that will just be filtered out, and plover can easily have constant time access to node degrees, etc...
+1 for constant time access to node degrees
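A minimal sketch of the "avoid super-hubs during expand" idea, assuming a precomputed degree index. The names `build_degree_index`, `expand_skipping_hubs`, and `max_degree` are hypothetical and are not plover's API; the toy edge list is made up.

```python
from collections import Counter


def build_degree_index(edges):
    """Count each node's degree once, so later lookups are constant time."""
    degrees = Counter()
    for subject, obj in edges:
        degrees[subject] += 1
        degrees[obj] += 1
    return degrees


def expand_skipping_hubs(edges, start, degrees, max_degree):
    """Return neighbors of `start`, skipping any with degree above max_degree."""
    neighbors = set()
    for subject, obj in edges:
        if subject == start:
            other = obj
        elif obj == start:
            other = subject
        else:
            continue
        if degrees[other] <= max_degree:  # O(1) lookup in the degree index
            neighbors.add(other)
    return neighbors


# Toy graph: four genes attached to the "gene" concept node, plus one pathway edge.
toy_edges = [(g, "SO:0000704") for g in ("gene_A", "gene_B", "gene_C", "gene_D")]
toy_edges.append(("gene_A", "pathway_X"))

degree_index = build_degree_index(toy_edges)
filtered = expand_skipping_hubs(toy_edges, "gene_A", degree_index, max_degree=3)
unfiltered = expand_skipping_hubs(toy_edges, "gene_A", degree_index, max_degree=10)
```

The edge scan here is linear only to keep the sketch short; the point is that the hub check itself is a single dictionary lookup, so hub nodes can be dropped before they are ever returned.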
tagging the filter crew and expand crew to work out the details
removing the kg2 label as this seems to have pivoted to more of a filter/ expand issue. feel free to add back if I've misinterpreted :)
Removing myself. Also, is this still relevant @saramsey @dkoslicki ?
transferred to RTX-KG2 repo
There are a number of (highly connected) nodes of a certain bioentity type (e.g. "gene") that are not actually that bioentity type, but rather a concept involving that type. For example, the concept of a gene is not an actual gene.
match (n:gene{id:"CUI:C0017337"}) return n
Also on next week's agenda.