RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

What should be the category labels for nodes that represent gene families or protein families? #268

Closed dkoslicki closed 1 year ago

dkoslicki commented 5 years ago

A number of (highly connected) nodes of a certain bioentity type (e.g. "gene") are not actually that bioentity type, but rather a concept involving that type. For example, the concept of a gene is not an actual gene: match (n:gene{id:"CUI:C0017337"}) return n

Also on next week's agenda.

saramsey commented 5 years ago

Thanks for the issue report. Classifying this as a bug. That node is a total mess.

saramsey commented 5 years ago

Proposal: invent a category "gene set" and use it in NCIT and other ontologies where we have terms that aggregate multiple genes.

saramsey commented 5 years ago

look at DOID:225; this is a "disease category"

dkoslicki commented 5 years ago

See also http://disease-ontology.org/term/DOID%3A225/ match (n:disease{id:"DOID:225"}) return n

saramsey commented 5 years ago

Steve to study the technical feasibility of this

dkoslicki commented 5 years ago

See also nodes like: match (n:gene) where n.name="protein_coding_gene" return n i.e. a node with the name protein_coding_gene with ID SO:0001217

dkoslicki commented 4 years ago

@saramsey close due to RTXteam/RTX#788?

saramsey commented 4 years ago

not ready to close this yet

saramsey commented 4 years ago

@dkoslicki is this still an issue? Do you have any Cypher examples?

I just ran a test (on kg2endpoint.rtx.ai) which shows that the SO node with name gene is now showing up as gene grouping:

Screen Shot 2020-07-16 at 2 57 35 PM

saramsey commented 4 years ago

Looks like SO:0001217 is coming back as having category label of named thing which I am not sure is a great improvement?

Would appreciate some guidance here on the extent to which the above address the issue. Is this still a pain point for your reasoning code?

saramsey commented 4 years ago

So, UMLS:C0017337 is coming back as having a category label of genomic entity. So, uh, do I get points for variety?

Screen Shot 2020-07-16 at 3 01 45 PM

edeutsch commented 4 years ago

BioLink says that the parent of gene is gene_or_gene_product: https://biolink.github.io/biolink-model/docs/Gene

saramsey commented 4 years ago

I've updated curies-to-categories.yaml to map SO:0001217 to the category label of gene.

@edeutsch thank you for pointing that out; that may be useful in cases where we want to represent a gene and its product(s) using a single concept.

saramsey commented 4 years ago

I note that UMLS:C0017337 (at least in the HL7 source dataset) is annotated as having the UMLS semantic type code T028 (Gene or Genome)

<http://purl.bioontology.org/ontology/HL7/C0017337> a owl:Class ;
        skos:prefLabel """gene"""@en ;
        skos:notation """C0017337"""^^xsd:string ;
        skos:definition """<p><b>Description:</b>A DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence, transcription or homology</p>"""@en ;
        rdfs:subClassOf <http://purl.bioontology.org/ontology/HL7/C3243737> ;
        <http://purl.bioontology.org/ontology/HL7/HL7CS> """active"""^^xsd:string ;
        <http://purl.bioontology.org/ontology/HL7/HL7ID> """22651"""^^xsd:string ;
        <http://purl.bioontology.org/ontology/HL7/HL7PL> """true"""^^xsd:string ;
        UMLS:has_cui """C0017337"""^^xsd:string ;
        UMLS:has_tui """T028"""^^xsd:string ;
        UMLS:has_sty <http://purl.bioontology.org/ontology/STY/T028> ;

So since T028 is Gene or Genome, KG2 maps it to the Biolink category genomic entity which encompasses both. I'm not sure what I should do differently, vis-a-vis the category label mapping for C0017337.
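A minimal sketch of the lookup order described above, assuming that a per-CURIE override (as in curies-to-categories.yaml) takes precedence over the semantic-type mapping. The two tables, the function name `resolve_category`, and the "named thing" fallback are hypothetical and only illustrate the idea; they are not the actual KG2 build code.

```python
from typing import Optional

# Hypothetical tables for illustration only; the real KG2 mapping files
# are not reproduced here and may be organized differently.
TUI_TO_CATEGORY = {
    "T028": "genomic entity",  # UMLS semantic type "Gene or Genome"
}
CURIE_TO_CATEGORY = {
    "SO:0001217": "gene",  # per-CURIE override mentioned earlier in this thread
}


def resolve_category(curie: str, tui: Optional[str]) -> str:
    """Return a category label: CURIE override first, then TUI, then a fallback."""
    if curie in CURIE_TO_CATEGORY:
        return CURIE_TO_CATEGORY[curie]
    if tui is not None and tui in TUI_TO_CATEGORY:
        return TUI_TO_CATEGORY[tui]
    # Assumed fallback when nothing more specific is known.
    return "named thing"
```

Under these assumptions, changing the label for C0017337 would mean adding a per-CURIE entry rather than changing how T028 is mapped.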

saramsey commented 4 years ago

I note that the name attribute of UMLS:C0017337 is gene (singular)

https://ncim.nci.nih.gov/ncimbrowser/pages/concept_details.jsf?type=sources&code=C0017337&sab=CSP&sourcecode=1256-5501

I think it refers to an abstract gene (singular), not a family of genes per se.

saramsey commented 4 years ago

closing this issue as it doesn't appear there is anything else actionable at this point; happy to re-open it if there is more that I should do

saramsey commented 3 years ago

@chunyuma inquired (on 4/7):

... In KG2.5.2, there is a node with curie id “SO:0001217”, which I think is hard-coded as “biolink:Gene” based on the RTX/code/kg2/curies-to-categories.yaml. But the name of this node is “protein_coding_gene” which seems like a generic concept rather than a specific gene. So I’m curious why this node is hard-coded as “biolink:Gene” rather than a more generic category like “biolink:GenomicEntity” or even “biolink:NamedThing”. Actually, I was confused about this question for a long time because I did find some “generic” concepts in some specific “biolink” categories. Sorry, I can’t find another example to show you here now. But I’m just curious about how to determine the hard-coded category.

saramsey commented 3 years ago

Chunyu brings up a good point (that has been previously raised by @dkoslicki). Let's revisit the question: in the Biolink metamodel, what exactly are the semantics of the relationship between a node and its category? To respond to this, I will lean on empirical evidence since that is what I have available. In the KGX Format Page, we see that MONDO:0005002 (chronic obstructive pulmonary disease) is a node with category biolink:Disease.

Screen Shot 2021-04-09 at 11 00 03 AM

And yet, MONDO:0005002 is not a leaf node, it has subclasses, as can be seen from EBI OLS: https://www.ebi.ac.uk/ols/ontologies/mondo/terms?short_form=MONDO_0005002

Screen Shot 2021-04-09 at 11 00 47 AM

So MONDO:0005002 is a concept representing a collection of more specific disease types, and yet, it has Biolink category biolink:Disease, just like leaf disease types such as MONDO:0011751 (COPD, severe early onset), as shown in this result from the SRI Node Normalization Service:

Screen Shot 2021-04-09 at 11 03 09 AM

The same semantic for node-to-category can be seen in the Gene Ontology. Consider the non-leaf concept GO:0048514 (blood vessel morphogenesis) and the leaf concept GO:0001525 (angiogenesis).

Screen Shot 2021-04-09 at 11 07 35 AM

Both of those GO terms have category biolink:BiologicalProcess:

Screen Shot 2021-04-09 at 11 06 37 AM

and

Screen Shot 2021-04-09 at 11 08 13 AM

If we accept the above as valid in the Biolink metamodel, on what basis would we object to both a non-leaf and a leaf gene concept having the category annotation biolink:Gene? I guess, what in the Biolink metamodel encodes that the node-to-category semantic is different for genes than for biological processes or diseases?

Let's look at the Biolink metamodel's definition of the category slot:

Screen Shot 2021-04-09 at 11 12 39 AM

So the domain is entity, which certainly seems like it could encompass both a basket-type concept like SO:0000704 as well as a leaf-type concept like HGNC:12345. I see no reason why those are required to be mutually exclusive, and indeed if they were required to be so, the requirement is clearly only for certain categories, per the above examples.

To my mind, the strongest empirical evidence that biolink:Gene is the correct category for a basket type node like SO:0000704 comes from the SRI Reference KG itself. I queried the SRI Reference KG for SO:0000704 to see what is the Biolink category that the reference KG assigns to SO:0000704, and lo and behold, it is biolink:Gene.

[
  {
    "iri": "http://purl.obolibrary.org/obo/SO_0000704",
    "synonym": [
      "INSDC_feature:gene"
    ],
    "xref": "http://en.wikipedia.org/wiki/Gene",
    "name": "gene",
    "description": "A region (or regions) that includes all of the sequence elements necessary to encode a functional transcript. A gene may include regulatory regions, transcribed regions and/or other functional sequence regions.",
    "provided_by": [
      "monarch-ontologies",
      "panther",
      "animalqtldb",
      "bgee",
      "flybase",
      "go",
      "impc",
      "kegg",
      "mgi",
      "mgislim",
      "mmrrc",
      "omim",
      "string",
      "wormbase",
      "zfin"
    ],
    "id": "SO:0000704",
    "category": [
      "biolink:SequenceFeature",
      "biolink:GenomicEntity",
      "biolink:NamedThing",
      "biolink:Gene"
    ],
    "subsets": "SOFA"
  }
]

The same goes for SO:0001217 (protein-coding gene):

[
  {
    "iri": "http://purl.obolibrary.org/obo/SO_0001217",
    "synonym": [
      "protein coding gene"
    ],
    "name": "protein_coding_gene",
    "provided_by": [
      "ncbigene",
      "monarch-ontologies",
      "ensembl",
      "hgnc",
      "mgi",
      "omia",
      "wormbase",
      "zfin"
    ],
    "description": "A gene that codes for an RNA that can be translated into a protein.",
    "id": "SO:0001217",
    "category": [
      "biolink:NamedThing",
      "biolink:GenomicEntity",
      "biolink:Gene",
      "biolink:SequenceFeature"
    ],
    "subsets": "Alliance_of_Genome_Resources"
  }
]

chunyuma commented 3 years ago

Thanks @saramsey for revisiting this issue and looking into it in more detail. I raised it because of the explainable DTD model. We hope to integrate more biological feature info (e.g. gene sequence for biolink:Gene, protein sequence for biolink:Protein, and SMILES string for biolink:Drug or biolink:ChemicalSubstance) into the model. If we treat non-leaf nodes like SO:0000704 as the same node type as the leaf-type concepts, it is not easy to assign the biological feature info to them. For example, SO:0000704 has no gene sequence. So that's why I think it might be better to put these generic concepts to some Entity node types.

saramsey commented 3 years ago

> I think it might be better to put these generic concepts to some Entity node types.

Understood, but in light of the above evidence, I don't think we can do that in KG2/KG2C and still be Biolink standard-compliant (and conforming to the Biolink standard is one of the things we agreed to in our contract with NIH). In DTD, is it not possible to just ignore any node from the SO, or any node whose category is biolink:Protein or biolink:Gene and that has biolink:subclass_of descendants?
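The filter suggested above could look roughly like the following. This is a sketch on a toy in-memory graph, not DTD or KG2 code: the edge-tuple layout, the function names, and the sample `biolink:subclass_of` edges are assumptions for illustration. A node with any descendant necessarily has at least one direct subclass, so checking direct `biolink:subclass_of` edges is enough.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (subject, predicate, object)


def nodes_with_subclasses(edges: Iterable[Edge]) -> Set[str]:
    """Collect every node that is the object of a biolink:subclass_of edge."""
    return {obj for _subj, pred, obj in edges if pred == "biolink:subclass_of"}


def leaf_like_nodes(categories: Dict[str, str], edges: Iterable[Edge],
                    target_categories: Set[str]) -> List[str]:
    """Keep nodes in the target categories that have no subclasses."""
    parents = nodes_with_subclasses(edges)
    return sorted(
        curie for curie, category in categories.items()
        if category in target_categories and curie not in parents
    )


# Toy data; HGNC:12345 is the placeholder identifier used earlier in the thread,
# and both edges below are hypothetical.
categories = {
    "SO:0000704": "biolink:Gene",
    "SO:0001217": "biolink:Gene",
    "HGNC:12345": "biolink:Gene",
}
edges = [
    ("SO:0001217", "biolink:subclass_of", "SO:0000704"),
    ("HGNC:12345", "biolink:subclass_of", "SO:0001217"),
]
kept = leaf_like_nodes(categories, edges, {"biolink:Gene", "biolink:Protein"})
print(kept)
```

In this toy graph only HGNC:12345 survives, since both SO terms have at least one subclass.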

saramsey commented 3 years ago

The other option might be to build a modified KG (presumably derived from KG2C) just for DTD. It would not have to be Biolink standard-compliant because it would not be exposed as a KP.

saramsey commented 3 years ago

@dkoslicki, what are your thoughts?

chunyuma commented 3 years ago

> The other option might be to build a modified KG (presumably derived from KG2C) just for DTD. It would not have to be Biolink standard-compliant because it would not be exposed as a KP.

Thanks for this suggestion @saramsey. This is actually what I'm doing right now. It makes sense to keep the original node type for these generic concepts in KG2/KG2c in order to follow the Biolink standard. But can you suggest an easy way to identify these nodes in KG2? Can I assume that all CURIEs hard-coded in this YAML file have a high probability of being these generic concepts?

dkoslicki commented 3 years ago

@chunyuma @saramsey I think the option that Chunyu is pursuing, removing those nodes for the DTD-specific KG2/C, makes sense, as that KG does not need to be standards compliant.

Interestingly though, leaving the node "gene" as labeled with "gene" will result in the following: if you take a single (actual) gene and ask for all genes connected in two hops, you will get all genes back (since each connects to the node named "gene"). I wonder if it's worth bringing this up to the Biolink people that root nodes for a specific category might require a different category. E.g. the root node of the "gene" category gets the category "gene category" instead of the current "gene".

saramsey commented 3 years ago

> if you take a single (actual) gene and ask for all genes connected in two hops, you will get all genes back (since each connects to the node named "gene").

Good point. That is... not ideal. I can see a couple of possible countermeasures:

  1. constrain the allowed relations (predicates) for the two-hop query
  2. give ARAX the ability to match two-hop queries while avoiding intermediate "super-hub" nodes whose degree is greater than some threshold like 100.

saramsey commented 3 years ago

I kinda suspect that option (2) may have other uses, outside of just the "Gene" situation.
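Option (2) can be sketched on a toy undirected graph as follows. This is only an illustration of the idea, not ARAX or plover code: the function names, the `max_degree` parameter, and the GENE:/PATHWAY: identifiers are hypothetical, and the toy threshold is 3 rather than 100. Node degree is read from the adjacency map, which is a constant-time lookup per intermediate node.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple


def build_adjacency(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from (node, node) pairs."""
    adj: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def two_hop_neighbors(adj: Dict[str, Set[str]], start: str,
                      max_degree: Optional[int] = None) -> Set[str]:
    """Nodes reachable in exactly two hops, optionally skipping hub intermediates."""
    result: Set[str] = set()
    for mid in adj.get(start, ()):
        # Skip intermediate nodes whose degree exceeds the threshold.
        if max_degree is not None and len(adj[mid]) > max_degree:
            continue
        result.update(adj[mid])
    result.discard(start)
    return result


# Five hypothetical genes all linked to the root "gene" concept SO:0000704,
# plus one hypothetical shared pathway between GENE:1 and GENE:2.
edges = [(f"GENE:{i}", "SO:0000704") for i in range(1, 6)]
edges += [("GENE:1", "PATHWAY:A"), ("GENE:2", "PATHWAY:A")]
adj = build_adjacency(edges)

unfiltered = two_hop_neighbors(adj, "GENE:1")
filtered = two_hop_neighbors(adj, "GENE:1", max_degree=3)
```

Without the threshold, every other gene comes back through SO:0000704; with it, only the gene that shares the pathway remains.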

dkoslicki commented 3 years ago

I like option (2) as well. That’s one of the benefits of using the Fisher exact test (avoid hubs). But making this “avoid hubs” option explicit would be great! Would this be an expand thing, or a filter thing?

amykglen commented 3 years ago

that's a cool idea about avoiding 'super-hubs'! I could see that going in expand... recently I've been wondering about incorporating something similar directly into plover itself, due to how aggressive the combinatorial explosion is with KG2c... would be pretty cool to avoid having to return nodes/edges that will just be filtered out, and plover can easily have constant time access to node degrees and etc...

saramsey commented 3 years ago

> that's a cool idea about avoiding 'super-hubs'! I could see that going in expand... recently I've been wondering about incorporating something similar directly into plover itself, due to how aggressive the combinatorial explosion is with KG2c... would be pretty cool to avoid having to return nodes/edges that will just be filtered out, and plover can easily have constant time access to node degrees and etc...

+1 for constant time access to node degrees

saramsey commented 3 years ago

tagging the filter crew and expand crew to work out the details

kvarforl commented 3 years ago

removing the kg2 label as this seems to have pivoted to more of a filter/ expand issue. feel free to add back if I've misinterpreted :)

finnagin commented 2 years ago

Removing myself. Also, is this still relevant @saramsey @dkoslicki ?

saramsey commented 1 year ago

transferred to RTX-KG2 repo