RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

change biolink category for microRNAs that is assigned in ncbigene_tsv_to_kg_json.py #3

Closed saramsey closed 3 years ago

saramsey commented 3 years ago

There is probably a simple explanation for this. Most likely I've just forgotten, in my old age. But I could use a refresher. In /home/ubuntu/kg2-build/kg2-ncbigene.json on kg2lindsey.rtx.ai, I see the following JSON blob for the node with ID NCBIGene:113839523:

        {
            "id": "NCBIGene:113839523",
            "iri": "https://identifiers.org/ncbigene:113839523",
            "name": "Genetic locus associated with MIR11400",
            "full_name": "microRNA 11400",
            "category": "biolink:MicroRNA",
            "category_label": "microRNA",
            "description": "Type:ncRNA; Locus:7q34; NameStatus:official",
            "synonym": [
                "MIR11400",
                "hsa-mir-11400"
            ],
            "publications": [],
            "creation_date": null,
            "update_date": "20210302",
            "deprecated": false,
            "replaced_by": null,
            "provided_by": "identifiers_org_registry:ncbigene",
            "has_biological_sequence": null
        },

But somehow, in the KG2.5.2 Neo4j, the node's Biolink category has changed to biolink:Gene:

{
  "iri": "https://identifiers.org/ncbigene:113839523",
  "synonym": [
    "MIR11400",
    "hsa-mir-11400"
  ],
  "category_label": "gene",
  "full_name": "microRNA 11400",
  "deprecated": "False",
  "name": "MIR11400",
  "description": "Type:ncRNA; Locus:7q34; NameStatus:official",
  "provided_by": "identifiers_org_registry:ncbigene",
  "id": "NCBIGene:113839523",
  "category": "biolink:Gene",
  "update_date": "20210302"
}

saramsey commented 3 years ago

Makes me wonder if this is due to the merge step. So I did a grep:

grep -c NCBIGene:113839523 kg2-*.json
kg2-chembl.json:0
kg2-dgidb.json:0
kg2-disgenet.json:0
kg2-drugbank.json:0
kg2-ensembl.json:0
kg2-go-annotation.json:0
kg2-hmdb.json:0
kg2-intact.json:0
kg2-jensenlab.json:0
kg2-mirbase.json:0
kg2-ncbigene.json:1
kg2-reactome.json:0
kg2-repodb.json:0
kg2-smpdb.json:0
kg2-unichem.json:0
kg2-uniprotkb.json:0

ecwood commented 3 years ago

Since the kg2 build hasn't finished yet, you don't have kg2-ont.json in your grep. The Biolink category of that node is most likely being overwritten by this line: https://github.com/RTXteam/RTX/blob/525c362b0a9fb4dcf82ad8f025a0e3fd0936bbd5/code/kg2/curies-to-categories.yaml#L19

saramsey commented 3 years ago

Thank you, @ericawood!

ecwood commented 3 years ago

Do you think it would be preferred to have multi_ont_to_kg_json.py node categories favored least in the merge process?

saramsey commented 3 years ago

Actually I think biolink:Gene seems correct here. A record in NCBI Gene, I suppose, is documenting the gene. In the case of NCBIGene:113839523, it so happens that the gene encodes a microRNA (hsa-miR-11400), and that can be encoded in various ways like using a has_gene_product relationship to a separate node representing the microRNA. But I think maybe we should change L118 of ncbigene_tsv_to_kg_json.py to hard-code the Biolink category in that case (a microRNA) to be biolink:Gene. https://github.com/RTXteam/RTX/blob/525c362b0a9fb4dcf82ad8f025a0e3fd0936bbd5/code/kg2/ncbigene_tsv_to_kg_json.py#L118
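A minimal sketch of what hard-coding the category could look like. The function name, parameters, and the subset of node fields are hypothetical; this is not the actual code at L118 of ncbigene_tsv_to_kg_json.py, only an illustration of assigning biolink:Gene regardless of the gene type:

```python
# Hypothetical sketch: names and fields are illustrative, not taken from
# ncbigene_tsv_to_kg_json.py.
def make_ncbigene_node(gene_id: str, symbol: str, full_name: str, gene_type: str) -> dict:
    """Build a KG2-style node for an NCBI Gene record.

    The category is always biolink:Gene, because the record documents the
    gene itself, even when the gene product is a microRNA (gene_type "ncRNA").
    """
    return {
        "id": f"NCBIGene:{gene_id}",
        "iri": f"https://identifiers.org/ncbigene:{gene_id}",
        "name": symbol,
        "full_name": full_name,
        "category": "biolink:Gene",       # hard-coded, independent of gene_type
        "category_label": "gene",
        "description": f"Type:{gene_type}",
        "provided_by": "identifiers_org_registry:ncbigene",
    }


node = make_ncbigene_node("113839523", "MIR11400", "microRNA 11400", "ncRNA")
```

With this shape, the ETL output would already carry the category that the merge step currently produces, so the result would no longer depend on merge order.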

saramsey commented 3 years ago

That change should have no effect on KG2, since KG2 currently has no nodes in it with category biolink:MicroRNA (I've checked). But it is better if we set it to the proper and accepted Biolink category in ncbigene_tsv_to_kg_json.py, so we are not relying on the merge to overwrite the category with the "correct" category (which I guess should be biolink:Gene). Does that make sense?

saramsey commented 3 years ago

Clearly, this is not an urgent matter. It can absolutely be rolled into KG2.6.1.

saramsey commented 3 years ago

Do you think it would be preferred to have multi_ont_to_kg_json.py node categories favored least in the merge process?

Good question. One could argue that via curies-to-categories.yaml, we have fairly fine-grained control over the Biolink category assignments for concepts that arise as nodes in kg2-ont.json, and furthermore, that the CURIE-to-category assignments in that file can in principle be validated against the Biolink metamodel. So I guess maybe it is good that kg2-ont.json is loaded first. What do you think?
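To make the precedence question concrete, here is a minimal sketch of a merge in which the first-loaded source wins on conflicting fields. That "first wins" rule is an assumption inferred from this thread, and `merge_nodes` is a hypothetical name, not the actual KG2 merge script:

```python
# Hypothetical sketch of a "first-loaded source wins" node merge.
# This is an assumption about the merge behavior, not the real KG2 code.
def merge_nodes(sources):
    """Merge node lists from several sources, given in load order.

    sources: list of (source_name, list_of_node_dicts) tuples.
    For a node ID seen more than once, fields already set by an earlier
    source are kept; later sources only fill in missing (None) fields.
    """
    merged = {}
    for _source_name, nodes in sources:
        for node in nodes:
            existing = merged.get(node["id"])
            if existing is None:
                merged[node["id"]] = dict(node)
                continue
            for key, value in node.items():
                if existing.get(key) is None:
                    existing[key] = value
    return merged


merged = merge_nodes([
    ("kg2-ont.json", [{"id": "NCBIGene:113839523", "category": "biolink:Gene"}]),
    ("kg2-ncbigene.json", [{"id": "NCBIGene:113839523",
                            "category": "biolink:MicroRNA",
                            "full_name": "microRNA 11400"}]),
])
```

Under this assumption, loading kg2-ont.json first means its category is the one that survives, while the NCBI Gene record still contributes fields the ontology node lacks.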

saramsey commented 3 years ago

In the ETL script mirbase_dat_to_kg_json.py, we may not want to use biolink:same_as as the relation between an NCBIGene record and a miRBase record. If the miRBase record is describing a microRNA, it might be better to use biolink:has_gene_product with the NCBIGene as "subject" and the miRBase microRNA node as the "object".

https://github.com/RTXteam/RTX/blob/525c362b0a9fb4dcf82ad8f025a0e3fd0936bbd5/code/kg2/mirbase_dat_to_kg_json.py#L169

ecwood commented 3 years ago

That change should have no effect on KG2, since KG2 currently has no nodes in it with category biolink:MicroRNA (I've checked).

The original change to biolink:MicroRNA was made after KG2.5.2 was rolled out. Thus, there may be some nodes in KG2.6.0 from NCBIGene that have the category biolink:MicroRNA. (See https://github.com/RTXteam/RTX/issues/1220#issuecomment-801292107)

If the miRBase record is describing a microRNA, it might be better to use biolink:has_gene_product with the NCBIGene as "subject" and the miRBase microRNA node as the "object".

@saramsey Should we do the same for HGNC? Current edge example:

        {
            "id": "miRBase:MI0000060---biolink:same_as---HGNC:31476---identifiers_org_registry:mirbase",
            "negated": false,
            "object": "HGNC:31476",
            "provided_by": "identifiers_org_registry:mirbase",
            "publications": [],
            "publications_info": {},
            "relation": "biolink:same_as",
            "relation_label": "same_as",
            "subject": "miRBase:MI0000060",
            "update_date": null
        },

HGNC:31476: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:31476

saramsey commented 3 years ago

Thank you @ericawood for pointing this out.

Yes, I think so. At least in KG2.5.2, all HGNC nodes are of category biolink:Gene.


So in the above example, if the node miRBase:MI0000060 has category biolink:MicroRNA, then the edge should be changed so that the subject is HGNC:31476, the object is miRBase:MI0000060, and the relation is biolink:has_gene_product.
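The edge rewrite described here can be sketched as follows. The function name and the prefix list are hypothetical, and the `---`-joined edge ID layout is inferred from the example edge above rather than from the ETL code:

```python
# Hypothetical sketch: turn a miRBase/gene same_as edge into
# gene --has_gene_product--> microRNA. Not the actual mirbase ETL code.
GENE_PREFIXES = ("HGNC:", "NCBIGene:")


def to_has_gene_product(edge: dict) -> dict:
    """Return a rewritten copy of the edge, or the edge unchanged if it
    is not a same_as edge between a miRBase node and a gene node."""
    if edge["relation"] != "biolink:same_as":
        return edge
    subject, obj = edge["subject"], edge["object"]
    if subject.startswith("miRBase:") and obj.startswith(GENE_PREFIXES):
        gene, mirna = obj, subject
    elif obj.startswith("miRBase:") and subject.startswith(GENE_PREFIXES):
        gene, mirna = subject, obj
    else:
        return edge
    relation = "biolink:has_gene_product"
    new_edge = dict(edge)
    new_edge.update(
        subject=gene,
        object=mirna,
        relation=relation,
        relation_label="has_gene_product",
        # ID layout assumed from the example: subject---relation---object---provided_by
        id="---".join([gene, relation, mirna, edge["provided_by"]]),
    )
    return new_edge


old_edge = {
    "id": "miRBase:MI0000060---biolink:same_as---HGNC:31476---identifiers_org_registry:mirbase",
    "subject": "miRBase:MI0000060",
    "object": "HGNC:31476",
    "relation": "biolink:same_as",
    "relation_label": "same_as",
    "provided_by": "identifiers_org_registry:mirbase",
}
new_edge = to_has_gene_product(old_edge)
```

The sketch keys off CURIE prefixes for simplicity; a check on the nodes' actual categories would be closer to the rule stated above.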

saramsey commented 3 years ago

As a general rule, I think, if two nodes in KG2 have different categories, they should probably (ideally?) not share a biolink:same_as edge (I'm shuddering at the thought of someone running an empirical check for how many edges in KG2 violate that rule-of-thumb, however, LOL).

saramsey commented 3 years ago

I couldn't help myself. I ran

match (n)-[r:`biolink:same_as`]->(m) where n.category <> m.category return count(*)

There's uh, a lot of them. Like, over 200k. Sigh.
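The same check could also be run offline against the node and edge lists of a KG2 JSON file, broken down by category pair. This is a sketch under the assumption that nodes carry `id` and `category` and edges carry `subject`, `object`, and `relation`, as in the examples in this thread; the function name is hypothetical:

```python
from collections import Counter


# Hypothetical sketch: count same_as edges whose endpoints have different
# categories, grouped by (subject category, object category).
def count_cross_category_same_as(nodes, edges):
    category_of = {node["id"]: node["category"] for node in nodes}
    counts = Counter()
    for edge in edges:
        if edge["relation"] != "biolink:same_as":
            continue
        subj_cat = category_of.get(edge["subject"])
        obj_cat = category_of.get(edge["object"])
        # Skip edges with an unknown endpoint or matching categories.
        if subj_cat is None or obj_cat is None or subj_cat == obj_cat:
            continue
        counts[(subj_cat, obj_cat)] += 1
    return counts
```

Sorting the result with `counts.most_common()` would show which category pairs account for most of the violations.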

saramsey commented 3 years ago

For starters, it looks like the DrugBank ETL is connecting a bunch of nodes of category MolecularEntity to nodes of category ChemicalSubstance:

match (n:`biolink:MolecularEntity`)-[r:`biolink:same_as`]->(m:`biolink:ChemicalSubstance`) where head(r.provided_by) = 'identifiers_org_registry:drugbank' and n.category <> m.category return count(*)

That might be worth checking out, and if you feel it is fixable, maybe opening an issue on.

ecwood commented 3 years ago

In KG2.7.1:

match (n {provided_by: 'identifiers_org_registry:ncbigene'}) where n.full_name starts with "microRNA" return n.category, count(n)
n.category      count(n)
"biolink:Gene"  1915

This issue appears to be fixed (and probably has been fixed for a while).