RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

DisGeNet edge with empty PMID #325

Closed saramsey closed 1 year ago

saramsey commented 1 year ago

In RTX-KG2.8.0.1c, I am seeing the following edge, which I think is not valid due to the "empty" CURIE for the PMID. The following Cypher query (on kg2canonicalized2.rtx.ai):

match (n {id: 'NCBIGene:3248'})-[r]->(m {id: 'MONDO:0006590'}) return r;

produces:

{
  "predicate": "biolink:gene_associated_with_condition",
  "primary_knowledge_source": "infores:disgenet",
  "publications_info": "{}",
  "kg2_ids": [
    "NCBIGene:3248---biolink:gene_associated_with_condition---None---None---None---UMLS:C4551675---DisGeNET:"
  ],
  "subject": "NCBIGene:3248",
  "id": "48222573",
  "object": "MONDO:0006590",
  "publications": [
    "PMID:"
  ]
}

I don't think a bare PMID: (a prefix with nothing after the colon) is permitted. This issue was first brought to my attention by the Translator Feedback group (https://github.com/NCATSTranslator/Feedback/issues/414).
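One way to flag edges like this is to check that every entry in publications has something after the prefix. The sketch below is only an illustration under stated assumptions: the helper name has_local_id is hypothetical (it is not part of the KG2 build code), and it only tests for a non-empty prefix and a non-empty local identifier around the first colon, not full CURIE syntax.

```python
def has_local_id(curie: str) -> bool:
    """Return True if there is a non-empty prefix and a non-empty
    local identifier on either side of the first colon."""
    prefix, sep, local_id = curie.partition(":")
    return bool(sep) and bool(prefix.strip()) and bool(local_id.strip())


# Edge fields copied from the query result above.
edge = {
    "subject": "NCBIGene:3248",
    "object": "MONDO:0006590",
    "publications": ["PMID:"],
}

# Collect every publication entry that is missing its local identifier.
bad_publications = [p for p in edge["publications"] if not has_local_id(p)]
print(bad_publications)
```

For the edge above, the only entry in publications is reported as bad, while subject and object would pass the same check.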

saramsey commented 1 year ago

Confirmed, this same problem is showing up in KG2.8.3c (confirmed on kg2canonicalized.rtx.ai).

saramsey commented 1 year ago

I wonder if we just need to add an if pmid != '' block here:

https://github.com/RTXteam/RTX-KG2/blob/cb0fca6056a335ba97c513ce356889cda7039757/disgenet_tsv_to_kg_json.py#L84
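A guard along those lines might look like the sketch below. It is not the code at the linked line: the function name make_publications, the PMID_PREFIX constant, and the pmid argument are hypothetical stand-ins for however the script reads the PMID column of the DisGeNET TSV, and the PMID used in the usage lines is a made-up value.

```python
from typing import List, Optional

# Assumed CURIE prefix, matching the "PMID:" seen on the edge above.
PMID_PREFIX = "PMID"


def make_publications(pmid: Optional[str]) -> List[str]:
    """Build the publications list for one row, skipping an empty PMID."""
    pmid = (pmid or "").strip()
    if pmid != "":
        return [f"{PMID_PREFIX}:{pmid}"]
    # No PMID in the row: emit no publication rather than a bare "PMID:".
    return []


# Usage with illustrative values (12345 is not a PMID taken from DisGeNET).
print(make_publications("12345"))
print(make_publications(""))
```

Returning an empty list, rather than a list holding an empty CURIE, keeps the edge's publications field consistent for rows that have no PMID.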

saramsey commented 1 year ago

Another problematic edge (reported by Kaiwen He from Team Unsecret Agent):

NCBIGene:79092---biolink:gene_associated_with_condition---None---None---None---UMLS:C2930842---DisGeNET:    False   UMLS:C2930842   biolink:gene_associated_with_condition  gene_associated_with_condition  infores:disgenet    PMID:   {}              gene_associated_with_condition  biolink:gene_associated_with_condition  NCBIGene:79092  2018    biolink:gene_associated_with_condition  NCBIGene:79092  UMLS:C2930842

ecwood commented 1 year ago

It looks like this issue is fixed in KG2.8.4pre (the edge below no longer carries a publications list with the empty PMID:), so I am going to close out this issue:

match (n)-[e]-(m) where e.id="NCBIGene:79092---biolink:gene_associated_with_condition---None---None---None---UMLS:C2930842---DisGeNET:" return n, e, m limit 1

{
  "predicate": "biolink:gene_associated_with_condition",
  "domain_range_exclusion": "False",
  "negated": "False",
  "primary_knowledge_source": "infores:disgenet",
  "relation_label": "gene_associated_with_condition",
  "publications_info": "{}",
  "subject": "NCBIGene:79092",
  "source_predicate": "biolink:gene_associated_with_condition",
  "predicate_label": "gene_associated_with_condition",
  "id": "NCBIGene:79092---biolink:gene_associated_with_condition---None---None---None---UMLS:C2930842---DisGeNET:",
  "update_date": "2018",
  "object": "UMLS:C2930842"
}