RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

UniprotKB Conversion Fails #299

Status: Closed (ecwood closed this issue 1 year ago)

ecwood commented 1 year ago

While testing for https://github.com/RTXteam/RTX-KG2/issues/296, I found that uniprotkb_dat_to_json.py fails:

Have processed 1 million lines
  Number of records: 278
Have processed 2 million lines
  Number of records: 566
Have processed 3 million lines
  Number of records: 934
Have processed 4 million lines
  Number of records: 1138
Have processed 5 million lines
  Number of records: 1342
Have processed 6 million lines
  Number of records: 1456
Have processed 7 million lines
  Number of records: 1821
Have processed 8 million lines
  Number of records: 2328
Have processed 9 million lines
  Number of records: 2839
Have processed 10 million lines
  Number of records: 3269
Have processed 11 million lines
  Number of records: 3696
Have processed 12 million lines
  Number of records: 3934
Have processed 13 million lines
  Number of records: 4377
Have processed 14 million lines
  Number of records: 4483
Have processed 15 million lines
  Number of records: 4729
Have processed 16 million lines
  Number of records: 4977
Have processed 17 million lines
  Number of records: 5358
Have processed 18 million lines
  Number of records: 5680
Have processed 19 million lines
  Number of records: 6061
Have processed 20 million lines
  Number of records: 6533
Have processed 21 million lines
  Number of records: 6783
Have processed 22 million lines
  Number of records: 7153
Have processed 23 million lines
  Number of records: 7483
Have processed 24 million lines
  Number of records: 7688
Have processed 25 million lines
  Number of records: 8097
Have processed 26 million lines
  Number of records: 8403
Have processed 27 million lines
  Number of records: 8737
Have processed 28 million lines
  Number of records: 9028
Have processed 29 million lines
  Number of records: 9402
Have processed 30 million lines
  Number of records: 9657
Have processed 31 million lines
  Number of records: 10061
Have processed 32 million lines
  Number of records: 10569
Have processed 33 million lines
  Number of records: 10798
Have processed 34 million lines
  Number of records: 11123
Have processed 35 million lines
  Number of records: 11895
Have processed 36 million lines
  Number of records: 12078
Have processed 37 million lines
  Number of records: 12240
Have processed 38 million lines
  Number of records: 12508
Have processed 39 million lines
  Number of records: 12611
Have processed 40 million lines
  Number of records: 13110
Have processed 41 million lines
  Number of records: 13620
Have processed 42 million lines
  Number of records: 14285
Have processed 43 million lines
  Number of records: 14642
Have processed 44 million lines
  Number of records: 15263
Have processed 45 million lines
  Number of records: 15883
Have processed 46 million lines
  Number of records: 16572
Have processed 47 million lines
  Number of records: 16932
Have processed 48 million lines
  Number of records: 17491
Have processed 49 million lines
  Number of records: 17771
Have processed 50 million lines
  Number of records: 17928
Have processed 51 million lines
  Number of records: 17996
Have processed 52 million lines
  Number of records: 18313
Have processed 53 million lines
  Number of records: 18759
Have processed 54 million lines
  Number of records: 18870
Have processed 55 million lines
  Number of records: 18888
Have processed 56 million lines
  Number of records: 18926
Have processed 57 million lines
  Number of records: 19192
Have processed 58 million lines
  Number of records: 19275
Have processed 59 million lines
  Number of records: 19302
Have processed 60 million lines
  Number of records: 19403
Have processed 61 million lines
  Number of records: 19843
Have processed 62 million lines
  Number of records: 20131
Have processed 63 million lines
  Number of records: 20724
Have processed 64 million lines
  Number of records: 21188
Have processed 65 million lines
  Number of records: 21384
Have processed 66 million lines
  Number of records: 21430
Have processed 67 million lines
  Number of records: 21888
Have processed 68 million lines
  Number of records: 22219
Have processed 69 million lines
  Number of records: 22806
Have processed 70 million lines
  Number of records: 23046
Have processed 71 million lines
  Number of records: 23588
Have processed 72 million lines
  Number of records: 24123
Have processed 73 million lines
  Number of records: 25453
Have processed 74 million lines
  Number of records: 25653
Have processed 75 million lines
  Number of records: 25659
Have processed 76 million lines
  Number of records: 25725
Have processed 77 million lines
  Number of records: 26232
Traceback (most recent call last):
  File "uniprotkb_dat_to_json.py", line 428, in <module>
    edges_list = make_edges(uniprot_records, nodes_dict)
  File "uniprotkb_dat_to_json.py", line 200, in make_edges
    assert len(m) < 2
AssertionError

ecwood commented 1 year ago

Here's the specific edge causing the error, followed by the MIM IDs matched in its text:

Familial hyperinsulinemic hypoglycemia 6 (HHF6) [MIM:606762]: Familial hyperinsulinemic hypoglycemia [MIM:256450], also referred to as congenital hyperinsulinism, nesidioblastosis, or persistent hyperinsulinemic hypoglycemia of infancy (PPHI), is the most common cause of persistent hypoglycemia in infancy and is due to defective negative feedback regulation of insulin secretion by low glucose levels. In HHF6 elevated oxidation rate of glutamate to alpha-ketoglutarate stimulates insulin secretion in the pancreatic beta cells, while they impair detoxification of ammonium in the liver. {ECO:0000269|PubMed:10636977, ECO:0000269|PubMed:11214910, ECO:0000269|PubMed:11297618, ECO:0000269|PubMed:9571255}. Note=The disease is caused by variants affecting the gene represented in this entry. 
['606762', '256450']
Traceback (most recent call last):
  File "uniprotkb_dat_to_json.py", line 430, in <module>
    edges_list = make_edges(uniprot_records, nodes_dict)
  File "uniprotkb_dat_to_json.py", line 202, in make_edges
    assert len(m) < 2
AssertionError
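
The failure above can be reproduced in isolation. The following is a minimal sketch, assuming the script extracts MIM IDs with a regular expression along the lines of `\[MIM:(\d+)\]`; the actual pattern in `uniprotkb_dat_to_json.py` is not shown in this issue, and `MIM_PATTERN` and `find_mim_ids` are hypothetical names used only for illustration. The comment text is an abbreviated copy of the one printed above.

```python
import re

# Hypothetical pattern: the real regex used by uniprotkb_dat_to_json.py
# is not shown in this issue.
MIM_PATTERN = re.compile(r"\[MIM:(\d+)\]")

# Abbreviated version of the disease comment that triggers the failure.
comment = (
    "Familial hyperinsulinemic hypoglycemia 6 (HHF6) [MIM:606762]: "
    "Familial hyperinsulinemic hypoglycemia [MIM:256450], also referred to "
    "as congenital hyperinsulinism."
)


def find_mim_ids(comment_text):
    """Return every MIM ID mentioned in a disease comment, in order."""
    return MIM_PATTERN.findall(comment_text)


m = find_mim_ids(comment)
print(m)

try:
    # Same check as in the traceback: at most one MIM ID per comment.
    assert len(m) < 2
except AssertionError:
    print("AssertionError: comment mentions more than one MIM ID")
```

The description of HHF6 cites the parent disease's MIM ID in addition to its own, so any check that expects at most one ID per comment fails on this record.
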

ecwood commented 1 year ago

This is caused by a recent commit: https://github.com/RTXteam/RTX-KG2/commit/b8c0a6d55a94feb1b8d2504edfda3b2b0f5887ef#diff-1d05c94e474b23597d281ef07ac150674142311af456e68786dbd58b36a8c29eR197-R201

ecwood commented 1 year ago

With 000db40, the code is no longer failing. However, I did reach out to Steve to confirm that this fix is appropriate, given the specificity of the original commit.
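
This thread does not show what 000db40 changed. Purely as an illustration, one way to tolerate comments that mention several MIM IDs is to keep only the first one, which in the failing comment is the ID attached to the disease name itself. `MIM_PATTERN` and `primary_mim_id` are hypothetical names, and the regex is an assumption rather than the script's actual pattern.

```python
import re

# Hypothetical pattern; not taken from uniprotkb_dat_to_json.py.
MIM_PATTERN = re.compile(r"\[MIM:(\d+)\]")


def primary_mim_id(comment_text):
    """Return the first MIM ID in the comment, or None if there is none."""
    match = MIM_PATTERN.search(comment_text)
    return match.group(1) if match else None


comment = (
    "Familial hyperinsulinemic hypoglycemia 6 (HHF6) [MIM:606762]: "
    "Familial hyperinsulinemic hypoglycemia [MIM:256450], also referred to "
    "as congenital hyperinsulinism."
)

print(primary_mim_id(comment))
```

Whether taking the first ID is appropriate depends on how the original commit intended these comments to be interpreted, which is the question raised with Steve above.
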

ecwood commented 1 year ago

This is directly related to #279, for tracking purposes.

ecwood commented 1 year ago

I am closing this issue because the code worked in KG2.8.4pre's build.