RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

KGNodeIndex.get_equivalent_curies() should function symmetrically #825

Closed amykglen closed 4 years ago

amykglen commented 4 years ago

I noticed some interesting behavior when using KGNodeIndex.get_equivalent_curies()... specifically regarding the returned synonyms for these two curies:

kgni.get_equivalent_curies(curie='UniProtKB:Q13330', kg_name='KG1') --> ['UniProtKB:Q13330']

kgni.get_equivalent_curies(curie='UniProtKB:Q9BRL8', kg_name='KG1') --> ['UniProtKB:Q9BRL8', 'UniProtKB:Q13330']

I'm wondering why UniProtKB:Q9BRL8's synonyms include UniProtKB:Q13330, but UniProtKB:Q13330's synonyms don't include UniProtKB:Q9BRL8?

I was operating under the assumption that synonyms should be symmetrical in this sense... but maybe that's not correct?
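A minimal sketch of how this symmetry expectation could be turned into an automated check. The helper `find_asymmetric_pairs` and the `lookup` stand-in are hypothetical names introduced for illustration; they are not part of KGNodeIndex. The stand-in simply replays the two results reported above.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def find_asymmetric_pairs(
    curies: Iterable[str],
    get_equivalents: Callable[[str], List[str]],
) -> List[Tuple[str, str]]:
    """Return (a, b) pairs where b is a synonym of a, but a is not a synonym of b."""
    asymmetric = []
    for curie in curies:
        for other in get_equivalents(curie):
            if other != curie and curie not in get_equivalents(other):
                asymmetric.append((curie, other))
    return asymmetric


# Hypothetical stand-in for the index, holding only the two results quoted above.
reported: Dict[str, List[str]] = {
    "UniProtKB:Q13330": ["UniProtKB:Q13330"],
    "UniProtKB:Q9BRL8": ["UniProtKB:Q9BRL8", "UniProtKB:Q13330"],
}


def lookup(curie: str) -> List[str]:
    # Unknown curies are treated as having only themselves as a synonym.
    return reported.get(curie, [curie])


pairs = find_asymmetric_pairs(reported, lookup)
print(pairs)
```

In a real test, `lookup` would presumably wrap the actual synonym call, and an empty result list would be the passing condition.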

edeutsch commented 4 years ago

Apparently not. But you're right that it should be. I'm in the process of tearing apart KGNodeIndex and trying to include the SRI NodeNormalizer. It's rather messy. Hopefully I'll have a better system put together soon. I'll use this as a check, thanks.

amykglen commented 4 years ago

Found another instance of this: DOID:9281's synonyms include OMIM:261600, but OMIM:261600's synonyms do not include DOID:9281:

Using KGNodeIndex.get_equivalent_curies() with kg_name='KG2':

DOID:9281's synonyms are:

['DOID:9281', 'OMIM:261600', 'NCI_NICHD:C81315', 'CUI:C0031485', 'MEDDRA:10034872', 'NCIT:C81315', 'MEDCIN:32390', 'LNC:LP56980-3', 'NCI_NCI-GLOSS:CDR0000446806', 'MONDO:0009861', 'Orphanet:716', 'REACT:R-HSA-2160456', 'LNC:LA21169-0', 'MEDLINEPLUS:1231']

OMIM:261600's synonyms are:

['OMIM:261600', 'NCI_NCI-GLOSS:CDR0000458041', 'MEDDRA:10035118']

(Noticing such discrepancies because they make deduplication difficult in expand #823)

edeutsch commented 4 years ago

Apologies for the slowness on my part on this. The existing code has a design flaw in this respect. I have essentially rewritten the KGNodeIndex since its role has now ballooned far beyond its original scope. It has now become a credible node synonymizer. It provides an interface to the SRI node normalizer but then also provides its own, which I think will be far more comprehensive and useful in our reasoning. I'm pretty happy with how it builds the index for KG1 now. Building of the index for KG2 is in progress. But since it tries to include information from the SRI node normalizer via web service calls, it is extremely slow the first time. I hope I'll have something for you to start using on Monday. But see also #861 for design questions.

amykglen commented 4 years ago

No worries, sounds good! (The format/contents of the example output you posted in #861 looks awesome by the way - think that will work really nicely for expand's needs.)

edeutsch commented 4 years ago

The build process is still chugging away, sadly.. still only 45% through the file. Working through all the NCIT terms now.. This will likely take several more days, but it is progressing..

amykglen commented 4 years ago

found another instance of this that could serve as a test case (thanks to @chunyuma):

equivalent curies for UniProtKB:P06724 are: ['UniProtKB:P06724']
equivalent curies for UniProtKB:P30518 are: ['UniProtKB:P06724', 'HGNC:897', 'Orphanet:118947', 'UniProtKB:P30518', 'PR:P30518', 'NCI_NCI-HGNC:HGNC%3A897', 'OMIM:300538', 'NCBIGene:554', 'CUI:C1332124']

edeutsch commented 4 years ago

@amykglen I have checked in the new NodeSynonymizer. I think the build process is pretty robust. There are a few tweaks I still want to make, but it may be ready for your testing. I have not made the time to fully update the user interface yet, so that is spotty. I will try to fix it soon. Let me know which methods I should prioritize. The one new method get_normalizer_results() is fully functional and essentially dumps a full listing of all information it knows for a concept. You could use that and just pull out what you want. Or if you want to use one of the simpler methods, let me know which ones I should fix up first. The NodeSynonymizer is in a new place under ARAX. The old KGNodeIndex will be phased out and retired, but will still be available for a while.

how to build: git pull (master)

If your NodeNameDescriptions files are not already up to date, you should first do:

  cd $RTX/data/KGmetadata
  python3 dumpdata.py

  cd $RTX/code/ARAX/NodeSynonymizer
  python3 sri_node_normalizer.py --build
  python3 node_synonymizer.py --build --kg_name=both
  python3 node_synonymizer.py --lookup=rickets --kg_name=KG1

One possible snag is that the build process needs 20GB of free RAM to work. If that's not an option, then you could probably just copy the SQLite database from: /mnt/data/orangeboard/devED/RTX/code/ARAX/NodeSynonymizer/node_synonymizer.sqlite ?

Let me know how it goes. It's not really finished as I said, but hopefully functional enough to start your testing?

amykglen commented 4 years ago

awesome! yep, was able to get going and begin trying it out/integrating it. (opted to download the database.)

there seem to be quite a few curies that get_normalizer_results(curie, kg_name="KG2") errors out for:

  - 2020-07-03 17:47:06.619865 ERROR: Problem using NodeNormalizer. Input curie was CHEMBL.COMPOUND:CHEMBL564829: Traceback (most recent call last):
  File "/Users/aglen/translator/RTX_new/RTX/code/ARAX/test/../ARAXQuery/Expand/expand_utilities.py", line 126, in get_preferred_curie
    normalizer_results = node_synonymizer.get_normalizer_results(curie, kg_name="KG2")
  File "/Users/aglen/translator/RTX_new/RTX/code/ARAX/ARAXQuery/Expand/../../NodeSynonymizer/node_synonymizer.py", line 1222, in get_normalizer_results
    types[rows[3]] = 1
IndexError: list index out of range

some more examples of input curies it throws this error for are: CUI:C0078939, HP:0008936, OMIM:120080, PR:000003507, OMIM:610269...
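The thread does not say what the eventual typo was, so the following is only an illustration of one common way a line like `types[rows[3]] = 1` can raise this IndexError: indexing the whole list of rows where the current row was intended. The row contents below are made up for the example, and both function names are hypothetical.

```python
# Hypothetical query result: two rows of four columns each (values are invented).
rows = [
    ("CHEMBL.COMPOUND:CHEMBL564829", "name-a", "KG2", "chemical_substance"),
    ("CHEMBL.COMPOUND:CHEMBL564829", "name-b", "KG2", "chemical_substance"),
]


def collect_types_buggy(rows):
    types = {}
    for row in rows:
        # Indexes the list of rows, not the current row; fails when len(rows) < 4.
        types[rows[3]] = 1
    return types


def collect_types_fixed(rows):
    types = {}
    for row in rows:
        # Fourth column of the current row.
        types[row[3]] = 1
    return types


try:
    collect_types_buggy(rows)
except IndexError as error:
    print("buggy version raised IndexError:", error)

print(collect_types_fixed(rows))
```

Under this assumption the error would appear only for concepts with fewer than four matching rows, which would be consistent with it affecting some curies and not others.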

edeutsch commented 4 years ago

fixed typo and pushed to master. Should be fixed now. Let me know if you still have problems

amykglen commented 4 years ago

nice, thanks - things seem to be running pretty well now - all expand pytests are passing, and the level of synonymization/deduplication appears to be much greater than with the prior KGNodeIndex method.

there are a few curies that get_normalizer_results(curie, kg_name="KG2") doesn't find results for... which isn't a breaking issue - I have expand log a warning in such cases and revert to using the original curie (so some synonymization/deduplication may be missed). but here are the ones I'm aware of so far:

{'OBO:SCTID_21061004', 'OBO:GARD_0005878', 'CUI:C3826682', 'OBO:COHD_439730', 'CUI:C3826681', 'OBO:ICD10_B60.0', 'OBO:ICD9_088.82', 'CUI:C3826680'}

all of these are curies in KG2. and actually, it makes sense the synonymizer can't find results for them, because it looks like they don't have names:

[screenshot: Screen Shot 2020-07-04 at 10 35 08 AM]

so I guess that isn't really something that can be fixed in the synonymizer.
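The warn-and-fall-back behavior described in this comment might look roughly like the sketch below. It is not expand's actual code: `preferred_or_original`, the `lookup` callable, the `preferred_curie` key, and the `EXAMPLE:` curies are all assumptions made for illustration.

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger("expand_sketch")


def preferred_or_original(
    curie: str,
    lookup: Callable[[str], Optional[Dict[str, str]]],
) -> str:
    """Return the preferred curie if the lookup finds one, else the input curie."""
    result = lookup(curie)
    if not result:
        # Not fatal: some deduplication may be missed for this curie.
        logger.warning("No synonymizer results for %s; keeping the original curie", curie)
        return curie
    return result["preferred_curie"]


# Hypothetical stand-in for the synonymizer, with one made-up entry.
known = {"EXAMPLE:1": {"preferred_curie": "EXAMPLE:0"}}


def stub_lookup(curie: str) -> Optional[Dict[str, str]]:
    return known.get(curie)


print(preferred_or_original("EXAMPLE:1", stub_lookup))
print(preferred_or_original("CUI:C3826682", stub_lookup))
```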

amykglen commented 4 years ago

as far as the interface goes, it's perfectly fine as is - I'm able to grab what I need. but here's what I think would be ideal for the synonymization half of the problem:

amykglen commented 4 years ago

and fyi, @edeutsch - the running times I'm seeing for get_normalizer_results() are not super fast, which is slowing expand down a little bit, due to its reliance on the method. I'm sending curies in batches as that definitely is faster.

here are some average times I'm seeing for get_normalizer_results():

batch size (# of curies)    average time (seconds)
2070                        7.2
1119                        6.5
995                         8.2
722                         4.0
252                         1.6

(these are with kg_name="KG2".)
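For a rough sense of scale, the averages above can be converted to a per-curie cost. This is plain arithmetic on the reported numbers, nothing more; it works out to roughly 3.5 to 8.2 ms per curie.

```python
# (batch size, average seconds) pairs as reported above.
timings = [(2070, 7.2), (1119, 6.5), (995, 8.2), (722, 4.0), (252, 1.6)]

# Average milliseconds spent per curie in each batch.
per_curie_ms = [(size, 1000.0 * seconds / size) for size, seconds in timings]

for size, ms in per_curie_ms:
    print(f"{size:>5} curies: {ms:.2f} ms per curie")
```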

(expand calls get_normalizer_results() about 3 times per edge, with varying batch size.)
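A minimal sketch of the batching approach mentioned above, assuming a lookup function that accepts a list of curies and returns a dict keyed by curie. The names `batched`, `lookup_in_batches`, and `fake_lookup`, and the default batch size, are hypothetical and not taken from expand.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def batched(curies: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size curies."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(curies), batch_size):
        yield list(curies[start:start + batch_size])


def lookup_in_batches(
    curies: Sequence[str],
    lookup_batch: Callable[[List[str]], Dict[str, object]],
    batch_size: int = 1000,
) -> Dict[str, object]:
    """Call lookup_batch once per batch and merge the per-curie results."""
    results: Dict[str, object] = {}
    for batch in batched(curies, batch_size):
        results.update(lookup_batch(batch))
    return results


# Hypothetical stand-in for a batch-capable synonymizer call.
def fake_lookup(batch: List[str]) -> Dict[str, object]:
    return {curie: [curie] for curie in batch}


merged = lookup_in_batches(["a", "b", "c", "d", "e"], fake_lookup, batch_size=2)
print(merged)
```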

edeutsch commented 4 years ago

I'm certain I can make it faster. This method returns a bunch of things doing several queries all in a loop. What's the minimum information you want for this? a list of equivalent nodes? or? I can also make this method faster, but since it does several things, that will be harder than focusing on just what you want back.

edeutsch commented 4 years ago

oh oops, you already answered that question two posts up. Got it.

amykglen commented 4 years ago

cool - yeah, I really only have two rather specific use cases (used at separate points in expand):

  1. get a list/set of equivalent curies for an input curie (which I described above)
  2. get the 'preferred curie' and 'preferred name' for an input curie

so it'd be awesome if there were a method for each of those.
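To make the two requested lookups concrete, here is a toy in-memory sketch, not the NodeSynonymizer's interface. Every curie belongs to exactly one group, so equivalence is symmetric by construction. The class name, method names, the "first member is preferred" rule, and the placeholder name are all assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple


class TinySynonymIndex:
    """Toy index: each curie maps to one group; lookups read the whole group."""

    def __init__(self) -> None:
        self._group_of: Dict[str, int] = {}
        self._members: Dict[int, List[str]] = {}
        self._names: Dict[str, str] = {}

    def add_group(self, named_curies: List[Tuple[str, str]]) -> None:
        # Sketch only: assumes none of these curies is already in another group.
        group_id = len(self._members)
        self._members[group_id] = [curie for curie, _ in named_curies]
        for curie, name in named_curies:
            self._group_of[curie] = group_id
            self._names[curie] = name

    def equivalent_curies(self, curie: str) -> List[str]:
        """Use case 1: all curies in the same group as the input."""
        group_id = self._group_of.get(curie)
        return list(self._members[group_id]) if group_id is not None else []

    def preferred(self, curie: str) -> Optional[Tuple[str, str]]:
        """Use case 2: (preferred curie, preferred name); here, the first member."""
        group_id = self._group_of.get(curie)
        if group_id is None:
            return None
        first = self._members[group_id][0]
        return first, self._names[first]


index = TinySynonymIndex()
index.add_group([
    ("UniProtKB:Q13330", "placeholder name"),
    ("UniProtKB:Q9BRL8", "placeholder name"),
])
print(index.equivalent_curies("UniProtKB:Q13330"))
print(index.equivalent_curies("UniProtKB:Q9BRL8"))
print(index.preferred("UniProtKB:Q9BRL8"))
```

Because both lookups go through the group rather than through per-curie synonym lists, the asymmetry that opened this issue cannot occur in this design.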

edeutsch commented 4 years ago

I just pushed a new version of get_equivalent_curies() to master. Test with: python node_synonymizer.py --get=DOID:384,DOID:13636,DOID:9281,DOID:99x --kg_name=KG2

It doesn't have everything you wanted. And the output format is different. Feel free to suggest changes that would make it better for you.

edeutsch commented 4 years ago

Regarding this somewhat older post:

there are a few curies that get_normalizer_results(curie, kg_name="KG2") doesn't find results for... which isn't a breaking issue - I have expand log a warning in such cases and revert to using the original curie (so some synonymization/deduplication may be missed). but here are the ones I'm aware of so far:

{'OBO:SCTID_21061004', 'OBO:GARD_0005878', 'CUI:C3826682', 'OBO:COHD_439730', 'CUI:C3826681', 'OBO:ICD10_B60.0', 'OBO:ICD9_088.82', 'CUI:C3826680'}

all of these are curies in KG2. and actually, it makes sense the synonymizer can't find results for them, because it looks like they don't have names:

[screenshot: Screen Shot 2020-07-04 at 10 35 08 AM]

so I guess that isn't really something that can be fixed in the synonymizer.

I think the problem is different. The NodeSynonymizer does not find them because they are not in my view of KG2:

grep -i 'CUI:C3826682' NodeNamesDescriptions_KG2.tsv
grep -i 'OBO:COHD_439730' NodeNamesDescriptions_KG2.tsv
grep -i 'CUI:C3826680' NodeNamesDescriptions_KG2.tsv

all return nothing. As far as I can tell, these concepts are not in KG2 (at least they are not in the dump of some KG2 as produced by $RTX/node/data/KGmetadata/dumpdata.py).

It seems there are multiple versions of KG2. Do we have some kind of versioning system for KG2 that we might leverage?

amykglen commented 4 years ago

I don't think this is a KG2 version issue - my NodeNamesDescriptions_KG2.tsv also doesn't contain these curies, but I just rebuilt them a couple days ago (and dumpdata.py pulls directly from the 'production' kg2endpoint2).

it looks to me like dumpdata.py doesn't add nodes to NodeNamesDescriptions_KG2.tsv if they're missing a 'name': https://github.com/RTXteam/RTX/blob/4ea641cdff3d7cbeaf58ea476d6f7f564ccaf63e/data/KGmetadata/dumpdata.py#L36

edeutsch commented 4 years ago

aha, okay, good point. I can fix dumpdata.py. And I can fix NodeSynonymizer to act reasonably if there is an empty name. Included in that is to make sure that all those nodes with no name don't form a joint list of synonyms!

But it won't be able to do much sensible with them. It could look at SRI, and it's possible that there might be something there, but that seems doubtful. Otherwise, since there's no name, there's no way to cluster them with anything else, so they would be orphans anyway. The only thing that will change is that they will be acknowledged to be a node, but not in any group.

I can do that, but it doesn't seem too useful? Is there a fix at the KG2 level that could/should happen first?
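A small sketch of the concern about nameless nodes forming a joint list of synonyms. Clustering purely by lowercased name is an assumption made for this example (the real NodeSynonymizer also draws on the SRI node normalizer), and `group_by_name` and the `EX:` curies are hypothetical. The point is that an empty name must never be used as a grouping key.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


def group_by_name(nodes: Sequence[Tuple[str, Optional[str]]]) -> List[List[str]]:
    """Cluster curies that share a name; nameless curies stay as singletons."""
    by_name: Dict[str, List[str]] = defaultdict(list)
    singletons: List[List[str]] = []
    for curie, name in nodes:
        key = (name or "").strip().lower()
        if key:
            by_name[key].append(curie)
        else:
            # No name: acknowledge the node, but do not merge it with other nameless nodes.
            singletons.append([curie])
    return list(by_name.values()) + singletons


groups = group_by_name([
    ("EX:1", "Rickets"),
    ("EX:2", "rickets"),
    ("CUI:C3826680", ""),
    ("CUI:C3826681", None),
])
print(groups)
```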

amykglen commented 4 years ago

yeah, not sure how useful that would be. it looks like about 2% of KG2 nodes (170k) are lacking a name at the moment. I'm not sure how much of that is a bug though (i.e., sometimes concepts are actually nameless, I believe?)

[screenshot: Screen Shot 2020-07-06 at 6 09 28 PM]

edeutsch commented 4 years ago

I've heard of a horse with no name, and places where the streets have no name. But I don't think chemical_substances with no name will help us. And unknown_categories with no name seem even less useful. Seems like a bug to me. I think I will not make any changes to dumpdata.py until we're more confident that these will be useful to index.

amykglen commented 4 years ago

sounds good to me!

tried out the new get_equivalent_curies() by the way, and indeed it's much faster! thank you.

there is one test it's giving me an error for - here's the list of curies I give it:

['SNOMEDCT:23450008', 'CUI:C1532045', 'LNC:31245-4', 'CUI:C0013722', 'HP:0000980', 'CUI:C0034414', 'HP:0001399', 'LNC:21089-8', 'LNC:10647-6', 'CUI:C0085329', 'SNOMEDCT:4284001', 'CUI:C0325182', 'CUI:C0323464', 'CUI:C0004358', 'CUI:C0033684', 'NCIT:C34953', 'HP:0003365', 'CUI:C0030499', 'CUI:C0241407', 'CUI:C0450442', 'CUI:C0001314', 'CUI:C0128897', 'CUI:C0033740', 'LNC:82747-7', 'CUI:C0747256', 'CUI:C1334043', 'CUI:C0683954', 'CUI:C0035448', 'CUI:C0013467', 'CUI:C1519885', 'HP:0000789', 'CUI:C0003063', 'CUI:C0325185', 'CUI:C1123019', 'CUI:C0001779', 'CUI:C1449559', 'CUI:C0021088', 'CUI:C0696113', 'CUI:C1293116', 'CUI:C0392895', 'CUI:C0850715', 'MESH:D011529', 'CUI:C0030362', 'LNC:16118-2', 'CUI:C0040808', 'CUI:C0337527', 'HP:0002829', 'CUI:C0007753', 'CUI:C0013162', 'HP:0001000', 'CUI:C0206160', 'CUI:C0011065', 'CUI:C0003811', 'CUI:C0702091', 'SNOMEDCT:455000', 'CUI:C0424786', 'CUI:C1261322', 'CUI:C1254373', 'CUI:C1136169', 'CUI:C0027769', 'CUI:C0206253', 'NCIT:C66830', 'CUI:C0324939', 'CUI:C0162318', 'CUI:C0012854', 'CUI:C0024530', 'CUI:C0312452', 'SNOMEDCT:30408003', 'CUI:C0038002', 'CUI:C0030863', 'CUI:C0014412', 'CUI:C0024228', 'CUI:C0008051', 'CUI:C0325628', 'MEDLINEPLUS:1645', 'SNOMEDCT:415981009', 'CUI:C0018935', 'CUI:C0013218', 'HP:0001974', 'CUI:C0237798', 'SNOMEDCT:66818003', 'CUI:C0029237', 'CUI:C1136254', 'LNC:43893-7', 'CUI:C3826680', 'CUI:C0002797', 'CUI:C0325087', 'CUI:C1510418', 'CUI:C0002880', 'CUI:C0457437', 'LNC:47396-7', 'CUI:C0028737', 'HP:0001635', 'NCIT:C128453', 'NCBIGene:959', 'CUI:C0052430', 'HP:0001324', 'CUI:C0445623', 'LNC:6311-5', 'HP:0001903', 'LNC:10648-4', 'CUI:C0039005', 'HP:0002910', 'CUI:C0006104', 'CUI:C0334901', 'CUI:C0027121', 'MEDCIN:278147', 'CUI:C0032149', 'CUI:C0029122', 'CUI:C0002712', 'CUI:C0031809', 'OBO:ICD9_088.82', 'CUI:C1947990', 'CUI:C0015967', 'CUI:C1510438', 'LNC:22853-6', 'CUI:C0026022', 'CUI:C0185125', 'SNOMEDCT:21852000', 'HP:0001289', 'HP:0003573', 'CUI:C1511790', 'CUI:C0034422', 'HP:0002721', 
'CUI:C0019048', 'HP:0002039', 'CUI:C0691786', 'LNC:16117-4', 'HP:0001972', 'HP:0001941', 'CUI:C0003338', 'CUI:C0026336', 'CUI:C0070533', 'CUI:C0034500', 'CUI:C0087111', 'CUI:C1265549', 'CUI:C0021740', 'CUI:C0016286', 'LNC:22858-5', 'LNC:23665-3', 'SNOMEDCT:418101009', 'CUI:C0392920', 'SNOMEDCT:50274008', 'CUI:C0037995', 'CUI:C0324296', 'CUI:C0079603', 'CUI:C0879626', 'CUI:C0060495', 'LNC:23662-0', 'HP:0000952', 'CUI:C0085328', 'CUI:C0020928', 'CUI:C0006352', 'CUI:C0058282', 'NCBITaxon:5865', 'CUI:C0005779', 'CUI:C0033147', 'CUI:C0036690', 'CUI:C0040165', 'LNC:7813-9', 'CUI:C0001675', 'CUI:C0260095', 'CUI:C0333547', 'LNC:43086-8', 'CUI:C0598741', 'CUI:C0013018', 'CUI:C0061928', 'CUI:C0849679', 'CUI:C0023416', 'CUI:C2345908', 'HP:0001923', 'OBO:GARD_0005878', 'CUI:C0318342', 'HP:0000969', 'NCIT:C61410', 'CUI:C0743841', 'CUI:C0012984', 'CUI:C0999544', 'HP:0002527', 'CUI:C0003062', 'CUI:C0085327', 'CUI:C0007452', 'SNOMEDCT:38391004', 'CUI:C0313107', 'CUI:C0325005', 'CUI:C0321644', 'CUI:C0085393', 'CUI:C0856169', 'CUI:C0552665', 'HP:0003073', 'CUI:C0599755', 'CUI:C0005800', 'CUI:C0233481', 'HP:0001882', 'CUI:C0162700', 'CUI:C0040203', 'ICD10:B60', 'HP:0000975', 'CUI:C0231189', 'CUI:C0003898', 'CUI:C0486382', 'CUI:C0320813', 'CUI:C0030842', 'HP:0001944', 'CUI:C1295927', 'LNC:82748-5', 'CUI:C0949216', 'CUI:C0276852', 'CUI:C1167395', 'CUI:C0022346', 'CUI:C0001655', 'LNC:16427-7', 'OBO:SCTID_21061004', 'SNOMEDCT:17101005', 'CUI:C0024141', 'CUI:C0318329', 'CUI:C0023882', 'CUI:C0325319', 'LNC:43918-2', 'CUI:C0275524', 'CUI:C0024544', 'CUI:C0946608', 'CUI:C0562691', 'CUI:C0383327', 'CUI:C0002871', 'CUI:C0031831', 'CUI:C0040669', 'SNOMEDCT:17800008', 'LNC:43926-5', 'CUI:C0339510', 'CUI:C0323438', 'SNOMEDCT:22405002', 'LNC:67866-4', 'CUI:C0486383', 'CUI:C0392318', 'CUI:C1540912', 'HP:0001433', 'CUI:C0872054', 'CUI:C0026766', 'CUI:C0021294', 'LNC:88451-0', 'CUI:C0009429', 'CUI:C0325003', 'LNC:43087-6', 'HP:0100598', 'CUI:C0004366', 'CUI:C0325253', 'CUI:C0282509', 'CUI:C0013090', 
'CUI:C0483368', 'CUI:C0142025', 'LNC:31246-2', 'CUI:C0368726', 'CUI:C0041213', 'SNOMEDCT:26114002', 'HP:0002157', 'CUI:C0282647', 'CUI:C0027361', 'HP:0000718', 'HP:0000099', 'CUI:C0040034', 'CUI:C0013798', 'CUI:C0007634', 'HP:0002013', 'HP:0100724', 'HP:0001895', 'MEDCIN:278146', 'CUI:C0150270', 'CUI:C0376387', 'CUI:C0037813', 'CUI:C0456388', 'CUI:C0085326', 'CUI:C0030660', 'CUI:C0020649', 'CUI:C0012222', 'CUI:C0036743', 'CUI:C3826682', 'LNC:22844-5', 'CUI:C0323499', 'CUI:C0010240', 'LNC:88452-8', 'CUI:C0323465', 'CUI:C0036945', 'CUI:C0013216', 'CUI:C0027567', 'HP:0002383', 'CUI:C0037998', 'CUI:C0035950', 'LNC:22857-7', 'CUI:C0679646', 'SNOMEDCT:43574002', 'CUI:C0684073', 'CUI:C0324818', 'CUI:C0021270', 'Orphanet:108', 'CUI:C0027061', 'CUI:C1510458', 'CUI:C0939219', 'HP:0002719', 'CUI:C1289877', 'MESH:D016792', 'CUI:C0079186', 'CUI:C0162699', 'CUI:C0277564', 'HP:0001876', 'CUI:C0024660', 'CUI:C0231224', 'CUI:C0314622', 'CUI:C0029039', 'LNC:22107-7', 'SNOMEDCT:105652001', 'LNC:47071-6', 'HP:0001824', 'CUI:C0320818', 'CUI:C0035899', 'HP:0001875', 'HP:0001744', 'CUI:C0123759', 'HP:0002315', 'CUI:C0368720', 'LNC:89342-0', 'CUI:C3826681', 'CUI:C1313951', 'CUI:C0026809', 'HP:0001376', 'CUI:C0003320', 'CUI:C0003416', 'CUI:C1332714', 'CUI:C0021742', 'CUI:C0011133', 'CUI:C0011596', 'CUI:C0558024', 'CUI:C0019054', 'CUI:C0030498', 'CUI:C0013227', 'NCIT:C35803', 'CUI:C0035222', 'CUI:C0878544', 'CUI:C0033739', 'CUI:C0320810', 'CUI:C0035078', 'LNC:9585-1', 'CUI:C0238644', 'CUI:C0023418', 'CUI:C0008996', 'CUI:C0034417', 'HP:0001254', 'CUI:C0020971', 'HP:0001658', 'CUI:C0026447', 'HP:0100608', 'LNC:24408-7', 'CUI:C0324145', 'CUI:C2733204', 'HP:0000613', 'CUI:C0003232', 'CUI:C0018270', 'CUI:C0497093', 'LNC:41414-4', 'CUI:C0313532', 'CUI:C0521829', 'CUI:C0324323', 'HP:0002017', 'CUI:C0868945', 'SNOMEDCT:415983007', 'CUI:C0301872', 'CUI:C0242723', 'CUI:C0324996', 'CUI:C0999244', 'LNC:41415-1', 'LNC:27965-3', 'HP:0001259', 'LNC:LP14081-1', 'CUI:C0301838', 'CUI:C0030312', 
'SNOMEDCT:24620004', 'CUI:C0012940', 'CUI:C0562690', 'LNC:88700-0', 'CUI:C0019944', 'CUI:C0746336', 'CUI:C0165603', 'CUI:C0039082', 'LNC:22846-0', 'LNC:20689-6', 'CUI:C0023364', 'HP:0002615', 'CUI:C0018019', 'CUI:C0025914', 'LNC:7812-1', 'CUI:C0199960', 'CUI:C0324376', 'CUI:C1532044', 'HP:0004936', 'LNC:22847-8', 'LNC:5054-2', 'HP:0003326', 'CUI:C0011946', 'LNC:22108-5', 'CUI:C0001792', 'CUI:C0320811', 'CUI:C0085316', 'SNOMEDCT:442614005', 'CUI:C0015236', 'CUI:C0008947', 'MONDO:0002428', 'CUI:C0018557', 'CUI:C1457887', 'NCIT:C85491', 'CUI:C1504080', 'CUI:C0242966', 'HP:0000083', 'CUI:C0221460', 'CUI:C0003392', 'NCIT:C90259', 'CUI:C0026976', 'HP:0001945', 'MESH:D041001', 'HP:0003259', 'CUI:C0626053', 'CUI:C0026018', 'CUI:C0325312', 'CUI:C0080332', 'NCIT:C77916', 'CUI:C0325273', 'CUI:C0005767', 'CUI:C0242606', 'CUI:C0449411', 'CUI:C0034865', 'CUI:C0036983', 'LNC:23666-1', 'HP:0001919', 'CUI:C0019116', 'CUI:C0008269', 'MEDCIN:90190', 'CUI:C1532042', 'HP:0005521', 'CUI:C0027362', 'CUI:C1263440', 'CUI:C0320812', 'HP:0003641', 'SNOMEDCT:415980005', 'CUI:C0656383', 'CUI:C0948192', 'CUI:C0272126', 'CUI:C0033741', 'CUI:C0368725', 'LNC:67867-2', 'HP:0012735', 'CUI:C0323512', 'CUI:C0200931', 'CUI:C0585165', 'CUI:C0030705', 'CUI:C0025937', 'CUI:C0019993', 'CUI:C0776499', 'LNC:9584-4', 'CUI:C0026848', 'HP:0002093', 'LNC:22106-9', 'CUI:C0020615', 'CUI:C0004574', 'LNC:31244-7', 'CUI:C1297876', 'CUI:C0086252', 'CUI:C0052796', 'CUI:C0004398', 'NCIT:C27864', 'CUI:C0323517', 'CUI:C0014310', 'CUI:C0580205', 'CUI:C0039194', 'CUI:C0699748', 'CUI:C0007450', 'CUI:C0175923', 'CUI:C0325331', 'CUI:C0051200', 'CUI:C0376568', 'CUI:C0014792', 'LNC:41413-6', 'CUI:C0017725', 'CUI:C1285186', 'CUI:C0007028', 'CUI:C0276846', 'CUI:C0320842', 'CUI:C0702166', 'CUI:C0007570', 'CUI:C0004573', 'CUI:C1263988', 'CUI:C0002878', 'CUI:C0039753', 'CUI:C0043528', 'SNOMEDCT:61370009', 'CUI:C0009326', 'CUI:C0040549', 'CUI:C0325051', 'LNC:23663-8', 'HP:0001878', 'SNOMEDCT:65294004', 'CUI:C0325224', 'CUI:C0012634', 
'CUI:C0026249', 'CUI:C0003862', 'SNOMEDCT:1342005', 'SNOMEDCT:53253006', 'CUI:C0184661', 'CUI:C0016875', 'LNC:16426-9', 'CUI:C1273870', 'HP:0012378', 'CUI:C0456386', 'LNC:87547-6', 'CUI:C0085325', 'LNC:88233-2', 'CUI:C0003402', 'CUI:C0042210', 'HP:0100776', 'CUI:C0021289', 'CUI:C1532043', 'CUI:C0042196', 'CUI:C0003420', 'CUI:C0005791', 'CUI:C0276854', 'CUI:C0162326', 'LNC:54217-5', 'HP:0010783', 'HP:0001864', 'CUI:C0871685', 'CUI:C0040558', 'NCIT:C122179', 'CUI:C0320819', 'OBO:ICD10_B60.0', 'CUI:C1511501', 'CUI:C0324306', 'CUI:C0032520', 'CUI:C0009541', 'CUI:C0325216', 'CUI:C1539081', 'CUI:C0032148', 'CUI:C0063393', 'CUI:C0949466', 'LNC:88450-2', 'CUI:C0552664', 'CUI:C0599779', 'CUI:C0008059', 'CUI:C0042211', 'CUI:C0058099', 'CUI:C0038038', 'CUI:C0006035', 'CUI:C0022658', 'CUI:C0199176', 'SNOMEDCT:397072005', 'CUI:C0035804', 'SNOMEDCT:64950006', 'CUI:C1297409', 'CUI:C0040649', 'MEDDRA:10002067', 'CUI:C0282510', 'OBO:COHD_439730', 'CUI:C0324180', 'CUI:C0003316', 'NCBITaxon:32594', 'CUI:C0020268', 'SNOMEDCT:1102005', 'CUI:C0277785', 'SNOMEDCT:76828008', 'MONDO:0002009', 'CUI:C0023281', 'CUI:C0320830', 'CUI:C0549634', 'HP:0001943', 'CUI:C0024115', 'CUI:C0003064', 'NCIT:C122180', 'SNOMEDCT:38602006', 'SNOMEDCT:415979007', 'HP:0002098', 'CUI:C0325222', 'CUI:C0024109', 'CUI:C0199470', 'CUI:C0041942', 'LNC:22849-4', 'CUI:C0023358', 'CUI:C0597305', 'SNOMEDCT:608923007', 'CUI:C0376261', 'CUI:C0024282', 'CUI:C0011315', 'CUI:C0323406', 'CUI:C0030054', 'CUI:C0679818', 'CUI:C0035956', 'LNC:22848-6', 'LNC:22850-2', 'CUI:C0430054', 'MESH:D017282', 'CUI:C0175925', 'CUI:C0040196', 'LNC:88728-1', 'CUI:C0001047', 'HP:0000093', 'CUI:C0023756', 'CUI:C0276848', 'CUI:C0017462', 'CUI:C0002895', 'CUI:C1292533', 'CUI:C0999517', 'CUI:C0005841', 'CUI:C0011911', 'CUI:C0948145', 'MEDDRA:10003964', 'LNC:42581-9', 'CUI:C0007018', 'CUI:C0043210', 'CUI:C0025252', 'CUI:C0086418', 'HP:0100827', 'CUI:C0024291', 'HP:0002908', 'LNC:16425-1', 'CUI:C0325174', 'CUI:C0006034', 'CUI:C0035005', 'HP:0002240', 
'CUI:C0320827', 'MONDO:0021136', 'CUI:C0733470', 'CUI:C0024198', 'CUI:C0070129', 'CUI:C0455014', 'CUI:C0023779', 'CUI:C0031268', 'CUI:C0220908', 'CUI:C0323515', 'LNC:22845-2', 'CUI:C0063413', 'CUI:C0024400', 'LNC:22104-4', 'LNC:26622-1', 'LNC:60521-2', 'CUI:C0325001', 'ICD10CM:B60', 'CUI:C0687759', 'LNC:43085-0', 'CUI:C1482264', 'LNC:22854-4', 'CUI:C0948202', 'CUI:C0272286', 'CUI:C0003241', 'NCBITaxon:5866', 'CUI:C0318328', 'CUI:C0036974', 'HP:0001873', 'CUI:C0325175', 'CUI:C0025266', 'CUI:C0015970', 'CUI:C0323454', 'CUI:C0011777', 'LNC:42580-1', 'LNC:10347-3', 'CUI:C0033477', 'CUI:C0003460', 'LNC:LA17804-8', 'CUI:C0014441', 'CUI:C0320816', 'CUI:C0042567', 'CUI:C0034693', 'CUI:C0004368', 'MESH:D016793', 'CUI:C0070532', 'CUI:C0320821', 'UniProtKB:Q6UXR4', 'CUI:C1511661', 'LNC:88461-9', 'CUI:C0018561', 'CUI:C0009932', 'LNC:47073-2', 'HP:0000716', 'LNC:22851-0', 'CUI:C0311392', 'CUI:C0585171', 'LNC:23664-6', 'CUI:C0012860', 'CUI:C0324183', 'LNC:34940-7', 'SNOMEDCT:415982002', 'CUI:C0728940', 'CUI:C0009017', 'CUI:C0243077', 'HP:0001973', 'CUI:C0344211', 'CUI:C0000934', 'LNC:22856-9', 'DOID:2789', 'CUI:C0009676', 'CUI:C0132172', 'SNOMEDCT:112420006', 'HP:0001888', 'CUI:C0237401', 'CUI:C0029235', 'LNC:42641-1', 'CUI:C0037993', 'CUI:C0021368', 'CUI:C0323435', 'CUI:C0036055', 'CUI:C0320843', 'CUI:C0053355', 'CUI:C1444783', 'CUI:C0086944', 'LNC:22855-1', 'CUI:C0009566', 'CUI:C0010418', 'CUI:C0051542', 'EFO:0001067', 'CUI:C0009450', 'CUI:C0086565', 'SNOMEDCT:32748003', 'CUI:C0011900', 'CUI:C0020964', 'CUI:C0006801', 'SNOMEDCT:86432002', 'SNOMEDCT:106615005']

and the error:

  - 2020-07-06 18:36:17.402073 ERROR: Encountered a problem using NodeSynonymizer: Traceback (most recent call last):
  File "/Users/aglen/translator/RTX_new/RTX/code/ARAX/test/../ARAXQuery/Expand/expand_utilities.py", line 218, in get_preferred_curies
    curie_list = node_synonymizer.get_equivalent_curies(curies, kg_name="KG2")
  File "/Users/aglen/translator/RTX_new/RTX/code/ARAX/ARAXQuery/Expand/../../NodeSynonymizer/node_synonymizer.py", line 1118, in get_equivalent_curies
    if results[row[0]] is None:
KeyError: 'ORPHANET:108'

edeutsch commented 4 years ago

Yeah, Orphanet is a bit awkward because KG2 uses Orphanet: and the SRI normalizer uses ORPHANET:. I thought I had figured it out, but there is a bug in the build process, I suppose. I put in a little patch that should fix this. Please try pulling from master.
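The actual patch is not shown in this thread. As an illustration of the general problem, the sketch below keys results by the caller's original curies while matching returned curies on an uppercased prefix, so that `ORPHANET:108` coming back from a lookup is attributed to the input `Orphanet:108` instead of raising a KeyError. All function names here are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def canonical_key(curie: str) -> str:
    """Uppercase only the prefix, leaving the local identifier untouched."""
    prefix, separator, local_id = curie.partition(":")
    return prefix.upper() + separator + local_id


def init_results(
    input_curies: List[str],
) -> Tuple[Dict[str, Optional[List[str]]], Dict[str, str]]:
    """Build the results dict plus a map from canonical key back to the input spelling."""
    results: Dict[str, Optional[List[str]]] = {curie: None for curie in input_curies}
    key_to_input = {canonical_key(curie): curie for curie in input_curies}
    return results, key_to_input


def record(
    results: Dict[str, Optional[List[str]]],
    key_to_input: Dict[str, str],
    returned_curie: str,
    equivalents: List[str],
) -> bool:
    """Store equivalents under the caller's spelling; return False if unmatched."""
    original = key_to_input.get(canonical_key(returned_curie))
    if original is None:
        return False
    results[original] = equivalents
    return True


results, key_to_input = init_results(["Orphanet:108", "DOID:2789"])
matched = record(results, key_to_input, "ORPHANET:108", ["Orphanet:108"])
print(matched, results)
```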