Closed amykglen closed 4 years ago
Apparently not. But you're right that it should be. I'm in the process of tearing apart KGNodeIndex and trying to include the SRI NodeNormalizer. It's rather messy. Hopefully I'll have a better system put together soon. I'll use this as a check, thanks.
Found another instance of this: DOID:9281's synonyms include OMIM:261600, but OMIM:261600's synonyms do not include DOID:9281:
Using KGNodeIndex.get_equivalent_curies() with kg_name='KG2':
DOID:9281's synonyms are:
['DOID:9281', 'OMIM:261600', 'NCI_NICHD:C81315', 'CUI:C0031485', 'MEDDRA:10034872', 'NCIT:C81315', 'MEDCIN:32390', 'LNC:LP56980-3', 'NCI_NCI-GLOSS:CDR0000446806', 'MONDO:0009861', 'Orphanet:716', 'REACT:R-HSA-2160456', 'LNC:LA21169-0', 'MEDLINEPLUS:1231']
OMIM:261600's synonyms are:
['OMIM:261600', 'NCI_NCI-GLOSS:CDR0000458041', 'MEDDRA:10035118']
(Noticing such discrepancies because they make deduplication difficult in expand, see #823.)
Apologies for the slowness on my part on this. The existing code has a design flaw in this respect. I have essentially rewritten the KGNodeIndex since its role has now ballooned far beyond its original scope. It has now become a credible node synonymizer. It provides an interface to the SRI node normalizer but then also provides its own, which I think will be far more comprehensive and useful in our reasoning. I'm pretty happy with how it builds the index for KG1 now. Building of the index for KG2 is in progress. But since it tries to include information from the SRI node normalizer via web service calls, it is extremely slow the first time. I hope I'll have something for you to start using on Monday. But see also #861 for design questions.
No worries, sounds good! (The format/contents of the example output you posted in #861 look awesome by the way - think that will work really nicely for expand's needs.)
The build process is still chugging away, sadly.. still only 45% through the file. Working through all the NCIT terms now.. This will likely take several more days, but it is progressing..
found another instance of this that could serve as a test case (thanks to @chunyuma):
equivalent curies for UniProtKB:P06724 are: ['UniProtKB:P06724']
equivalent curies for UniProtKB:P30518 are: ['UniProtKB:P06724', 'HGNC:897', 'Orphanet:118947', 'UniProtKB:P30518', 'PR:P30518', 'NCI_NCI-HGNC:HGNC%3A897', 'OMIM:300538', 'NCBIGene:554', 'CUI:C1332124']
@amykglen I have checked in the new NodeSynonymizer. I think the build process is pretty robust. There are a few tweaks I still want to make, but it may be ready for your testing. I have not made the time to fully update the user interface yet, so that is spotty. I will try to fix it soon. Let me know which methods I should prioritize. The one new method get_normalizer_results() is fully functional and essentially dumps a full listing of all information it knows for a concept. You could use that and just pull out what you want. Or if you want to use one of the simpler methods, let me know which ones I should fix up first. The NodeSynonymizer is in a new place under ARAX. The old KGNodeIndex will be phased out and retired, but will still be available for a while.
how to build:
git pull (master)
If your NodeNameDescriptions files are not already up to date, you should first do:
cd $RTX/data/KGmetadata
python3 dumpdata.py
Then:
cd $RTX/code/ARAX/NodeSynonymizer
python3 sri_node_normalizer.py --build
python3 node_synonymizer.py --build --kg_name=both
python3 node_synonymizer.py --lookup=rickets --kg_name=KG1
One possible snag is that the build process needs 20GB of free RAM to work. If that's not an option, then you could probably just copy the SQLite database from: /mnt/data/orangeboard/devED/RTX/code/ARAX/NodeSynonymizer/node_synonymizer.sqlite ?
Let me know how it goes. It's not really finished as I said, but hopefully functional enough to start your testing?
awesome! yep, was able to get going and begin trying it out/integrating it. (opted to download the database.)
there seem to be quite a few curies that get_normalizer_results(curie, kg_name="KG2") errors out for:
- 2020-07-03 17:47:06.619865 ERROR: Problem using NodeNormalizer. Input curie was CHEMBL.COMPOUND:CHEMBL564829: Traceback (most recent call last):
File "/Users/aglen/translator/RTX_new/RTX/code/ARAX/test/../ARAXQuery/Expand/expand_utilities.py", line 126, in get_preferred_curie
normalizer_results = node_synonymizer.get_normalizer_results(curie, kg_name="KG2")
File "/Users/aglen/translator/RTX_new/RTX/code/ARAX/ARAXQuery/Expand/../../NodeSynonymizer/node_synonymizer.py", line 1222, in get_normalizer_results
types[rows[3]] = 1
IndexError: list index out of range
some more examples of input curies it throws this error for are: CUI:C0078939, HP:0008936, OMIM:120080, PR:000003507, OMIM:610269...
fixed typo and pushed to master. Should be fixed now. Let me know if you still have problems
nice, thanks - things seem to be running pretty well now - all expand pytests are passing, and the level of synonymization/deduplication appears to be much greater than with the prior KGNodeIndex method.
there are a few curies that get_normalizer_results(curie, kg_name="KG2") doesn't find results for... which isn't a breaking issue - I have expand log a warning in such cases and revert to using the original curie (so some synonymization/deduplication may be missed). but here are the ones I'm aware of so far:
{'OBO:SCTID_21061004', 'OBO:GARD_0005878', 'CUI:C3826682', 'OBO:COHD_439730', 'CUI:C3826681', 'OBO:ICD10_B60.0', 'OBO:ICD9_088.82', 'CUI:C3826680'}
all of these are curies in KG2. and actually, it makes sense the synonymizer can't find results for them, because it looks like they don't have names:
so I guess that isn't really something that can be fixed in the synonymizer.
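The warn-and-fall-back behavior described above can be sketched roughly as follows. This is only an illustration, not the actual expand code: the shape of `normalizer_results` and the `preferred_curie` key are assumptions made for the example, and the sample mapping is illustrative rather than real synonymizer output.

```python
import logging
from typing import Dict, List, Optional

logger = logging.getLogger("expand_sketch")


def choose_preferred_curies(curies: List[str],
                            normalizer_results: Dict[str, Optional[dict]]) -> Dict[str, str]:
    """Map each input curie to a preferred curie, keeping the original when nothing is found.

    ASSUMPTION: normalizer_results maps curie -> dict with a 'preferred_curie' key,
    or None when the synonymizer has no entry. The real output format may differ.
    """
    preferred = {}
    for curie in curies:
        result = normalizer_results.get(curie)
        if result and result.get("preferred_curie"):
            preferred[curie] = result["preferred_curie"]
        else:
            # Not a breaking issue: warn and revert to the original curie.
            logger.warning("No synonymizer results for %s; keeping original curie", curie)
            preferred[curie] = curie
    return preferred


# Illustrative stand-in results (not real synonymizer output).
sample_results = {
    "OMIM:261600": {"preferred_curie": "DOID:9281"},
    "CUI:C3826682": None,
}
sample_preferred = choose_preferred_curies(["OMIM:261600", "CUI:C3826682"], sample_results)
```

The only cost of the fallback is that the unrecognized curie stays in its own group, so some deduplication is missed for it.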
as far as the interface goes, it's perfectly fine as is - I'm able to grab what I need. but here's what I think would be ideal for the synonymization half of the problem:
get_equivalent_curies(curie, source) that returns a set of equivalent curies
and fyi, @edeutsch - the running times I'm seeing for get_normalizer_results() are not super fast, which is slowing expand down a little bit, due to its reliance on the method. I'm sending curies in batches as that definitely is faster.
here are some average times I'm seeing for get_normalizer_results():
batch size (# of curies) | average time (seconds) |
---|---|
2070 | 7.2 |
1119 | 6.5 |
995 | 8.2 |
722 | 4.0 |
252 | 1.6 |
(these are with kg_name="KG2".)
(expand calls get_normalizer_results() about 3 times per edge, with varying batch size.)
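For reference, sending curies in batches rather than one at a time might look something like the sketch below. `lookup_batch` is a hypothetical stand-in for whatever batch-capable call is used; its signature and return shape are assumptions, and only the chunking logic is the point here.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def lookup_in_batches(curies: Sequence[str],
                      lookup_batch: Callable[[List[str]], Dict[str, list]],
                      batch_size: int = 1000) -> Dict[str, list]:
    """Call lookup_batch once per chunk and merge the per-chunk results.

    ASSUMPTION: lookup_batch takes a list of curies and returns a dict keyed by curie.
    """
    results: Dict[str, list] = {}
    for batch in chunked(curies, batch_size):
        results.update(lookup_batch(list(batch)))
    return results
```

Given the timings above, larger batches amortize the per-call overhead better, so the batch size is worth tuning against real data.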
I'm certain I can make it faster. This method returns a bunch of things doing several queries all in a loop. What's the minimum information you want for this? a list of equivalent nodes? or? I can also make this method faster, but since it does several things, that will be harder than focusing on just what you want back.
oh oops, you already answered that question two posts up. Got it.
cool - yeah, I really only have two rather specific use cases (used at separate points in expand):
so it'd be awesome if there were a method for each of those.
I just pushed a new version of get_equivalent_curies() to master. Test with: python node_synonymizer.py --get=DOID:384,DOID:13636,DOID:9281,DOID:99x --kg_name=KG2
It doesn't have everything you wanted. And the output format is different. Feel free to suggest changes that would make it better for you.
Regarding this somewhat older post:
there are a few curies that get_normalizer_results(curie, kg_name="KG2") doesn't find results for... which isn't a breaking issue - I have expand log a warning in such cases and revert to using the original curie (so some synonymization/deduplication may be missed). but here are the ones I'm aware of so far:
{'OBO:SCTID_21061004', 'OBO:GARD_0005878', 'CUI:C3826682', 'OBO:COHD_439730', 'CUI:C3826681', 'OBO:ICD10_B60.0', 'OBO:ICD9_088.82', 'CUI:C3826680'}
all of these are curies in KG2. and actually, it makes sense the synonymizer can't find results for them, because it looks like they don't have names:
so I guess that isn't really something that can be fixed in the synonymizer.
I think the problem is different. The NodeSynonymizer does not find them because they are not in my view of KG2:
grep -i 'CUI:C3826682' NodeNamesDescriptions_KG2.tsv
grep -i 'OBO:COHD_439730' NodeNamesDescriptions_KG2.tsv
grep -i 'CUI:C3826680' NodeNamesDescriptions_KG2.tsv
all return nothing. As far as I can tell, these concepts are not in KG2. (At least they are not in the dump of some KG2 as produced by $RTX/node/data/KGmetadata/dumpdata.py.)
It seems there are multiple versions of KG2. Do we have some kind of versioning system for KG2 that we might leverage?
I don't think this is a KG2 version issue - my NodeNamesDescriptions_KG2.tsv also doesn't contain these curies, but I just rebuilt them a couple days ago (and dumpdata.py pulls directly from the 'production' kg2endpoint2).
it looks to me like dumpdata.py doesn't add nodes to NodeNamesDescriptions_KG2.tsv if they're missing a 'name': https://github.com/RTXteam/RTX/blob/4ea641cdff3d7cbeaf58ea476d6f7f564ccaf63e/data/KGmetadata/dumpdata.py#L36
aha, okay, good point. I can fix dumpdata.py. And I can fix NodeSynonymizer to act reasonably if there is an empty name. Included in that is to make sure that all those nodes with no name don't form a joint list of synonyms!
But it won't be able to do much that is sensible with them. It could look at SRI, and it's possible that there might be something there, but that seems doubtful. Otherwise, since there's no name, there's no way to cluster them with anything else, so they would be orphans anyway. The only thing that will change is that they will be acknowledged to be a node, but not in any group.
I can do that, but it doesn't seem too useful? Is there a fix at the KG2 level that could/should happen first?
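To illustrate the concern about nameless nodes forming a joint list of synonyms, here is a deliberately simplified sketch. It assumes grouping purely by case-insensitive name, which is a simplification and not how the NodeSynonymizer necessarily clusters nodes; the function name, the input format, and the placeholder names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def group_by_name(nodes: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, List[str]]:
    """Return curie -> sorted list of curies sharing its name.

    Nodes whose name is empty or None are kept as single-member groups
    instead of being clustered together under the empty string.
    """
    by_name = defaultdict(set)
    groups: Dict[str, List[str]] = {}
    for curie, name in nodes:
        key = (name or "").strip().lower()
        if key:
            by_name[key].add(curie)
        else:
            groups[curie] = [curie]  # orphan: acknowledged as a node, but in no group
    for members in by_name.values():
        ordered = sorted(members)
        for curie in members:
            groups[curie] = ordered
    return groups


# Placeholder names, for illustration only.
sample_nodes = [
    ("DOID:9281", "Example Name"),
    ("OMIM:261600", "example name"),
    ("CUI:C3826682", ""),
    ("OBO:COHD_439730", None),
]
sample_groups = group_by_name(sample_nodes)
```

Without the empty-name check, the two nameless curies would land in the same bucket and be reported as synonyms of each other.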
yeah, not sure how useful that would be. it looks like about 2% of KG2 nodes (170k) are lacking a name at the moment. I'm not sure how much of that is a bug though (i.e., sometimes concepts are actually nameless, I believe?)
I've heard of a horse with no name, and places where the streets have no name. But I don't think chemical_substances with no name will help us. And unknown_categories with no name seem even less useful. Seems like a bug to me. I think I will not make any changes to dumpdata.py until we're more confident that these will be useful to index.
sounds good to me!
tried out the new get_equivalent_curies() by the way, and indeed it's much faster! thank you.
there is one test it's giving me an error for - here's the list of curies I give it:
['SNOMEDCT:23450008', 'CUI:C1532045', 'LNC:31245-4', 'CUI:C0013722', 'HP:0000980', 'CUI:C0034414', 'HP:0001399', 'LNC:21089-8', 'LNC:10647-6', 'CUI:C0085329', 'SNOMEDCT:4284001', 'CUI:C0325182', 'CUI:C0323464', 'CUI:C0004358', 'CUI:C0033684', 'NCIT:C34953', 'HP:0003365', 'CUI:C0030499', 'CUI:C0241407', 'CUI:C0450442', 'CUI:C0001314', 'CUI:C0128897', 'CUI:C0033740', 'LNC:82747-7', 'CUI:C0747256', 'CUI:C1334043', 'CUI:C0683954', 'CUI:C0035448', 'CUI:C0013467', 'CUI:C1519885', 'HP:0000789', 'CUI:C0003063', 'CUI:C0325185', 'CUI:C1123019', 'CUI:C0001779', 'CUI:C1449559', 'CUI:C0021088', 'CUI:C0696113', 'CUI:C1293116', 'CUI:C0392895', 'CUI:C0850715', 'MESH:D011529', 'CUI:C0030362', 'LNC:16118-2', 'CUI:C0040808', 'CUI:C0337527', 'HP:0002829', 'CUI:C0007753', 'CUI:C0013162', 'HP:0001000', 'CUI:C0206160', 'CUI:C0011065', 'CUI:C0003811', 'CUI:C0702091', 'SNOMEDCT:455000', 'CUI:C0424786', 'CUI:C1261322', 'CUI:C1254373', 'CUI:C1136169', 'CUI:C0027769', 'CUI:C0206253', 'NCIT:C66830', 'CUI:C0324939', 'CUI:C0162318', 'CUI:C0012854', 'CUI:C0024530', 'CUI:C0312452', 'SNOMEDCT:30408003', 'CUI:C0038002', 'CUI:C0030863', 'CUI:C0014412', 'CUI:C0024228', 'CUI:C0008051', 'CUI:C0325628', 'MEDLINEPLUS:1645', 'SNOMEDCT:415981009', 'CUI:C0018935', 'CUI:C0013218', 'HP:0001974', 'CUI:C0237798', 'SNOMEDCT:66818003', 'CUI:C0029237', 'CUI:C1136254', 'LNC:43893-7', 'CUI:C3826680', 'CUI:C0002797', 'CUI:C0325087', 'CUI:C1510418', 'CUI:C0002880', 'CUI:C0457437', 'LNC:47396-7', 'CUI:C0028737', 'HP:0001635', 'NCIT:C128453', 'NCBIGene:959', 'CUI:C0052430', 'HP:0001324', 'CUI:C0445623', 'LNC:6311-5', 'HP:0001903', 'LNC:10648-4', 'CUI:C0039005', 'HP:0002910', 'CUI:C0006104', 'CUI:C0334901', 'CUI:C0027121', 'MEDCIN:278147', 'CUI:C0032149', 'CUI:C0029122', 'CUI:C0002712', 'CUI:C0031809', 'OBO:ICD9_088.82', 'CUI:C1947990', 'CUI:C0015967', 'CUI:C1510438', 'LNC:22853-6', 'CUI:C0026022', 'CUI:C0185125', 'SNOMEDCT:21852000', 'HP:0001289', 'HP:0003573', 'CUI:C1511790', 'CUI:C0034422', 'HP:0002721', 
'CUI:C0019048', 'HP:0002039', 'CUI:C0691786', 'LNC:16117-4', 'HP:0001972', 'HP:0001941', 'CUI:C0003338', 'CUI:C0026336', 'CUI:C0070533', 'CUI:C0034500', 'CUI:C0087111', 'CUI:C1265549', 'CUI:C0021740', 'CUI:C0016286', 'LNC:22858-5', 'LNC:23665-3', 'SNOMEDCT:418101009', 'CUI:C0392920', 'SNOMEDCT:50274008', 'CUI:C0037995', 'CUI:C0324296', 'CUI:C0079603', 'CUI:C0879626', 'CUI:C0060495', 'LNC:23662-0', 'HP:0000952', 'CUI:C0085328', 'CUI:C0020928', 'CUI:C0006352', 'CUI:C0058282', 'NCBITaxon:5865', 'CUI:C0005779', 'CUI:C0033147', 'CUI:C0036690', 'CUI:C0040165', 'LNC:7813-9', 'CUI:C0001675', 'CUI:C0260095', 'CUI:C0333547', 'LNC:43086-8', 'CUI:C0598741', 'CUI:C0013018', 'CUI:C0061928', 'CUI:C0849679', 'CUI:C0023416', 'CUI:C2345908', 'HP:0001923', 'OBO:GARD_0005878', 'CUI:C0318342', 'HP:0000969', 'NCIT:C61410', 'CUI:C0743841', 'CUI:C0012984', 'CUI:C0999544', 'HP:0002527', 'CUI:C0003062', 'CUI:C0085327', 'CUI:C0007452', 'SNOMEDCT:38391004', 'CUI:C0313107', 'CUI:C0325005', 'CUI:C0321644', 'CUI:C0085393', 'CUI:C0856169', 'CUI:C0552665', 'HP:0003073', 'CUI:C0599755', 'CUI:C0005800', 'CUI:C0233481', 'HP:0001882', 'CUI:C0162700', 'CUI:C0040203', 'ICD10:B60', 'HP:0000975', 'CUI:C0231189', 'CUI:C0003898', 'CUI:C0486382', 'CUI:C0320813', 'CUI:C0030842', 'HP:0001944', 'CUI:C1295927', 'LNC:82748-5', 'CUI:C0949216', 'CUI:C0276852', 'CUI:C1167395', 'CUI:C0022346', 'CUI:C0001655', 'LNC:16427-7', 'OBO:SCTID_21061004', 'SNOMEDCT:17101005', 'CUI:C0024141', 'CUI:C0318329', 'CUI:C0023882', 'CUI:C0325319', 'LNC:43918-2', 'CUI:C0275524', 'CUI:C0024544', 'CUI:C0946608', 'CUI:C0562691', 'CUI:C0383327', 'CUI:C0002871', 'CUI:C0031831', 'CUI:C0040669', 'SNOMEDCT:17800008', 'LNC:43926-5', 'CUI:C0339510', 'CUI:C0323438', 'SNOMEDCT:22405002', 'LNC:67866-4', 'CUI:C0486383', 'CUI:C0392318', 'CUI:C1540912', 'HP:0001433', 'CUI:C0872054', 'CUI:C0026766', 'CUI:C0021294', 'LNC:88451-0', 'CUI:C0009429', 'CUI:C0325003', 'LNC:43087-6', 'HP:0100598', 'CUI:C0004366', 'CUI:C0325253', 'CUI:C0282509', 'CUI:C0013090', 
'CUI:C0483368', 'CUI:C0142025', 'LNC:31246-2', 'CUI:C0368726', 'CUI:C0041213', 'SNOMEDCT:26114002', 'HP:0002157', 'CUI:C0282647', 'CUI:C0027361', 'HP:0000718', 'HP:0000099', 'CUI:C0040034', 'CUI:C0013798', 'CUI:C0007634', 'HP:0002013', 'HP:0100724', 'HP:0001895', 'MEDCIN:278146', 'CUI:C0150270', 'CUI:C0376387', 'CUI:C0037813', 'CUI:C0456388', 'CUI:C0085326', 'CUI:C0030660', 'CUI:C0020649', 'CUI:C0012222', 'CUI:C0036743', 'CUI:C3826682', 'LNC:22844-5', 'CUI:C0323499', 'CUI:C0010240', 'LNC:88452-8', 'CUI:C0323465', 'CUI:C0036945', 'CUI:C0013216', 'CUI:C0027567', 'HP:0002383', 'CUI:C0037998', 'CUI:C0035950', 'LNC:22857-7', 'CUI:C0679646', 'SNOMEDCT:43574002', 'CUI:C0684073', 'CUI:C0324818', 'CUI:C0021270', 'Orphanet:108', 'CUI:C0027061', 'CUI:C1510458', 'CUI:C0939219', 'HP:0002719', 'CUI:C1289877', 'MESH:D016792', 'CUI:C0079186', 'CUI:C0162699', 'CUI:C0277564', 'HP:0001876', 'CUI:C0024660', 'CUI:C0231224', 'CUI:C0314622', 'CUI:C0029039', 'LNC:22107-7', 'SNOMEDCT:105652001', 'LNC:47071-6', 'HP:0001824', 'CUI:C0320818', 'CUI:C0035899', 'HP:0001875', 'HP:0001744', 'CUI:C0123759', 'HP:0002315', 'CUI:C0368720', 'LNC:89342-0', 'CUI:C3826681', 'CUI:C1313951', 'CUI:C0026809', 'HP:0001376', 'CUI:C0003320', 'CUI:C0003416', 'CUI:C1332714', 'CUI:C0021742', 'CUI:C0011133', 'CUI:C0011596', 'CUI:C0558024', 'CUI:C0019054', 'CUI:C0030498', 'CUI:C0013227', 'NCIT:C35803', 'CUI:C0035222', 'CUI:C0878544', 'CUI:C0033739', 'CUI:C0320810', 'CUI:C0035078', 'LNC:9585-1', 'CUI:C0238644', 'CUI:C0023418', 'CUI:C0008996', 'CUI:C0034417', 'HP:0001254', 'CUI:C0020971', 'HP:0001658', 'CUI:C0026447', 'HP:0100608', 'LNC:24408-7', 'CUI:C0324145', 'CUI:C2733204', 'HP:0000613', 'CUI:C0003232', 'CUI:C0018270', 'CUI:C0497093', 'LNC:41414-4', 'CUI:C0313532', 'CUI:C0521829', 'CUI:C0324323', 'HP:0002017', 'CUI:C0868945', 'SNOMEDCT:415983007', 'CUI:C0301872', 'CUI:C0242723', 'CUI:C0324996', 'CUI:C0999244', 'LNC:41415-1', 'LNC:27965-3', 'HP:0001259', 'LNC:LP14081-1', 'CUI:C0301838', 'CUI:C0030312', 
'SNOMEDCT:24620004', 'CUI:C0012940', 'CUI:C0562690', 'LNC:88700-0', 'CUI:C0019944', 'CUI:C0746336', 'CUI:C0165603', 'CUI:C0039082', 'LNC:22846-0', 'LNC:20689-6', 'CUI:C0023364', 'HP:0002615', 'CUI:C0018019', 'CUI:C0025914', 'LNC:7812-1', 'CUI:C0199960', 'CUI:C0324376', 'CUI:C1532044', 'HP:0004936', 'LNC:22847-8', 'LNC:5054-2', 'HP:0003326', 'CUI:C0011946', 'LNC:22108-5', 'CUI:C0001792', 'CUI:C0320811', 'CUI:C0085316', 'SNOMEDCT:442614005', 'CUI:C0015236', 'CUI:C0008947', 'MONDO:0002428', 'CUI:C0018557', 'CUI:C1457887', 'NCIT:C85491', 'CUI:C1504080', 'CUI:C0242966', 'HP:0000083', 'CUI:C0221460', 'CUI:C0003392', 'NCIT:C90259', 'CUI:C0026976', 'HP:0001945', 'MESH:D041001', 'HP:0003259', 'CUI:C0626053', 'CUI:C0026018', 'CUI:C0325312', 'CUI:C0080332', 'NCIT:C77916', 'CUI:C0325273', 'CUI:C0005767', 'CUI:C0242606', 'CUI:C0449411', 'CUI:C0034865', 'CUI:C0036983', 'LNC:23666-1', 'HP:0001919', 'CUI:C0019116', 'CUI:C0008269', 'MEDCIN:90190', 'CUI:C1532042', 'HP:0005521', 'CUI:C0027362', 'CUI:C1263440', 'CUI:C0320812', 'HP:0003641', 'SNOMEDCT:415980005', 'CUI:C0656383', 'CUI:C0948192', 'CUI:C0272126', 'CUI:C0033741', 'CUI:C0368725', 'LNC:67867-2', 'HP:0012735', 'CUI:C0323512', 'CUI:C0200931', 'CUI:C0585165', 'CUI:C0030705', 'CUI:C0025937', 'CUI:C0019993', 'CUI:C0776499', 'LNC:9584-4', 'CUI:C0026848', 'HP:0002093', 'LNC:22106-9', 'CUI:C0020615', 'CUI:C0004574', 'LNC:31244-7', 'CUI:C1297876', 'CUI:C0086252', 'CUI:C0052796', 'CUI:C0004398', 'NCIT:C27864', 'CUI:C0323517', 'CUI:C0014310', 'CUI:C0580205', 'CUI:C0039194', 'CUI:C0699748', 'CUI:C0007450', 'CUI:C0175923', 'CUI:C0325331', 'CUI:C0051200', 'CUI:C0376568', 'CUI:C0014792', 'LNC:41413-6', 'CUI:C0017725', 'CUI:C1285186', 'CUI:C0007028', 'CUI:C0276846', 'CUI:C0320842', 'CUI:C0702166', 'CUI:C0007570', 'CUI:C0004573', 'CUI:C1263988', 'CUI:C0002878', 'CUI:C0039753', 'CUI:C0043528', 'SNOMEDCT:61370009', 'CUI:C0009326', 'CUI:C0040549', 'CUI:C0325051', 'LNC:23663-8', 'HP:0001878', 'SNOMEDCT:65294004', 'CUI:C0325224', 'CUI:C0012634', 
'CUI:C0026249', 'CUI:C0003862', 'SNOMEDCT:1342005', 'SNOMEDCT:53253006', 'CUI:C0184661', 'CUI:C0016875', 'LNC:16426-9', 'CUI:C1273870', 'HP:0012378', 'CUI:C0456386', 'LNC:87547-6', 'CUI:C0085325', 'LNC:88233-2', 'CUI:C0003402', 'CUI:C0042210', 'HP:0100776', 'CUI:C0021289', 'CUI:C1532043', 'CUI:C0042196', 'CUI:C0003420', 'CUI:C0005791', 'CUI:C0276854', 'CUI:C0162326', 'LNC:54217-5', 'HP:0010783', 'HP:0001864', 'CUI:C0871685', 'CUI:C0040558', 'NCIT:C122179', 'CUI:C0320819', 'OBO:ICD10_B60.0', 'CUI:C1511501', 'CUI:C0324306', 'CUI:C0032520', 'CUI:C0009541', 'CUI:C0325216', 'CUI:C1539081', 'CUI:C0032148', 'CUI:C0063393', 'CUI:C0949466', 'LNC:88450-2', 'CUI:C0552664', 'CUI:C0599779', 'CUI:C0008059', 'CUI:C0042211', 'CUI:C0058099', 'CUI:C0038038', 'CUI:C0006035', 'CUI:C0022658', 'CUI:C0199176', 'SNOMEDCT:397072005', 'CUI:C0035804', 'SNOMEDCT:64950006', 'CUI:C1297409', 'CUI:C0040649', 'MEDDRA:10002067', 'CUI:C0282510', 'OBO:COHD_439730', 'CUI:C0324180', 'CUI:C0003316', 'NCBITaxon:32594', 'CUI:C0020268', 'SNOMEDCT:1102005', 'CUI:C0277785', 'SNOMEDCT:76828008', 'MONDO:0002009', 'CUI:C0023281', 'CUI:C0320830', 'CUI:C0549634', 'HP:0001943', 'CUI:C0024115', 'CUI:C0003064', 'NCIT:C122180', 'SNOMEDCT:38602006', 'SNOMEDCT:415979007', 'HP:0002098', 'CUI:C0325222', 'CUI:C0024109', 'CUI:C0199470', 'CUI:C0041942', 'LNC:22849-4', 'CUI:C0023358', 'CUI:C0597305', 'SNOMEDCT:608923007', 'CUI:C0376261', 'CUI:C0024282', 'CUI:C0011315', 'CUI:C0323406', 'CUI:C0030054', 'CUI:C0679818', 'CUI:C0035956', 'LNC:22848-6', 'LNC:22850-2', 'CUI:C0430054', 'MESH:D017282', 'CUI:C0175925', 'CUI:C0040196', 'LNC:88728-1', 'CUI:C0001047', 'HP:0000093', 'CUI:C0023756', 'CUI:C0276848', 'CUI:C0017462', 'CUI:C0002895', 'CUI:C1292533', 'CUI:C0999517', 'CUI:C0005841', 'CUI:C0011911', 'CUI:C0948145', 'MEDDRA:10003964', 'LNC:42581-9', 'CUI:C0007018', 'CUI:C0043210', 'CUI:C0025252', 'CUI:C0086418', 'HP:0100827', 'CUI:C0024291', 'HP:0002908', 'LNC:16425-1', 'CUI:C0325174', 'CUI:C0006034', 'CUI:C0035005', 'HP:0002240', 
'CUI:C0320827', 'MONDO:0021136', 'CUI:C0733470', 'CUI:C0024198', 'CUI:C0070129', 'CUI:C0455014', 'CUI:C0023779', 'CUI:C0031268', 'CUI:C0220908', 'CUI:C0323515', 'LNC:22845-2', 'CUI:C0063413', 'CUI:C0024400', 'LNC:22104-4', 'LNC:26622-1', 'LNC:60521-2', 'CUI:C0325001', 'ICD10CM:B60', 'CUI:C0687759', 'LNC:43085-0', 'CUI:C1482264', 'LNC:22854-4', 'CUI:C0948202', 'CUI:C0272286', 'CUI:C0003241', 'NCBITaxon:5866', 'CUI:C0318328', 'CUI:C0036974', 'HP:0001873', 'CUI:C0325175', 'CUI:C0025266', 'CUI:C0015970', 'CUI:C0323454', 'CUI:C0011777', 'LNC:42580-1', 'LNC:10347-3', 'CUI:C0033477', 'CUI:C0003460', 'LNC:LA17804-8', 'CUI:C0014441', 'CUI:C0320816', 'CUI:C0042567', 'CUI:C0034693', 'CUI:C0004368', 'MESH:D016793', 'CUI:C0070532', 'CUI:C0320821', 'UniProtKB:Q6UXR4', 'CUI:C1511661', 'LNC:88461-9', 'CUI:C0018561', 'CUI:C0009932', 'LNC:47073-2', 'HP:0000716', 'LNC:22851-0', 'CUI:C0311392', 'CUI:C0585171', 'LNC:23664-6', 'CUI:C0012860', 'CUI:C0324183', 'LNC:34940-7', 'SNOMEDCT:415982002', 'CUI:C0728940', 'CUI:C0009017', 'CUI:C0243077', 'HP:0001973', 'CUI:C0344211', 'CUI:C0000934', 'LNC:22856-9', 'DOID:2789', 'CUI:C0009676', 'CUI:C0132172', 'SNOMEDCT:112420006', 'HP:0001888', 'CUI:C0237401', 'CUI:C0029235', 'LNC:42641-1', 'CUI:C0037993', 'CUI:C0021368', 'CUI:C0323435', 'CUI:C0036055', 'CUI:C0320843', 'CUI:C0053355', 'CUI:C1444783', 'CUI:C0086944', 'LNC:22855-1', 'CUI:C0009566', 'CUI:C0010418', 'CUI:C0051542', 'EFO:0001067', 'CUI:C0009450', 'CUI:C0086565', 'SNOMEDCT:32748003', 'CUI:C0011900', 'CUI:C0020964', 'CUI:C0006801', 'SNOMEDCT:86432002', 'SNOMEDCT:106615005']
and the error:
- 2020-07-06 18:36:17.402073 ERROR: Encountered a problem using NodeSynonymizer: Traceback (most recent call last):
File "/Users/aglen/translator/RTX_new/RTX/code/ARAX/test/../ARAXQuery/Expand/expand_utilities.py", line 218, in get_preferred_curies
curie_list = node_synonymizer.get_equivalent_curies(curies, kg_name="KG2")
File "/Users/aglen/translator/RTX_new/RTX/code/ARAX/ARAXQuery/Expand/../../NodeSynonymizer/node_synonymizer.py", line 1118, in get_equivalent_curies
if results[row[0]] is None:
KeyError: 'ORPHANET:108'
Yeah, Orphanet is a bit awkward because KG2 uses Orphanet: and the SRI normalizer uses ORPHANET:. I thought I had figured it out, but there is a bug in the build process, I suppose. I put in a little patch that should fix this. Please try pulling from master.
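One general way to guard against this kind of prefix-case mismatch is to index curies by an upper-cased prefix and look them up with `.get()` so a miss returns None instead of raising KeyError. The sketch below is an illustration of that idea only; it is not the patch that was pushed, and the helper names are hypothetical.

```python
from typing import Dict, Iterable, Optional


def normalize_prefix_case(curie: str) -> str:
    """Upper-case only the prefix (the part before the first ':'), leaving the local ID untouched."""
    prefix, sep, local_id = curie.partition(":")
    return f"{prefix.upper()}{sep}{local_id}"


def build_case_insensitive_index(kg_curies: Iterable[str]) -> Dict[str, str]:
    """Map the prefix-normalized form of each curie to its spelling in the knowledge graph."""
    return {normalize_prefix_case(curie): curie for curie in kg_curies}


def to_kg_spelling(curie: str, index: Dict[str, str]) -> Optional[str]:
    """Return the knowledge-graph spelling of a curie, or None if it is not indexed."""
    return index.get(normalize_prefix_case(curie))


kg_index = build_case_insensitive_index(["Orphanet:108", "HP:0000980"])
kg_spelling = to_kg_spelling("ORPHANET:108", kg_index)
```

This assumes no two curies in the graph differ only in prefix case; if they did, the later one would overwrite the earlier one in the index.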
I noticed some interesting behavior when using KGNodeIndex.get_equivalent_curies()... specifically regarding the returned synonyms for these two curies:
kgni.get_equivalent_curies(curie='UniProtKB:Q13330', kg_name='KG1')
-->['UniProtKB:Q13330']
kgni.get_equivalent_curies(curie='UniProtKB:Q9BRL8', kg_name='KG1')
-->['UniProtKB:Q9BRL8', 'UniProtKB:Q13330']
I'm wondering why UniProtKB:Q9BRL8's synonyms include UniProtKB:Q13330, but UniProtKB:Q13330's synonyms don't include UniProtKB:Q9BRL8? I was operating under the assumption that synonyms should be symmetrical in this sense... but maybe that's not correct?
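A check for this kind of asymmetry could be written as a small test helper. The sketch below is a minimal illustration: `lookup` is a hypothetical callable standing in for a synonym lookup such as get_equivalent_curies(), and an in-memory dict holding the two results quoted above stands in for the real index.

```python
from typing import Callable, Iterable, List, Tuple


def find_asymmetric_pairs(curies: Iterable[str],
                          lookup: Callable[[str], List[str]]) -> List[Tuple[str, str]]:
    """Return (a, b) pairs where b is listed as a synonym of a, but a is not listed for b."""
    asymmetric = []
    for a in curies:
        for b in lookup(a):
            if b != a and a not in lookup(b):
                asymmetric.append((a, b))
    return asymmetric


# Stand-in for the index, using the two results quoted in this issue.
example_index = {
    "UniProtKB:Q13330": ["UniProtKB:Q13330"],
    "UniProtKB:Q9BRL8": ["UniProtKB:Q9BRL8", "UniProtKB:Q13330"],
}
asymmetric_pairs = find_asymmetric_pairs(example_index, lambda c: example_index.get(c, [c]))
print(asymmetric_pairs)
```

With a real index, `lookup` would wrap the actual synonym call, and an empty result list would mean the sampled curies are symmetric.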