Georgetown-IR-Lab / QuickUMLS

System for Medical Concept Extraction and Linking
MIT License

"leucovorin" causes a KeyError #13

slbayer closed this issue 6 years ago

slbayer commented 6 years ago

I downloaded version 1.2.2 from the releases section on macOS 10.13.3 with Python 2.7.11, installed all dependencies successfully, and built the QuickUMLS database from UMLS_2017AB after installing UMLS with only the active vocabularies. I get this error:

>>> m.match("The patient is taking leucovorin.")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "quickumls.py", line 321, in match
    matches = self._get_all_matches(ngrams)
  File "quickumls.py", line 221, in _get_all_matches
    cuisem_match = sorted(self.cuisem_db.get(match))
  File "toolbox.py", line 258, in get
    cuis = pickle.loads(self.cui_db.Get(db_key_encode(term)))
KeyError

In quickumls.py, line 216, ngram_cands is

['leucovorin', 'Leucovorin', 'Leucovorin\xc3\x82', 'L-leucovorin', 'S-leucovorin', 'S leucovorin', '6S leucovorin', '6S-leucovorin']

and it's barfing on the third one.

slbayer commented 6 years ago

Inserting a try-except inside get() at line 255 of toolbox.py seems to fix the problem, whatever it is.
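
A minimal sketch of what such a guard could look like. It is not the actual `toolbox.py` code: `FakeDB` and `get_cuis` are hypothetical stand-ins, the CUI is a placeholder, and the only thing carried over from the traceback is the assumption that a missing key surfaces as a `KeyError` from `Get`.

```python
import pickle


class FakeDB(object):
    """Hypothetical stand-in for the database handle.

    Like the handle in the traceback, Get raises KeyError on a missing key.
    """

    def __init__(self, data):
        self._data = data

    def Get(self, key):
        return self._data[key]


def get_cuis(db, term):
    """Return the unpickled value stored for term, or an empty set if absent."""
    try:
        return pickle.loads(db.Get(term.encode("utf-8")))
    except KeyError:
        # Candidate string has no entry in the database: treat it as "no match".
        return set()


# Placeholder CUI, for illustration only.
db = FakeDB({b"leucovorin": pickle.dumps({"C0000000"})})

found = get_cuis(db, "leucovorin")
missing = get_cuis(db, u"Leucovorin\u00c2")
```

Returning an empty set keeps a later `sorted(...)` call working, but it hides the missing key rather than explaining why the candidate was generated in the first place.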

soldni commented 6 years ago

Thank you for the report! I'll be investigating this issue shortly.

soldni commented 6 years ago

I can't reproduce this issue on Python 2.7.14. Looking at ngram_cands, I get the following candidates:

['leucovorin', 'Leucovorin', 'Leucovorin\xc2\xae', 'L-leucovorin', 'S-leucovorin', 'S leucovorin', '6S leucovorin', '6S-leucovorin']

which return no errors. Note that the third candidate is Leucovorin®, which matches concept [A24104223/LNC/LA/LA14337-2] in UMLS. Could this be due to some data corruption between installation and matching?

slbayer commented 6 years ago

Hm. Must be something like that on my end, because ngram_cands for me looks like this:

['leucovorin', 'Leucovorin', 'Leucovorin\xc3\x82', 'L-leucovorin', 'S-leucovorin', 'S leucovorin', '6S leucovorin', '6S-leucovorin']

The third candidate in your list does, indeed, return something from the DB. Not sure what's going on, but I don't have time to figure it out. Sigh. Thanks.
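
One possible explanation, offered as a guess rather than a diagnosis: the two lists differ in a way that is consistent with double encoding. `\xc2\xae` is the UTF-8 encoding of ®, and `\xc3\x82` is the UTF-8 encoding of Â, which is what the first byte of ® becomes if UTF-8 bytes are misread as Latin-1 and encoded again. The snippet below (Python 3, using only the two byte strings from this thread) shows the relationship. Where the trailing byte pair was lost is not something the thread establishes.

```python
# Third candidate as reported by soldni (works) and by slbayer (KeyError).
good = b"Leucovorin\xc2\xae"
bad = b"Leucovorin\xc3\x82"

# Decoded correctly as UTF-8, the working candidate ends in the (R) sign.
good_text = good.decode("utf-8")

# Misread the same UTF-8 bytes as Latin-1, then encode the result as UTF-8 again.
double_encoded = good.decode("latin-1").encode("utf-8")

# The failing candidate is a prefix of the double-encoded form:
# it contains the mangled first byte but not the second.
is_prefix = double_encoded.startswith(bad)

print(repr(good_text))
print(repr(double_encoded))
print(is_prefix)
```

If this is what happened, the likely place to look is wherever the UMLS files were read or written with an encoding other than UTF-8 while the database was being built.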