Georgetown-IR-Lab / QuickUMLS

System for Medical Concept Extraction and Linking
MIT License

"Fluorouracil" cause KeyError on 1.3 #52

Closed by bevankoopman 4 years ago

bevankoopman commented 5 years ago

I'm getting the following KeyError when using the input text "Fluorouracil":

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    matcher.match(text, best_match=True, ignore_syntax=False)
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/quickumls/core.py", line 416, in match
    matches = self._get_all_matches(ngrams)
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/quickumls/core.py", line 295, in _get_all_matches
    cuisem_match = sorted(self.cuisem_db.get(match))
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/quickumls/toolbox.py", line 262, in get
    cuis = pickle.loads(self.cui_db.Get(db_key_encode(term)))
KeyError

Code to reproduce:

import quickumls

text="Fluorouracil"
quickumls_directory='/Users/koo01a/tools/quickumls-data'

matcher = quickumls.QuickUMLS(quickumls_directory)
matcher.match(text, best_match=True, ignore_syntax=False)

Based on the above error, I added a print(term) at toolbox.py:261, which produces the following output:

Fluorouracil
fluorouracil
FluorouracilÂ

Note the stray Unicode character at the end of the last entry, which seems to be the issue.
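
To confirm which character is actually attached to the term, it can help to print its code point instead of relying on how the terminal renders it. The following is only a minimal sketch: `describe_chars` is a hypothetical helper (not part of QuickUMLS), and the `suspect` string is an assumed stand-in for the third printed term, so the real character may differ.

```python
import unicodedata


def describe_chars(term):
    """Return (char, code point, Unicode name) for each non-ASCII character in term."""
    return [
        (ch, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in term
        if ord(ch) > 127
    ]


# Assumed stand-in for the third term printed above; the real value may differ.
suspect = "Fluorouracil\u00c2"

print(repr(suspect))            # repr() exposes characters a terminal may hide
print(describe_chars(suspect))  # lists only the non-ASCII characters
```

An empty list for a term means it is plain ASCII, so anything the helper reports is a candidate for the key that is missing from the database.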

soldni commented 5 years ago

Bevan,

Thank you for the bug report! Which python version, UMLS installation, and operating system are you running QuickUMLS on?

Best, Luca

bevankoopman commented 5 years ago

UMLS 2018AA, Python 3.5.1, macOS 10.12.6, QuickUMLS installed via pip

soldni commented 4 years ago

@bevankoopman,

I couldn't reproduce this error on my end, but I have added a try/except block to prevent the crash.
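
For readers following along, the general shape of such a guard might look like the sketch below. This is not the actual QuickUMLS change: `FakeCuiStore` is a hypothetical in-memory stand-in for the key-value store, and the CUI value is a placeholder. It only illustrates returning an empty result when a term is missing instead of letting the KeyError propagate.

```python
import pickle


class FakeCuiStore:
    """Hypothetical stand-in for the on-disk store; not QuickUMLS code."""

    def __init__(self, entries):
        # Keys are UTF-8 encoded terms, values are pickled sets of CUIs.
        self._data = {
            term.encode("utf-8"): pickle.dumps(cuis)
            for term, cuis in entries.items()
        }

    def get(self, term):
        """Return the CUIs stored for term, or an empty set if it is unknown."""
        try:
            return pickle.loads(self._data[term.encode("utf-8")])
        except KeyError:
            # Unknown term (e.g. one carrying a stray character): do not crash.
            return set()


store = FakeCuiStore({"fluorouracil": {"C0000000"}})  # placeholder CUI
print(store.get("fluorouracil"))
print(store.get("Fluorouracil\u00c2"))
```

Note that a guard like this hides the symptom only; the term with the stray character still yields no match.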

I've also added support for unqlite as an alternative to leveldb for storing CUIs and semantic types (see #54 for more details). Besides better multiprocessing and threading support, unqlite should handle Unicode better. You can try it in branch soldni/conc by installing QuickUMLS with the option -d unqlite.

-Luca