Georgetown-IR-Lab / QuickUMLS

System for Medical Concept Extraction and Linking
MIT License
369 stars 95 forks source link

Allow QuickUMLS to be used as component in spacy pipeline #57

Closed burgersmoke closed 4 years ago

burgersmoke commented 4 years ago

This can be used as an entity matcher for UMLS concepts in other modular spacy pipelines. This is already being used in an operational capacity for syndromic surveillance in a healthcare system. More info on this can be provided upon request.

This is in response to this issue: https://github.com/Georgetown-IR-Lab/QuickUMLS/issues/40

burgersmoke commented 4 years ago

Great. I appreciate all of this feedback since I know this was a rough PR.

One question: You mention moving spacy_example_pipeline.py. Are you proposing that it goes into /quickumls or into another location? Maybe something like /examples?

Nevermind. I did not understand the previous comment, but understood it later. I've moved the example code to the README instead of the example .py file.

Now that I have your feedback, I'll make these changes tonight or Tuesday and update the PR.

ljak commented 4 years ago

Went through it quickly, seems nice for first iteration ! Thanks for work.

burgersmoke commented 4 years ago

OK, I think this is ready to go now with b06b882. Besides the items above, I also added documentation, fixed some comments and fixed the hard-coded QuickUMLS arguments to be passed in as **kwargs instead. Now I feel comfortable with this PR. :)

burgersmoke commented 4 years ago

Any update on this? I believe I've resolved the code issue and remaining questions. Are there any other questions about this change I can answer? I'm wanting to move this forward since we're planning to leverage this to make medspacy and QuickUMLS better with some mutually beneficial upcoming work.

burgersmoke commented 4 years ago

Fantastic. Thank you @soldni !

soldni commented 4 years ago

@burgersmoke thank you for your hard work! will push out a proper release (v.1.5) momentarily.

burgersmoke commented 4 years ago

My pleasure. There's more to come as QuickUMLS is an important part of our plans for medspacy. I'll continue to follow up with you and anyone else interested.

soldni commented 4 years ago

I was doing some tests and notice that, in the latest version of spaCy (2.3), I get an error if two entities overlap:

Traceback (most recent call last):
  File "test.py", line 10, in <module>
    doc = nlp('Pt c/o shortness of breath, chest pain, nausea, vomiting, diarrrhea')
  File "/home/ubuntu/anaconda3/envs/quickumls/lib/python3.7/site-packages/spacy/language.py", line 449, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/home/ubuntu/qumls_1.4/QuickUMLS/quickumls/spacy_component.py", line 78, in __call__
    doc.ents = list(doc.ents) + [span]
  File "doc.pyx", line 553, in spacy.tokens.doc.Doc.ents.__set__
ValueError: [E103] Trying to set conflicting doc.ents: '(8, 10, 'C0008031')' and '(8, 10, 'C2926613')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

According to this GitHub issue, this seems to be the expected behavior since at least spaCy (2.1.3)

I am planning to solve this by either (a) adding a custom extension type called quickumls to docs, or (b) have each span be a list of matches. Any preference? In the first case, you'd access matches as follows:

for ent in doc._.quickumls:
    print(ent.text, ent.label_, ent._.semtypes, ent._.similarity)

In the latter, you'd use the following syntax:

for ent in doc.ents:
    for match in ent._.quickumls:
        print(ent.text, match.cui, match.semtypes, match.similarity)

I personally prefer the second one, as it makes more sense to me to just have an entity with multiple labels, but please do let me know which one you think would make more sense.

burgersmoke commented 4 years ago

You are absolutely correct and I debated raising it earlier, but I wanted to take steps towards the component being included first so that we could have this conversation. I've opened this as a new issue here. I'll add my initial thoughts here:

https://github.com/Georgetown-IR-Lab/QuickUMLS/issues/60