Georgetown-IR-Lab / QuickUMLS

System for Medical Concept Extraction and Linking
MIT License
369 stars 95 forks source link

[BUG] spacy component ValueError: "A token can only be part of one entity" #60

Open burgersmoke opened 4 years ago

burgersmoke commented 4 years ago

Describe the bug When QuickUMLS concept matches occur over the same token, Spacy reports an error like the

To Reproduce Using QuickUMLS version 1.5 or higher, run the following sample. Note that if the matching threshold is set higher (e.g. 1.0) this exception may not occur.


nlp = spacy.load('en_core_web_sm')

quickumls_component = SpacyQuickUMLS(nlp, 'PATH_TO_QUICKUMLS_DATA', threshold = 0.25)

nlp.add_pipe(quickumls_component)

doc = nlp('Pt c/o shortness of breath, chest pain, nausea, vomiting, diarrrhea')

Environment

Additional context @soldni originally reported in this pull request.

The comments are reproduced here:

I was doing some tests and notice that, in the latest version of spaCy (2.3), I get an error if two entities overlap:

Traceback (most recent call last): File "test.py", line 10, in doc = nlp('Pt c/o shortness of breath, chest pain, nausea, vomiting, diarrrhea') File "/home/ubuntu/anaconda3/envs/quickumls/lib/python3.7/site-packages/spacy/language.py", line 449, in call doc = proc(doc, **component_cfg.get(name, {})) File "/home/ubuntu/qumls_1.4/QuickUMLS/quickumls/spacy_component.py", line 78, in call doc.ents = list(doc.ents) + [span] File "doc.pyx", line 553, in spacy.tokens.doc.Doc.ents.set ValueError: [E103] Trying to set conflicting doc.ents: '(8, 10, 'C0008031')' and '(8, 10, 'C2926613')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap. According to this GitHub issue, this seems to be the expected behavior since at least spaCy (2.1.3)

I am planning to solve this by either (a) adding a custom extension type called quickumls to docs, or (b) have each span be a list of matches. Any preference? In the first case, you'd access matches as follows:

for ent in doc..quickumls: print(ent.text, ent.label, ent..semtypes, ent..similarity) In the latter, you'd use the following syntax:

for ent in doc.ents: for match in ent._.quickumls: print(ent.text, match.cui, match.semtypes, match.similarity) I personally prefer the second one, as it makes more sense to me to just have an entity with multiple labels, but please do let me know which one you think would make more sense.

burgersmoke commented 4 years ago

This is indeed one of the conventions of current spaCy versions as you noted from this GitHub issue.

So yes, there are a few ways to expose more of the high level match data so that multiple CUI matches might be attached to each entity. I like the idea of adding additional metadata about candidate matches.

That said, I also really like the idea of entities that can be emitted and used "out of the box" for other downstream tasks such as visualization (e.g. displaCy), and later stage processing like what our group is doing with medspacy. Specifically, I'm thinking that it would be valuable to keep within spaCy conventions as much as possible so that downstream components like cycontext that @abchapman93 has been working on.

So I still owe some extra due diligence and testing on this, but right now my proposal is going to be whether we can do both.

In other words, as a general principle, could we emit spaCy-standard entities while also adding extensions in the "underscore" but not make this necessary?

Finally, I wanted to do some better debugging to find out why these overlap matches were still returned when I set the overlapping_criteria. I'll do my part on this question and report back.

burgersmoke commented 4 years ago

@soldni I apologize. Sometimes I need to type something out before I realize my mistake. In this case, I previously read your documentation incorrectly. Specifically, I just looked at what happens with overlapping_criteria in implementation. I see now that there's not a "winner" selected, but this drives a sorting function. This makes sense.

So here's what I propose: I'm going to change SpacyQuickUMLS to use the ordering of matches as criteria for the component to prevent overlapping entities. I see this as "minimum viable" and then I could take more time to think about how to add additional matching information via an extension to the Span.

How does this plan sound?