Open burgersmoke opened 4 years ago
This is indeed one of the conventions of current spaCy versions as you noted from this GitHub issue.
So yes, there are a few ways to expose more of the high-level match data so that multiple CUI matches can be attached to each entity. I like the idea of adding additional metadata about candidate matches.
That said, I also really like the idea of entities that can be emitted and used "out of the box" for other downstream tasks such as visualization (e.g. displaCy) and later-stage processing like what our group is doing with medspacy. Specifically, I think it would be valuable to stay within spaCy conventions as much as possible so that downstream components, such as the cycontext component that @abchapman93 has been working on, keep working without changes.
So I still owe some extra due diligence and testing on this, but right now my proposal is to see whether we can do both.
In other words, as a general principle, could we emit spaCy-standard entities while also adding extensions in the "underscore", without making those extensions required?
Finally, I wanted to do some better debugging to find out why these overlapping matches were still returned when I set the `overlapping_criteria`. I'll do my part on this question and report back.
@soldni I apologize. Sometimes I need to type something out before I realize my mistake. In this case, I previously read your documentation incorrectly. Specifically, I just looked at what happens with `overlapping_criteria` in the implementation. I see now that no "winner" is selected; instead, the setting drives a sorting function. This makes sense.
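To make the distinction concrete, here is a minimal sketch of what "drives a sorting function" means. It is not QuickUMLS's actual code: the `sort_matches` helper, the token offsets, and the match dictionaries are hypothetical, and the exact tie-breaking rules are an assumption.

```python
from typing import Any, Dict, List

Match = Dict[str, Any]


def sort_matches(matches: List[Match], overlapping_criteria: str = "score") -> List[Match]:
    """Order candidate matches best-first without discarding any of them.

    Hypothetical helper: `start`/`end` are token offsets (end exclusive)
    and `similarity` is the match score. Tie-breaking is an assumption.
    """
    def length(m: Match) -> int:
        return m["end"] - m["start"]

    if overlapping_criteria == "score":
        key = lambda m: (m["similarity"], length(m))
    elif overlapping_criteria == "length":
        key = lambda m: (length(m), m["similarity"])
    else:
        raise ValueError(f"unknown overlapping_criteria: {overlapping_criteria!r}")
    return sorted(matches, key=key, reverse=True)


# Two overlapping candidates with made-up names and scores.
long_match = {"name": "long", "start": 0, "end": 3, "similarity": 0.8}
short_match = {"name": "short", "start": 1, "end": 2, "similarity": 1.0}

by_score = sort_matches([long_match, short_match], "score")
by_length = sort_matches([long_match, short_match], "length")
```

In both cases the two overlapping candidates are still returned; only their order changes, which is why the caller has to decide what to do with overlaps.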
So here's what I propose: I'm going to change `SpacyQuickUMLS` to use the ordering of matches as the criterion for preventing overlapping entities. I see this as "minimum viable", and then I could take more time to think about how to add additional matching information via an extension to the `Span`.
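As a rough illustration of the "minimum viable" idea, the sketch below walks an already-ordered list of matches and keeps a match only if none of its tokens has been claimed by an earlier (better-ranked) one. This is an assumption about how such a filter could look, not the eventual `SpacyQuickUMLS` change; the function name, the token-offset fields, and the CUIs are placeholders.

```python
from typing import Any, Dict, List, Set

Match = Dict[str, Any]


def select_non_overlapping(ordered_matches: List[Match]) -> List[Match]:
    """Greedily keep matches whose tokens are not already taken.

    Assumes `ordered_matches` is sorted best-first and that each match
    carries hypothetical `start`/`end` token offsets (end exclusive).
    """
    claimed: Set[int] = set()
    kept: List[Match] = []
    for match in ordered_matches:
        tokens = set(range(match["start"], match["end"]))
        if tokens & claimed:
            # An earlier, better-ranked match already covers a token here.
            continue
        claimed |= tokens
        kept.append(match)
    return kept


# Placeholder CUIs and scores, listed best-first.
candidates = [
    {"start": 0, "end": 2, "cui": "C0000001", "similarity": 1.0},
    {"start": 1, "end": 2, "cui": "C0000002", "similarity": 0.8},
    {"start": 3, "end": 4, "cui": "C0000003", "similarity": 0.9},
]
entities = select_non_overlapping(candidates)
```

Only the kept matches would become entities, so no two entities share a token; the skipped candidates are the ones that could later be exposed through a `Span` extension.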
How does this plan sound?
Describe the bug
When QuickUMLS concept matches occur over the same token, spaCy reports an error like the

To Reproduce
Using QuickUMLS version 1.5 or higher, run the following sample. Note that if the matching threshold is set higher (e.g. 1.0), this exception may not occur.

Environment

Additional context
@soldni originally reported this in this pull request.
The comments are reproduced here: