This PR makes several small tweaks to handle overlapping/nested entities better. First, we now take the offset of an entity to be the sum of the start and end character offsets of its first mention (as opposed to just the end character offset). Second, `_search_ent` was refactored (and renamed to `_find_first_mention`) to accept keyword arguments for `Pattern.search`. This allows us to provide `pos` and `endpos` indices to restrict the search to a region of the text, leading to more accurate offsets. Together, these changes lead to better sorting of nested/overlapping entities.
TODO
[ ] Test on all datasets, and check the impact on performance. We only expect this to impact datasets with nested/compound entities.