JohnGiorgi / seq2rel-ds

This is a companion repository to seq2rel (https://github.com/JohnGiorgi/seq2rel) which aims to make it easy to generate training data.

Better sorting #49

Closed JohnGiorgi closed 2 years ago

JohnGiorgi commented 2 years ago

Overview

This PR makes several small tweaks to better handle overlapping/nested entities. First, we now take the offset of an entity to be the sum of the start and end character offsets of its first mention (as opposed to just the end character offset). Second, _search_ent was refactored (and renamed to _find_first_mention) to accept keyword arguments that are forwarded to Pattern.search. This allows us to provide pos and endpos indices to restrict the search to a region of the text, leading to more accurate offsets. Together, these changes lead to better sorting of nested/overlapping entities.
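
To illustrate the idea, here is a minimal sketch rather than the code from this PR. The names find_first_mention and offset_key, the whole-word pattern, and the example text are assumptions made for illustration. Only the pos and endpos arguments of Pattern.search come from Python's re module.

```python
import re
from typing import Optional, Tuple


def find_first_mention(mention: str, text: str, **kwargs) -> Optional[Tuple[int, int]]:
    """Return the (start, end) span of the first whole-word match of `mention`.

    Hypothetical helper: extra keyword arguments (e.g. pos, endpos) are
    forwarded unchanged to Pattern.search.
    """
    pattern = re.compile(rf"\b{re.escape(mention)}\b")
    match = pattern.search(text, **kwargs)
    return match.span() if match else None


def offset_key(span: Tuple[int, int]) -> int:
    """Sort key: the sum of the start and end character offsets."""
    start, end = span
    return start + end


text = "lung cancer and cancer"

# "cancer" is nested inside "lung cancer", so both first mentions end at 11.
outer = find_first_mention("lung cancer", text)  # (0, 11)
inner = find_first_mention("cancer", text)       # (5, 11)

# Sorting on the end offset alone would tie; the sum breaks the tie.
spans = {"cancer": inner, "lung cancer": outer}
ordered = sorted(spans, key=lambda name: offset_key(spans[name]))

# Restricting the search window skips the nested mention entirely.
later = find_first_mention("cancer", text, pos=12)  # (16, 22)
```

With end offsets alone, the two entities above would compare equal. With the summed key, the outer entity (0 + 11 = 11) sorts before the nested one (5 + 11 = 16).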

TODO