The code currently assumes `sentDetect` splits text into contiguous segments. This is not guaranteed. In particular, it will skip over extra whitespace. For instance,
The first three sentences are offset correctly. However, the extra spaces at the end of the line after `three.` and the extra space after `four.` are not included in the results of `sentDetect`, so the offsets for sentences four and five are incorrect. As a consequence, since brat checks spans against their contents, extra whitespace in input files results in unusable (erroneous) `.ann` files.
Instead, I suggest using `sentPosDetect`, which returns the exact spans where the sentences were found.
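
To make the failure mode concrete, here is a minimal Python sketch. It does not call the real sentence detector: `toy_sent_detect` and `toy_sent_pos_detect` are hypothetical regex-based stand-ins that only mimic the behaviour described above (strings without surrounding whitespace versus exact spans), and `naive_offsets` is my guess at the kind of bookkeeping that breaks, assuming exactly one separator character between sentences. The sample text is made up for illustration.

```python
import re

# Hypothetical stand-in pattern: a sentence starts at a non-whitespace
# character and runs up to and including the next period.
SENTENCE = re.compile(r"[^\s.][^.]*\.")


def toy_sent_detect(text):
    """Mimics sentDetect: sentence strings only, whitespace between them is dropped."""
    return [m.group() for m in SENTENCE.finditer(text)]


def toy_sent_pos_detect(text):
    """Mimics sentPosDetect: exact (start, end) spans into the original text."""
    return [(m.start(), m.end()) for m in SENTENCE.finditer(text)]


def naive_offsets(sentences):
    """Rebuild offsets from sentence lengths, assuming one separator character
    between consecutive sentences (an assumption about the current code)."""
    spans, pos = [], 0
    for sent in sentences:
        spans.append((pos, pos + len(sent)))
        pos += len(sent) + 1
    return spans


# Made-up input: trailing spaces after "Three." and an extra space after "Four."
text = "One. Two. Three.  \nFour.  Five."

sentences = toy_sent_detect(text)
naive = naive_offsets(sentences)
exact = toy_sent_pos_detect(text)

for sent, (ns, ne), (es, ee) in zip(sentences, naive, exact):
    status = "ok" if text[ns:ne] == sent else "MISMATCH"
    print(f"{sent!r}: naive=({ns}, {ne}) exact=({es}, {ee}) {status}")
```

In this sketch the naive spans for the last two sentences no longer cover the sentence text, which is the kind of span/content mismatch brat rejects, while the exact spans always slice back to the sentence they describe.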