As the title suggests, this PR fixes a significant bug that prevented sentences from being created properly. The cause was that, when creating a SpanGroup object, merged spans were incorrectly parsed as a uuid instead of as a span attribute.
Besides the bug fix above, this PR also improves how words are handled during sentence segmentation: instead of using only the first symbol of a word/token, it now uses the `text` attribute when available, and joins all of the word's symbols otherwise.
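
For reviewers, the new word-handling rule can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Word` class and the `word_text` helper are hypothetical stand-ins, and it assumes that "not available" means the `text` attribute is `None`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    # Hypothetical stand-in for a word/token annotation (not the library's class).
    symbols: List[str] = field(default_factory=list)
    text: Optional[str] = None


def word_text(word: Word) -> str:
    """Return the string passed to the sentence segmenter for this word."""
    # Prefer the explicit text attribute when it is set (assumption: None = unavailable).
    if word.text is not None:
        return word.text
    # Otherwise join every symbol, rather than keeping only the first one.
    return "".join(word.symbols)
```

Under the old behavior described above, a word whose symbols are `["Hel", "lo"]` would have contributed only `"Hel"` to segmentation; with this rule it contributes the full joined string.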