allenai / mmda

multimodal document analysis
Apache License 2.0

Fix for `SpanGroup` creation in `PysbdSentenceBoundaryPredictor` that resulted in no sentences. #98

Closed by soldni 2 years ago

soldni commented 2 years ago

As the title suggests, this PR fixes a significant bug that prevented sentences from being created. The cause: when creating a `SpanGroup` object, the merged spans were incorrectly parsed as the uuid instead of as the span attribute, so the resulting group carried no spans.
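To illustrate this kind of failure, here is a minimal sketch. It does not use the real mmda classes: `SpanGroupSketch`, its field order, and the tuple-based `Span` are all hypothetical, chosen only to show how a value passed positionally can land on a uuid field and leave the spans empty.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical stand-in for a span: (start, end) character offsets.
Span = Tuple[int, int]


@dataclass
class SpanGroupSketch:
    # Hypothetical field order, picked only to reproduce the failure mode;
    # the real SpanGroup signature may differ.
    uuid: Optional[str] = None
    spans: List[Span] = field(default_factory=list)


merged = [(0, 12), (13, 30)]

# Buggy call: the positional argument binds to `uuid`, so `spans` stays empty
# and nothing downstream sees a sentence.
buggy = SpanGroupSketch(merged)

# Fixed call: naming the argument puts the merged spans where they belong.
fixed = SpanGroupSketch(spans=merged)
```

Passing the spans by keyword is one way to make this class of mistake harder to repeat, since a later change in argument order can no longer silently rebind the value.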

Besides the fix for the bug above, this PR also improves how words are handled during sentence segmentation: instead of using only the first symbol of a word/token, it now uses the text attribute when available, and joins all symbols when it is not.
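The word-handling rule can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: `WordSketch` and `word_text` are hypothetical names, and "available" is taken here to mean that `text` is not `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WordSketch:
    # Hypothetical stand-in for a word/token: one symbol string per
    # underlying span, plus an optional pre-computed text.
    symbols: List[str]
    text: Optional[str] = None


def word_text(word: WordSketch) -> str:
    """Return the string to feed to the sentence segmenter for one word."""
    if word.text is not None:
        # Prefer the text attribute when it is set.
        return word.text
    # Otherwise join every symbol, not just the first one.
    return "".join(word.symbols)


with_text = WordSketch(symbols=["multi-", "modal"], text="multimodal")
without_text = WordSketch(symbols=["multi", "modal"])
```

With only the first symbol, `without_text` would contribute just `"multi"` to the segmenter input; joining all symbols keeps the whole word.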