dwadden / dygiepp

Span-based system for named entity, relation, and event extraction.
MIT License

Add quality check for incorrect sentence splits #117

Closed serenalotreck closed 1 year ago

serenalotreck commented 1 year ago

Adds a feature that checks for and corrects incorrect sentence splits produced by spaCy/scispaCy, when those errors are indicated by a relation or an entity being split across multiple sentences in the original tokenization. I have noticed this most often in scientific text, around splits on periods that are actually part of abbreviations; for example, "Pseudomonas syringae pv. tabaci" is split at the period into two different sentences.

This correction prevents downstream errors when running DyGIE++ models: if left uncorrected, documents with incorrect sentence splits throw an exception, because the split entities and relations look like cross-sentence relations/entities.
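
To make the idea concrete, here is a minimal sketch of the kind of check described above. It is not the code from this PR. The function name `merge_split_sentences` and its arguments are hypothetical, and it assumes sentences are lists of tokens and that spans are `(start, end)` pairs of document-level token indices with an inclusive end. A relation can be passed as one span that runs from the start of its earlier entity to the end of its later one.

```python
from itertools import accumulate


def merge_split_sentences(sentences, spans):
    """Merge adjacent sentences whenever a span crosses their boundary.

    Hypothetical helper for illustration only.
    sentences: list of token lists.
    spans: list of (start, end) document-level token indices, end inclusive.
    """
    # starts[i] is the document-level index of the first token of sentence i.
    starts = [0] + list(accumulate(len(s) for s in sentences))[:-1]

    def sentence_of(token_idx):
        # Index of the last sentence whose first token is at or before token_idx.
        return max(i for i, st in enumerate(starts) if st <= token_idx)

    # merge_with_prev[i] is True when sentence i must be joined to sentence i - 1.
    merge_with_prev = [False] * len(sentences)
    for start, end in spans:
        first, last = sentence_of(start), sentence_of(end)
        for i in range(first + 1, last + 1):
            merge_with_prev[i] = True

    merged = []
    for sent, join in zip(sentences, merge_with_prev):
        if join and merged:
            merged[-1] = merged[-1] + sent
        else:
            merged.append(list(sent))
    return merged


# The abbreviation "pv." caused a split in the middle of one entity (tokens 0-3).
sentences = [
    ["Pseudomonas", "syringae", "pv."],
    ["tabaci", "infects", "tobacco", "."],
    ["It", "was", "isolated", "."],
]
fixed = merge_split_sentences(sentences, [(0, 3)])
print(len(fixed))  # the first two sentences are merged, the third is untouched
```

Because document-level token indices do not change when sentences are concatenated, the entity and relation annotations themselves would not need to be renumbered under these assumptions; only the sentence grouping changes.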