delph-in / pydelphin

Python libraries for DELPH-IN
https://pydelphin.readthedocs.io/
MIT License

Implement REPP masking #331

Closed by goodmami 3 years ago

goodmami commented 3 years ago

The latest ERG makes use of the new 'mask' operator (=) for REPP, as described in the email thread starting here:

http://lists.delph-in.net/archives/developers/2020/003107.html

Essentially, substrings matching a mask pattern are protected from further modification. For example, the following rule masks email addresses so that later punctuation-splitting rules do not break them up:

=<?[\p{L}\p{N}._-]+@[\p{L}\p{N}_-]+(?:\.[\p{L}\p{N}_-]+)*\.[\p{L}\p{N}]+>?
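To illustrate the intended effect, here is a rough sketch in plain Python. It is not pydelphin's implementation. The standard-library `re` module does not support `\p{L}` or `\p{N}`, so the pattern below approximates them with `\w`, and the punctuation rule and the `apply_with_mask` helper are hypothetical names made up for this example. It also simplifies by rewriting only the unmasked stretches, so a rule whose match would straddle a mask boundary is not modeled.

```python
import re

# Approximation of the mask pattern above for the stdlib `re` module:
# \w stands in for [\p{L}\p{N}_], and [^\W_] for [\p{L}\p{N}].
EMAIL_MASK = re.compile(r"<?[\w.-]+@[\w-]+(?:\.[\w-]+)*\.[^\W_]+>?")

# Hypothetical later rule: put spaces around punctuation characters.
SPLIT_PUNCT = re.compile(r"([.,!?@<>])")


def apply_with_mask(text, mask, rule, repl):
    """Apply `rule` to `text`, copying substrings matched by `mask` verbatim."""
    out = []
    pos = 0
    for m in mask.finditer(text):
        # Unmasked stretch before this mask: the rule may rewrite it.
        out.append(rule.sub(repl, text[pos:m.start()]))
        # Masked stretch: left untouched.
        out.append(m.group(0))
        pos = m.end()
    # Unmasked tail after the last mask.
    out.append(rule.sub(repl, text[pos:]))
    return "".join(out)


text = "Mail bob@example.com, please."
masked = apply_with_mask(text, EMAIL_MASK, SPLIT_PUNCT, r" \1 ")
unmasked = SPLIT_PUNCT.sub(r" \1 ", text)
```

With the mask, the address survives intact in `masked`, while the comma and final period are still split off. Without it, `unmasked` has the address broken apart at `@` and `.`.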

Masked sections can be tracked with a BIO sequential-tagging scheme, so that adjacent masks stay distinct even when content is inserted between them.
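A minimal sketch of the BIO idea, assuming one tag per character: `B` marks the first character of a masked span, `I` the rest of the span, and `O` anything unmasked. The function names (`bio_tags`, `masked_spans`, `insert_unmasked`) are hypothetical and only for illustration. A toy two-digit mask is used so that two masks end up directly adjacent.

```python
import re


def bio_tags(text, mask):
    """Tag each character as B (mask start), I (inside mask), or O (outside)."""
    tags = ["O"] * len(text)
    for m in mask.finditer(text):
        if m.start() == m.end():
            continue  # ignore empty matches
        tags[m.start()] = "B"
        for i in range(m.start() + 1, m.end()):
            tags[i] = "I"
    return tags


def masked_spans(tags):
    """Recover (start, end) spans from the tags; a B always opens a new span."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag != "I" and start is not None:
            spans.append((start, i))
            start = None
        if tag == "B":
            start = i
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def insert_unmasked(text, tags, index, new):
    """Insert unmasked text at `index`, keeping the tags aligned with the text."""
    new_text = text[:index] + new + text[index:]
    new_tags = tags[:index] + ["O"] * len(new) + tags[index:]
    return new_text, new_tags


toy_mask = re.compile(r"\d\d")           # toy mask: any two digits
text = "1234"
tags = bio_tags(text, toy_mask)          # two adjacent masks: "12" and "34"
spans_before = masked_spans(tags)

text2, tags2 = insert_unmasked(text, tags, 2, " ")
spans_after = masked_spans(tags2)
```

A plain masked/unmasked flag per character would merge `12` and `34` into one span. With the `B` tag, the boundary between them is preserved, so inserting a space at that boundary leaves both masks intact and simply shifts the second one.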