FrankensteinVariorum / fv-collation

First-stage collation processing in the Frankenstein Variorum Project. For post-processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum
https://frankensteinvariorum.github.io/fv-collation/
GNU Affero General Public License v3.0

pre-processing to anticipate empty tokens and spliced tokens #80

Open ebeshero opened 2 years ago

ebeshero commented 2 years ago

This issue is an attempt to evaluate why the tokenization and normalization process generates empty tokens and spliced tokens in the first place. Can we review the tokenization process up close, checking the following?

Elements in the ignore list, such as `<add>`, may be removed together with the space that follows them. This can fuse the preceding and following grams into a single spliced token.

Elements in the inlineEmpty list, such as `<lb/>`, may be removed in a way that preserves the spaces on both sides, leaving a doubled space that gets interpreted as an empty token.
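As a way of checking both hypotheses, here is a minimal, simplified sketch (not the project's actual pipeline code) of how naive element stripping can produce exactly these two failure modes, assuming a whitespace-split tokenizer:

```python
import re

def strip_ignored(text):
    # Hypothetical ignore-list handling: removes an <add> element
    # together with ONE following space. If the element directly
    # abuts the preceding word, the neighboring words fuse.
    return re.sub(r'<add>.*?</add> ', '', text)

def strip_inline_empty(text):
    # Hypothetical inlineEmpty handling: removes <lb/> but keeps
    # the spaces on both sides, leaving a doubled space.
    return re.sub(r'<lb/>', '', text)

def tokenize(text):
    # Simplified tokenizer: split on single spaces.
    return text.split(' ')

# Spliced token: <add> adjoins the preceding word, and the space
# after the element is consumed along with it.
print(tokenize(strip_ignored('monster<add> hideous</add> fled')))
# → ['monsterfled']

# Empty token: the doubled space splits into an empty string.
print(tokenize(strip_inline_empty('he <lb/> ran')))
# → ['he', '', 'ran']
```

If the real pipeline's behavior matches either of these toy cases, that would confirm the corresponding hypothesis above.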

Solutions:
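One candidate pre-processing approach (a sketch under assumptions, not a decided fix for the project) is to replace each stripped element with a single space and then collapse whitespace runs before tokenizing, so neither fused nor empty tokens can arise:

```python
import re

def strip_safely(text):
    # Replace ignored elements (here, hypothetically, <add>...</add>)
    # and inlineEmpty elements (here <lb/>) with a single space,
    # rather than deleting them outright.
    text = re.sub(r'<add>.*?</add>|<lb/>', ' ', text)
    # Collapse any resulting whitespace runs and trim the ends,
    # so a whitespace-split tokenizer yields no empty tokens.
    return re.sub(r'\s+', ' ', text).strip()

print(strip_safely('monster<add> hideous</add> fled').split(' '))
# → ['monster', 'fled']
print(strip_safely('he <lb/> ran').split(' '))
# → ['he', 'ran']
```

The element names and regexes here stand in for whatever the real ignore and inlineEmpty lists contain; the point is the replace-then-collapse order, which makes the stripping step safe regardless of how the element sits relative to surrounding spaces.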