FrankensteinVariorum / fv-collation

First-stage collation processing in the Frankenstein Variorum Project. For post-processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum
https://frankensteinvariorum.github.io/fv-collation/
GNU Affero General Public License v3.0

pre-processing to anticipate empty tokens and spliced tokens #80

Open ebeshero opened 2 years ago

ebeshero commented 2 years ago

This issue is an attempt to evaluate why the tokenization and normalization process generates empty tokens and spliced tokens in the first place. Can we review the tokenization process up close, checking the following?

Elements in the ignore list, such as `<add>`, may be removed together with the space that follows them. This can fuse the preceding and following grams into a single spliced token.

Elements in the inlineEmpty list, such as `<lb/>`, may be removed in a way that preserves the spaces on both sides, leaving a doubled space that gets interpreted as an empty token.
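As a way of checking both hypotheses, here is a minimal, simplified sketch (not the project's actual pipeline code) of how naive element stripping can produce exactly these two failure modes, assuming a whitespace-split tokenizer:

```python
import re

def strip_ignored(text):
    # Hypothetical ignore-list handling: removes an <add> element
    # together with ONE following space. If the element directly
    # abuts the preceding word, the neighboring words fuse.
    return re.sub(r'<add>.*?</add> ', '', text)

def strip_inline_empty(text):
    # Hypothetical inlineEmpty handling: removes <lb/> but keeps
    # the spaces on both sides, leaving a doubled space.
    return re.sub(r'<lb/>', '', text)

def tokenize(text):
    # Simplified tokenizer: split on single spaces.
    return text.split(' ')

# Spliced token: <add> adjoins the preceding word, and the space
# after the element is consumed along with it.
print(tokenize(strip_ignored('monster<add> hideous</add> fled')))
# → ['monsterfled']

# Empty token: the doubled space splits into an empty string.
print(tokenize(strip_inline_empty('he <lb/> ran')))
# → ['he', '', 'ran']
```

If the real pipeline's behavior matches either of these toy cases, that would confirm the corresponding hypothesis above.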

Solutions:
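One candidate pre-processing approach (a sketch under assumptions, not a decided fix for the project) is to replace each stripped element with a single space and then collapse whitespace runs before tokenizing, so neither fused nor empty tokens can arise:

```python
import re

def strip_safely(text):
    # Replace ignored elements (here, hypothetically, <add>...</add>)
    # and inlineEmpty elements (here <lb/>) with a single space,
    # rather than deleting them outright.
    text = re.sub(r'<add>.*?</add>|<lb/>', ' ', text)
    # Collapse any resulting whitespace runs and trim the ends,
    # so a whitespace-split tokenizer yields no empty tokens.
    return re.sub(r'\s+', ' ', text).strip()

print(strip_safely('monster<add> hideous</add> fled').split(' '))
# → ['monster', 'fled']
print(strip_safely('he <lb/> ran').split(' '))
# → ['he', 'ran']
```

The element names and regexes here stand in for whatever the real ignore and inlineEmpty lists contain; the point is the replace-then-collapse order, which makes the stripping step safe regardless of how the element sits relative to surrounding spaces.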