BadWord list generation

Downstream tokenizers should be able to generate a list of "bad words" that upstream tokenizers will use to invalidate matches.

Suppose, for example, that we have an entity tokenizer that is aware of the entity "medium marble", followed by an attribute tokenizer that is aware of the attribute "medium". The upstream entity tokenizer should never report "medium marble" as a match for the input "medium", because that match consists solely of bad words from the attribute tokenizer. Reporting "medium marble" as a match for the input "marble" would be fine, because "marble" would not be on the attribute tokenizer's list of bad words.
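A minimal sketch of the rule described above, in TypeScript. The names `words`, `generateBadWords`, and `isValidMatch` are hypothetical and not taken from the ShortOrder codebase. The sketch also assumes that a match can be represented as the list of input words it consumed, and that a tokenizer's bad words are simply the words appearing in its aliases.

```typescript
// Hypothetical helpers for illustration only; not the ShortOrder API.

// Split a phrase into lowercase words, dropping empty fragments.
function words(text: string): string[] {
    return text.toLowerCase().split(/\s+/).filter(w => w.length > 0);
}

// Assumption: a downstream tokenizer's bad words are the words in its aliases.
function generateBadWords(aliases: string[]): Set<string> {
    const bad = new Set<string>();
    for (const alias of aliases) {
        for (const w of words(alias)) {
            bad.add(w);
        }
    }
    return bad;
}

// An upstream match is valid only if at least one matched word is NOT a bad word.
// A match made up solely of bad words (or of no words at all) is rejected.
function isValidMatch(matchedWords: string[], badWords: Set<string>): boolean {
    return matchedWords.some(w => !badWords.has(w));
}

// The attribute tokenizer knows "medium", so "medium" becomes a bad word.
const badWords = generateBadWords(['medium']);

// Entity "medium marble" matched only on the input word "medium": rejected.
const mediumOnly = isValidMatch(words('medium'), badWords);

// Entity "medium marble" matched on the input word "marble": accepted.
const marbleOnly = isValidMatch(words('marble'), badWords);

console.log({ mediumOnly, marbleOnly });
```

With this check, the entity tokenizer would filter its candidate matches through `isValidMatch` before reporting them, using the bad word list supplied by the tokenizers downstream of it.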

MikeHopcroft / ShortOrder