Downstream tokenizers should be able to generate a list of "bad words" that upstream tokenizers will use to invalidate matches.
Suppose, for example, that we have an entity tokenizer that is aware of the entity medium marble, followed by an attribute tokenizer that is aware of the attribute medium. The upstream entity tokenizer should never report medium marble as a match for medium, because the matched text consists solely of bad words from the attribute tokenizer. Reporting medium marble as a match for marble would be fine, because marble wouldn't be on the attribute tokenizer's list of bad words.
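
The rule above could be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing API: the function name is_valid_match, the token-list representation of a match, and the attribute_bad_words set are all hypothetical.

```python
def is_valid_match(matched_tokens, bad_words):
    """Hypothetical check for an upstream tokenizer.

    A match is kept only if at least one matched token is NOT on the
    downstream tokenizer's bad-word list. An empty match is rejected.
    """
    return any(token not in bad_words for token in matched_tokens)


# Assumed bad-word list generated by the downstream attribute tokenizer.
attribute_bad_words = {"medium"}

# Input "medium" overlaps the entity "medium marble" only on "medium".
print(is_valid_match(["medium"], attribute_bad_words))  # False

# Input "marble" overlaps the entity "medium marble" only on "marble".
print(is_valid_match(["marble"], attribute_bad_words))  # True
```

With this check, a match is invalidated only when every matched word is a bad word, so a full match on medium marble would still be reported.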