Anders429 / word_filter

A Word Filter for filtering text.
Apache License 2.0

Separator settings #62

Closed Anders429 closed 3 years ago

Anders429 commented 3 years ago

In profanity_filter, some issues have been encountered regarding exceptions and separators. Matching separators inside exceptions causes a lot of false positives (e.g. "eat ass" doesn't get censored to "eat ***" because, with the space treated as a separator, "t ass" matches the exception "tass"). It may be useful to allow settings for separators, so that they can be turned off within exceptions (or within words, if desired), be required at the beginning of exceptions (to indicate a new "word"), etc.
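To make the failure mode concrete, here is a minimal sketch of separator-tolerant matching. It is not how word_filter matches internally (the crate uses an automaton); the function names and the choice of separator characters are assumptions made purely for illustration.

```rust
/// Hypothetical separator set; the real set is configurable by the user.
fn is_separator(c: char) -> bool {
    c == ' ' || c == '_' || c == '-'
}

/// Checks whether `pattern` matches at the very start of `text`.
/// When `allow_separators` is true, separators may appear between
/// pattern characters (but not before the first one).
fn matches_at(text: &[char], pattern: &[char], allow_separators: bool) -> bool {
    let mut t = 0;
    for (i, &p) in pattern.iter().enumerate() {
        if allow_separators && i > 0 {
            while t < text.len() && is_separator(text[t]) {
                t += 1;
            }
        }
        if t >= text.len() || text[t] != p {
            return false;
        }
        t += 1;
    }
    true
}

/// Checks whether `pattern` matches starting at any position in `text`.
fn contains(text: &str, pattern: &str, allow_separators: bool) -> bool {
    let text: Vec<char> = text.chars().collect();
    let pattern: Vec<char> = pattern.chars().collect();
    (0..text.len()).any(|start| matches_at(&text[start..], &pattern, allow_separators))
}
```

With separators allowed, `contains("eat ass", "tass", true)` is true, so the exception swallows the word. With separators turned off for exceptions, `contains("eat ass", "tass", false)` is false and the word can be censored as expected.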

Anders429 commented 3 years ago

To have different settings apply to words and exceptions, internally words and exceptions will need to be separated into two PDAs. This will only be necessary if the settings for words and exceptions are different, although initially we can always separate them to keep things simple.

The actual settings can be provided using a bitflag struct (with the bitflags crate).
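As a rough sketch of what such a settings type could look like: the block below is a hand-rolled stand-in that mimics the shape of a type generated by the bitflags crate, so that it compiles without dependencies. The flag names `BETWEEN_WORDS` and `BETWEEN_EXCEPTIONS` and the helper `needs_separate_pdas` are hypothetical and not taken from the crate.

```rust
use std::ops::BitOr;

/// Hypothetical separator settings, written by hand in the style of
/// a `bitflags`-generated struct.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SeparatorFlags(u8);

impl SeparatorFlags {
    /// Allow separators between the characters of filtered words.
    pub const BETWEEN_WORDS: Self = Self(0b01);
    /// Allow separators between the characters of exceptions.
    pub const BETWEEN_EXCEPTIONS: Self = Self(0b10);

    pub const fn empty() -> Self {
        Self(0)
    }

    pub const fn all() -> Self {
        Self(0b11)
    }

    /// True if every bit set in `other` is also set in `self`.
    pub const fn contains(self, other: Self) -> bool {
        self.0 & other.0 == other.0
    }
}

impl BitOr for SeparatorFlags {
    type Output = Self;

    fn bitor(self, rhs: Self) -> Self {
        Self(self.0 | rhs.0)
    }
}

/// Words and exceptions only need separate PDAs when their
/// separator settings differ.
pub fn needs_separate_pdas(flags: SeparatorFlags) -> bool {
    flags.contains(SeparatorFlags::BETWEEN_WORDS)
        != flags.contains(SeparatorFlags::BETWEEN_EXCEPTIONS)
}
```

Under these assumptions, `needs_separate_pdas(SeparatorFlags::BETWEEN_WORDS)` is true, while `SeparatorFlags::all()` and `SeparatorFlags::empty()` would both allow a single shared PDA.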

Anders429 commented 3 years ago

Initial work implemented on the separator branch. There are still some loose ends with aliases, however. It may make sense to separate aliases into two sets, one for words and one for exceptions, and use the respective separator flag settings for both. If they are the same, then they'll be merged at the minimization step anyway, but if they're different then the difference will be preserved within aliases.

Additionally, for future-proofing, it may be good to use constants for the state indices in the main crate. They should match the same constant values in the codegen crate. I hate to consider making a word_filter_shared crate just for constants, but that would be the simplest solution to make sure these values are always in step. How little is too little for a crate?

Anders429 commented 3 years ago

A small problem that should be easily fixable: we need to make sure the states at reserved indices are not merged together during minimization. For example, if the word and exception root states are identical, they will be merged, causing the separator state to become the new exception state and some other arbitrary state to become the new separator state (or, in edge cases, leaving only two states, which causes a panic).

To solve this, the minimization logic needs to be altered to not merge the first three states.
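One way this could look, sketched on a simplified automaton rather than the crate's actual PDA: a Moore-style partition refinement in which each reserved state starts in, and stays in, its own class. The `State` type, the `RESERVED` constant, and the assumed ordering (word root, exception root, separator root) are illustrative only, and the real minimization code may use a different algorithm.

```rust
use std::collections::HashMap;

/// Number of reserved leading states (assumed order: word root,
/// exception root, separator root).
const RESERVED: usize = 3;

/// A simplified automaton state; the real PDA states carry more data.
struct State {
    accepting: bool,
    transitions: Vec<(char, usize)>,
}

/// Returns a class id per state. States sharing a class id may be
/// merged; reserved states always keep their own index as class id.
fn equivalence_classes(states: &[State]) -> Vec<usize> {
    let reserved = RESERVED.min(states.len());
    // Initial partition: each reserved state alone, the rest split
    // by whether they are accepting.
    let mut classes: Vec<usize> = states
        .iter()
        .enumerate()
        .map(|(i, s)| if i < reserved { i } else { reserved + s.accepting as usize })
        .collect();
    let mut class_count = 0;
    loop {
        let mut ids: HashMap<(usize, Vec<(char, usize)>), usize> = HashMap::new();
        let mut next = Vec::with_capacity(states.len());
        for (i, state) in states.iter().enumerate() {
            if i < reserved {
                // Reserved states are never merged with anything.
                next.push(i);
                continue;
            }
            // Signature: the classes reached on each input character.
            let mut signature: Vec<(char, usize)> = state
                .transitions
                .iter()
                .map(|&(c, target)| (c, classes[target]))
                .collect();
            signature.sort_unstable();
            let fresh = reserved + ids.len();
            next.push(*ids.entry((classes[i], signature)).or_insert(fresh));
        }
        let count = reserved + ids.len();
        classes = next;
        // Each round only refines the partition, so an unchanged
        // class count means it has stabilized.
        if count == class_count {
            return classes;
        }
        class_count = count;
    }
}
```

Because non-reserved states are only ever assigned class ids of `RESERVED` or above, identical word and exception roots stay at indices 0 and 1, and the separator root stays at index 2.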

Anders429 commented 3 years ago

Switching the default to have all separator flags set, since most users will want to allow separators between all characters. Users can opt out of this by changing the separator flags.

Anders429 commented 3 years ago

I have separated aliases into word and exception sets, so that each set can use the same separator settings as the PDA it is applied to.

This feature should be basically done. It will be released as version 0.7.0 after the separator branch is merged.

Anders429 commented 3 years ago

This has been merged and released as part of version 0.7.0.