Anders429 / word_filter

A Word Filter for filtering text.
Apache License 2.0

Censoring Full Graphemes #41

Closed Anders429 closed 3 years ago

Anders429 commented 3 years ago

A use case for separators is to prevent arbitrary combining characters being used to trick the filter. For example, a filter generated by

use word_filter_codegen::WordFilterGenerator;

WordFilterGenerator::new().word("foo").generate("FILTER");

can be tricked by the input `"f̃õõ"`. Adding the character `'\u{303}'` as a separator fixes this. However, `FILTER.censor("f̃õõ")` then results in the string `"***̃"`, with the final combining character still present.

The cause is that the algorithm ignores trailing separators. In most cases this is fine, but with grapheme clusters it's not ideal, since the trailing combining character is part of the same grapheme as the last matched letter.
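To make the behaviour concrete, here is a minimal toy sketch of a character-at-a-time censor that skips separators inside a match but leaves trailing ones alone. It is not how `word_filter` is implemented; `censor_toy` is a made-up name, and emitting one `*` per matched letter is an assumption chosen so the output lines up with the string quoted above.

```rust
/// Toy stand-in for the filter (NOT word_filter's real algorithm).
/// Censors `word` in `input`, skipping `separators` between matched letters.
fn censor_toy(input: &str, word: &str, separators: &[char]) -> String {
    let chars: Vec<char> = input.chars().collect();
    let target: Vec<char> = word.chars().collect();
    let mut out = String::new();
    let mut i = 0;
    while i < chars.len() {
        // Try to match `word` starting at position `i`.
        let (mut j, mut matched, mut end) = (i, 0, i);
        while j < chars.len() && matched < target.len() {
            if chars[j] == target[matched] {
                matched += 1;
                j += 1;
                end = j; // one past the last matched letter
            } else if matched > 0 && separators.contains(&chars[j]) {
                j += 1; // separator in the middle of a match
            } else {
                break;
            }
        }
        if !target.is_empty() && matched == target.len() {
            // Assumption: one '*' per matched letter.
            out.extend(std::iter::repeat('*').take(matched));
            // Separators after the last matched letter are not consumed.
            i = end;
        } else {
            out.push(chars[i]);
            i += 1;
        }
    }
    out
}

fn main() {
    let input = "f\u{303}o\u{303}o\u{303}";
    // Without separators the input slips through unchanged.
    println!("{}", censor_toy(input, "foo", &[]));
    // With U+0303 as a separator: three stars, then the leftover U+0303.
    println!("{}", censor_toy(input, "foo", &['\u{303}']));
}
```

The last tilde survives because it comes after the final matched letter, so the toy treats it as a trailing separator even though it belongs to the same grapheme.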

It may make more sense to handle the input one grapheme at a time instead of one character at a time. If, for example, "foõ" should be considered acceptable as an input (where no separators exist in the WordFilter), then handling a grapheme at a time would be the correct way to proceed. This may be desirable since "foõ" (as a single character rather than a grapheme cluster) would not match.
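A rough sketch of the difference between the two granularities, using only the standard library. The `clusters` helper is a deliberate simplification (a base character plus any following marks from U+0300..=U+036F); real grapheme segmentation follows UAX #29, for which something like the `unicode-segmentation` crate would be the usual choice. All function names here are hypothetical.

```rust
/// Simplification: treat only the Combining Diacritical Marks block
/// (U+0300..=U+036F) as combining. Real segmentation follows UAX #29.
fn is_combining(c: char) -> bool {
    ('\u{300}'..='\u{36F}').contains(&c)
}

/// Splits `s` into simplified clusters: a base char plus trailing marks.
fn clusters(s: &str) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for c in s.chars() {
        if is_combining(c) {
            if let Some(last) = out.last_mut() {
                last.push(c);
                continue;
            }
        }
        out.push(c.to_string());
    }
    out
}

/// Character-level search: `word` occurs as a run of chars.
fn contains_by_char(input: &str, word: &str) -> bool {
    input.contains(word)
}

/// Cluster-level search: `word` occurs as a run of whole clusters.
fn contains_by_cluster(input: &str, word: &str) -> bool {
    let (hay, needle) = (clusters(input), clusters(word));
    !needle.is_empty() && hay.windows(needle.len()).any(|w| w == needle.as_slice())
}

fn main() {
    let decomposed = "foo\u{303}"; // 'o' followed by a combining tilde
    let precomposed = "fo\u{F5}"; // single char U+00F5
    println!("{}", contains_by_char(decomposed, "foo")); // true
    println!("{}", contains_by_cluster(decomposed, "foo")); // false
    println!("{}", contains_by_char(precomposed, "foo")); // false
    println!("{}", contains_by_cluster(precomposed, "foo")); // false
}
```

At the cluster level the decomposed and precomposed spellings are treated the same way, which is the consistency argued for above.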

Anders429 commented 3 years ago

This is actually super tricky, because it is not a simple task to identify whether an isolated character is part of a grapheme or not.

It seems the easiest way to handle this is to include the concept of an "inclusive separator", which does not push the AppendedSeparator value to the stack upon return. This puts the burden on the user to identify which separators should be included at the end of matches, which allows for combining characters to be listed.
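As a sketch of what an "inclusive separator" could look like from the outside: none of these names exist in `word_filter`. `SeparatorKind`, `match_end`, and `example_kind` are invented for illustration, and the stack and AppendedSeparator handling is not modelled at all. The only thing shown is the observable effect, namely that trailing inclusive separators are pulled into the match while ordinary ones are not.

```rust
/// Hypothetical separator kinds; word_filter does not define this type.
#[derive(Clone, Copy, PartialEq)]
enum SeparatorKind {
    /// Trailing occurrences stay outside the match.
    Exclusive,
    /// Trailing occurrences are absorbed into the match.
    Inclusive,
}

/// `end` is the index one past the last matched letter. Returns the end of
/// the match after absorbing any trailing inclusive separators.
fn match_end(
    chars: &[char],
    mut end: usize,
    kind_of: impl Fn(char) -> Option<SeparatorKind>,
) -> usize {
    while end < chars.len() && kind_of(chars[end]) == Some(SeparatorKind::Inclusive) {
        end += 1;
    }
    end
}

/// Example configuration a user might supply (an assumption).
fn example_kind(c: char) -> Option<SeparatorKind> {
    match c {
        '\u{303}' => Some(SeparatorKind::Inclusive),
        ' ' => Some(SeparatorKind::Exclusive),
        _ => None,
    }
}

fn main() {
    let chars: Vec<char> = "foo\u{303} bar".chars().collect();
    // "foo" matched at indices 0..3; the tilde at index 3 is absorbed,
    // the space at index 4 is not.
    println!("{}", match_end(&chars, 3, example_kind)); // 4
}
```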

Anders429 commented 3 years ago

An alternative idea is to simply check during computation whether the grapheme was multiple characters. If it was, then don't check the AppendedSeparator on the stack. Otherwise, check the stack like normal.

That way, graphemes containing separators will still be handled appropriately. However, we'll still need to make sure that trailing graphemes consisting only of separators are not included in the match.
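One possible reading of this idea, sketched as a toy censor that works a cluster at a time. A cluster that contains a matched letter joins the match as a whole, separators included, while a cluster made up only of separators is skipped inside a match and left alone when trailing. This is an assumption about the intended behaviour, not the crate's code. The cluster rule is the same simplification as before (base char plus marks from U+0300..=U+036F), and `censor_by_cluster` is a hypothetical name.

```rust
/// Simplification: only U+0300..=U+036F counts as combining (see UAX #29).
fn is_combining(c: char) -> bool {
    ('\u{300}'..='\u{36F}').contains(&c)
}

/// Splits `s` into simplified clusters: a base char plus trailing marks.
fn clusters(s: &str) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for c in s.chars() {
        if is_combining(c) {
            if let Some(last) = out.last_mut() {
                last.push(c);
                continue;
            }
        }
        out.push(c.to_string());
    }
    out
}

/// Toy cluster-at-a-time censor (NOT word_filter's real algorithm).
fn censor_by_cluster(input: &str, word: &str, separators: &[char]) -> String {
    let cl = clusters(input);
    let target: Vec<char> = word.chars().collect();
    // Non-separator content of each cluster; empty means "separators only".
    let cores: Vec<String> = cl
        .iter()
        .map(|c| c.chars().filter(|ch| !separators.contains(ch)).collect::<String>())
        .collect();
    let mut out = String::new();
    let mut i = 0;
    while i < cl.len() {
        let (mut j, mut matched, mut end) = (i, 0, i);
        while j < cl.len() && matched < target.len() {
            if cores[j].is_empty() {
                if matched == 0 {
                    break; // a match cannot start on a separator
                }
                j += 1; // separator-only cluster inside a match
            } else if cores[j].chars().eq(std::iter::once(target[matched])) {
                matched += 1;
                j += 1;
                end = j; // the whole cluster joins the match
            } else {
                break;
            }
        }
        if !target.is_empty() && matched == target.len() {
            out.extend(std::iter::repeat('*').take(matched));
            i = end; // trailing separator-only clusters are not consumed
        } else {
            out.push_str(&cl[i]);
            i += 1;
        }
    }
    out
}

fn main() {
    // The trailing tilde is now part of the last matched cluster.
    println!("{}", censor_by_cluster("f\u{303}o\u{303}o\u{303}", "foo", &['\u{303}'])); // ***
    // A trailing space is its own cluster and is left in place.
    println!("{}", censor_by_cluster("foo bar", "foo", &[' '])); // *** bar
}
```

No second kind of separator is needed here: whether a separator is absorbed depends only on whether it shares a cluster with a matched letter.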

Ideally, I would like to not have to insert a new kind of separator. That's confusing on the user-end, and I think there is a way to keep track during computation.