Anders429 / word_filter

A Word Filter for filtering text.
Apache License 2.0

`Node` edges should be graphemes, not chars #18

Closed Anders429 closed 3 years ago

Anders429 commented 3 years ago

It makes much more sense for the edges between nodes to be graphemes, not chars.

For example, the following code fails unexpectedly:

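use word_filter::WordFilterBuilder;

// Filter on a word whose "ã" is made up of two chars.
let filter = WordFilterBuilder::new().words(&["bãr"]).build();

// Expected: the repeated "ã" in the input still matches "bãr". Actual: this assertion fails.
assert_eq!(filter.find("bããr"), vec!["bãr"].into_boxed_slice());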

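The repeated-character check fails because the "ã" here is actually two chars (the base letter `a` followed by a combining tilde, U+0303), even though it appears to users as a single character. Repetition is checked one char at a time, so the two-char sequence is never seen as a repeat. That is why graphemes should be used instead.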
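To make the mismatch concrete, here is a minimal sketch that uses only the standard library. It does not touch `word_filter` itself; `char_count` is a hypothetical helper introduced for illustration, and the strings are written with explicit escapes so the decomposed form is unambiguous.

```rust
/// Number of `char`s (Unicode scalar values) in `s`.
/// Hypothetical helper, not part of `word_filter`.
fn char_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // Decomposed form: 'a' followed by U+0303 COMBINING TILDE.
    let decomposed = "a\u{0303}";
    // Precomposed form: U+00E3 LATIN SMALL LETTER A WITH TILDE.
    let precomposed = "\u{00E3}";

    // Both render as one visible character, but the char counts differ.
    assert_eq!(char_count(decomposed), 2);
    assert_eq!(char_count(precomposed), 1);
    assert_ne!(decomposed, precomposed);

    // In the decomposed "bããr", no two adjacent chars are equal,
    // so a char-by-char repeat check has nothing to collapse.
    let word = "ba\u{0303}a\u{0303}r";
    let chars: Vec<char> = word.chars().collect();
    assert!(chars.windows(2).all(|w| w[0] != w[1]));
    println!("{:?}", chars);
}
```

The chars alternate between `a` and the combining mark, which is why the input looks like a repetition to a reader but not to char-level matching.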

The unicode-segmentation crate provides methods to do just this, and seems to be the popular choice. This is already an optional dependency for one of the built-in censor generators, so perhaps now is the time to make it a required dependency and go full-force with it. Use cases will most likely always revolve around graphemes instead of chars anyway, as the main case is to affect how user-provided data looks to other users.
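As a rough illustration of what grapheme-level repetition handling could look like, the sketch below collapses repeated clusters rather than repeated chars. To stay dependency-free it uses a deliberately simplified stand-in for real segmentation: `simple_clusters` only attaches combining marks in U+0300..=U+036F to the preceding char, which is far from the full Unicode rules. With unicode-segmentation, the clusters would presumably come from `UnicodeSegmentation::graphemes(s, true)` instead. Both function names are hypothetical and this is not how `word_filter` is implemented.

```rust
/// Splits `s` into approximate grapheme clusters: each char plus any
/// directly following combining marks in U+0300..=U+036F.
/// Simplified stand-in for proper Unicode segmentation (assumption).
fn simple_clusters(s: &str) -> Vec<&str> {
    let mut clusters = Vec::new();
    let mut start = 0;
    for (i, c) in s.char_indices() {
        let is_combining = ('\u{0300}'..='\u{036F}').contains(&c);
        // A non-combining char starts a new cluster.
        if i != start && !is_combining {
            clusters.push(&s[start..i]);
            start = i;
        }
    }
    if start < s.len() {
        clusters.push(&s[start..]);
    }
    clusters
}

/// Collapses consecutive identical clusters into one.
fn collapse_repeats(s: &str) -> String {
    let mut clusters = simple_clusters(s);
    clusters.dedup();
    clusters.concat()
}

fn main() {
    // Decomposed "bããr": cluster-level collapsing yields "bãr".
    let input = "ba\u{0303}a\u{0303}r";
    assert_eq!(simple_clusters(input), vec!["b", "a\u{0303}", "a\u{0303}", "r"]);
    assert_eq!(collapse_repeats(input), "ba\u{0303}r");
    println!("{}", collapse_repeats(input));
}
```

Keying on clusters means the repeated "ã" is compared as a unit, which matches what users see on screen.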