It makes much more sense for the edges between nodes to be graphemes, not chars.
For example, the following code fails unexpectedly:
```rust
use word_filter::WordFilterBuilder;

let filter = WordFilterBuilder::new().words(&["bãr"]).build();
assert_eq!(filter.find("bããr"), vec!["bãr"].into_boxed_slice());
```
The match against the repeated "ã" fails because "ã" here is actually two chars (an "a" followed by U+0303 COMBINING TILDE), even though it appears visually to users as a single character. That is why the concept of graphemes should be used instead.
The unicode-segmentation crate provides methods to do just this, and seems to be the popular choice. This is already an optional dependency for one of the built-in censor generators, so perhaps now is the time to make it a required dependency and go full-force with it. Use cases will most likely always revolve around graphemes instead of chars anyway, as the main case is to affect how user-provided data looks to other users.
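To illustrate the mismatch, here is a minimal stdlib-only sketch showing that the decomposed form of "ã" is two `char`s while the precomposed form is one, even though both render identically; the `graphemes` call from unicode-segmentation (shown as a comment, since it is an external crate) would count both as a single grapheme cluster:

```rust
fn main() {
    // Decomposed: "a" + U+0303 COMBINING TILDE — one visual character, two chars.
    let decomposed = "a\u{0303}";
    assert_eq!(decomposed.chars().count(), 2);

    // Precomposed: U+00E3 LATIN SMALL LETTER A WITH TILDE — a single char
    // that renders identically to the decomposed form.
    let precomposed = "\u{00E3}";
    assert_eq!(precomposed.chars().count(), 1);

    // With the unicode-segmentation crate, both forms are one grapheme:
    // use unicode_segmentation::UnicodeSegmentation;
    // assert_eq!(decomposed.graphemes(true).count(), 1);
    // assert_eq!(precomposed.graphemes(true).count(), 1);
}
```

A char-keyed trie edge can only ever match one of these two encodings, which is why grapheme-keyed edges line up better with what users actually see.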