Closed: ctm closed this 4 months ago
unicode_normalization looks like a better choice.
BTW, unicode-display-width may be useful for determining how many characters to accept after we've stripped annoying diacriticals.
Here's the magic incantation to DeZalgo a String:
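    /// Extension trait that strips stacked combining marks ("Zalgo" text).
    trait DeZalgo {
        fn dezalgoed(self) -> Self;
    }

    impl DeZalgo for String {
        fn dezalgoed(self) -> Self {
            use {
                unicode_normalization::UnicodeNormalization,
                unicode_properties::{GeneralCategory::NonspacingMark, UnicodeGeneralCategory},
            };
            String::from_iter(
                self.chars()
                    // Compose first, so an accent that has a precomposed form
                    // (e.g. e + U+0301 -> U+00E9) is folded into its base character.
                    .nfc()
                    // Then drop every nonspacing mark that is still standalone.
                    .filter(|c| c.general_category() != NonspacingMark),
            )
        }
    }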
I added .dezalgoed() to the String we get from lobby messages and chat messages, and it makes a big difference.
Deploying now.
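For anyone reading later, here is a std-only sketch of why the compose step comes before the filter. It is not the deployed code: the helper names are hypothetical, and the hand-rolled range check covers only the Combining Diacritical Marks block (U+0300..=U+036F), which is a small subset of what the `NonspacingMark` category in unicode_properties matches.

```rust
/// Hypothetical helper: true for code points in the Combining Diacritical
/// Marks block (U+0300..=U+036F). Assumption: this stands in for the much
/// broader NonspacingMark general-category check used in the real code.
fn is_basic_combining_mark(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Drops every code point in that block and leaves all other characters alone.
/// No normalization happens here, so decomposed accents are lost as well.
fn strip_basic_combining_marks(s: &str) -> String {
    s.chars().filter(|c| !is_basic_combining_mark(*c)).collect()
}
```

With this sketch, `strip_basic_combining_marks("Z\u{0351}\u{036B}a\u{0310}\u{0334}lgo")` gives `"Zalgo"`. A precomposed `"caf\u{00E9}"` passes through unchanged, while the decomposed `"cafe\u{0301}"` comes back as plain `"cafe"`. Running `.nfc()` before the filter, as `dezalgoed()` does, avoids that second case by folding the accent into its base character first.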
BTW, I didn't use unicode-display-width, because I don't really want to encourage people to paste in long Zalgo text. What I've implemented should be good enough to keep it from annoying other people too much and to show a potential hacker that we've at least done some work in this area.
Consider doing a decompose / recompose round-trip on text using icu_normalizer so that zalgo text will not bleed.
I haven't actually played with icu_normalizer, but it looks like it would be the Rust equivalent of what is suggested in this Stack Overflow answer.
In practice, nobody is actually abusing mb2 via stacked diacritical marks. However, since I've already done some reading about Zalgo text so that I could find and fix a panic (#1340), it makes sense to create this issue and then see if I can implement the solution quickly, just to prevent it from biting us later.
I don't, however, think it is worth spending more than an hour on this investigation, so if I can't come up with a quick fix, I'll strip the high priority label.