ctm / mb2-doc

Mb2, poker software
https://devctm.com

Zalgo text is annoying #1341

Closed by ctm 4 months ago

ctm commented 4 months ago

Consider doing a decompose / recompose round-trip on text using icu_normalizer so that Zalgo text will not bleed into neighboring lines.

I haven't actually played with icu_normalizer, but it looks like it would be the Rust equivalent of what is suggested in this Stack Overflow answer.

In practice, nobody is actually abusing mb2 via stacked diacritical marks. However, since I've already done some reading about Zalgo text so that I could find and fix a panic (#1340), it makes sense to create this issue and then see if I can implement the solution quickly, just to prevent it from biting us later.

I don't, however, think it is worth my spending more than an hour on this investigation, so if I can't come up with a quick fix, I'll strip the high priority label.

ctm commented 4 months ago

unicode_normalization looks like a better choice than icu_normalizer.

ctm commented 4 months ago

BTW, unicode-display-width may be useful for determining how many characters to accept after we've stripped annoying diacriticals.
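As a rough illustration of capping what we accept, here is a minimal sketch that limits a message to a number of `char`s using only the standard library. It does not use unicode-display-width, and counting `char`s is only a proxy for display width. The function name `truncate_chars` and its `max_chars` parameter are hypothetical and do not come from mb2.

```rust
/// Hypothetical helper: keeps at most `max_chars` Unicode scalar values.
/// This counts `char`s, not display columns, so a wide character
/// (for example a CJK ideograph) still counts as one.
fn truncate_chars(input: &str, max_chars: usize) -> &str {
    // `char_indices` yields the byte offset of each char, so the slice
    // below always ends on a char boundary and cannot panic.
    match input.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &input[..byte_idx],
        None => input, // already short enough
    }
}
```

Running this after the diacriticals are stripped would mean the cap applies to the characters that are actually kept.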

ctm commented 4 months ago

Here's the magic incantation to DeZalgo a String:

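/// Strips stacked combining marks (Zalgo) from text.
trait DeZalgo {
    fn dezalgoed(self) -> Self;
}

impl DeZalgo for String {
    fn dezalgoed(self) -> Self {
        use {
            unicode_normalization::UnicodeNormalization,
            unicode_properties::{GeneralCategory::NonspacingMark, UnicodeGeneralCategory},
        };

        String::from_iter(
            self.chars()
                // NFC composes each base + mark sequence that has a precomposed form (e.g. é).
                .nfc()
                // Any nonspacing mark still standing alone after composition is dropped.
                .filter(|c| c.general_category() != NonspacingMark),
        )
    }
}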

I added .dezalgoed() to the String we get from lobby messages and chat messages and it makes a big difference.
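To show what the filter step does without pulling in either crate, here is a minimal standard-library sketch. It is not the deployed code. The names `is_combining_mark` and `strip_combining_marks` are hypothetical, and the hard-coded block ranges are an assumption that only approximates the NonspacingMark category. It also skips the NFC step, so a decomposed accent is lost here, whereas `.nfc()` would first compose it into a precomposed letter that survives the filter.

```rust
/// Hypothetical approximation of "is this a combining mark?".
/// Only a hand-picked set of Unicode blocks is covered; the real
/// NonspacingMark category contains many more code points.
fn is_combining_mark(c: char) -> bool {
    matches!(
        c,
        '\u{0300}'..='\u{036F}'   // Combining Diacritical Marks
        | '\u{1AB0}'..='\u{1AFF}' // Combining Diacritical Marks Extended
        | '\u{1DC0}'..='\u{1DFF}' // Combining Diacritical Marks Supplement
        | '\u{20D0}'..='\u{20FF}' // Combining Diacritical Marks for Symbols
        | '\u{FE20}'..='\u{FE2F}' // Combining Half Marks
    )
}

/// Drops every char that `is_combining_mark` flags. No normalization is done.
fn strip_combining_marks(input: &str) -> String {
    input.chars().filter(|c| !is_combining_mark(*c)).collect()
}
```

With this sketch, `strip_combining_marks("h\u{0335}\u{0351}i\u{0336}")` yields `"hi"`, and a precomposed `"caf\u{00E9}"` passes through unchanged.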

Deploying now.

ctm commented 4 months ago

BTW, I didn't use unicode-display-width, because I don't really want to encourage people to paste in long Zalgo text. What I've implemented should be good enough to keep Zalgo text from annoying other people too much and to show a potential hacker that we've at least done some work in this area.