Normalising space and invisible characters in names.

Zegnat commented 11 years ago

Currently most invisible characters are kept at bay by the trim update, but that only affects whitespace and invisible characters at the start and end of a name. It does not keep people from inserting invisible characters within a name to impersonate other users.

What we need is a way to remove such characters from within a name without preventing people from using whatever their own languages require.

The Unicode FAQ on displaying unsupported characters gives us some tips on what characters may be replaced with just a space or should not be replaced with anything.

I propose we list these characters and then handle each one in the most appropriate way: replace it with a space, or remove it.

Characters that can be replaced

According to the above-mentioned FAQ, any character with the White_Space property set can be replaced with just a space. That gives the following list (taken from PropList.txt):

0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0085          ; White_Space # Cc       <control-0085>
00A0          ; White_Space # Zs       NO-BREAK SPACE
1680          ; White_Space # Zs       OGHAM SPACE MARK
180E          ; White_Space # Zs       MONGOLIAN VOWEL SEPARATOR
2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
2028          ; White_Space # Zl       LINE SEPARATOR
2029          ; White_Space # Zp       PARAGRAPH SEPARATOR
202F          ; White_Space # Zs       NARROW NO-BREAK SPACE
205F          ; White_Space # Zs       MEDIUM MATHEMATICAL SPACE
3000          ; White_Space # Zs       IDEOGRAPHIC SPACE

All of these are already stripped correctly from the ends of a name by the trimming function, but for better security we can also replace them with a single space character when they occur within a name.
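A minimal sketch of that replacement, in Python purely for illustration. The function name `normalise_spaces` is hypothetical, and collapsing a run of several whitespace characters into one space (rather than one space per character) is my assumption, not something settled above:

```python
import re

# Code points with the White_Space property, copied from the list above.
# Written as regex escapes so the invisible characters stay readable.
WHITE_SPACE_CLASS = (
    r"\u0009-\u000D"
    r"\u0020"
    r"\u0085"
    r"\u00A0"
    r"\u1680"
    r"\u180E"
    r"\u2000-\u200A"
    r"\u2028\u2029"
    r"\u202F"
    r"\u205F"
    r"\u3000"
)

WHITE_SPACE_RUN = re.compile("[" + WHITE_SPACE_CLASS + "]+")


def normalise_spaces(name: str) -> str:
    """Replace each run of listed whitespace with one U+0020, then trim the ends."""
    return WHITE_SPACE_RUN.sub(" ", name).strip(" ")
```

With this, `"Kr\u00A0oc"` and `"Kr oc"` end up as the same string, so the two can no longer be registered as different names.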

Characters that can be removed

According to the FAQ, all ‘default-ignorable characters’ can be removed unless we choose to specifically allow them. The problem is that I haven’t been able to find a complete list yet. The FAQ gives us the following:

cursive joiners

200C..200D    ; Join_Control # Cf   [2] ZERO WIDTH NON-JOINER..ZERO WIDTH JOINER

bidirectional format controls

200E..200F    ; Pattern_White_Space # Cf   [2] LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK

the soft hyphen

00AD          ; Hyphen # Cf       SOFT HYPHEN

word joiners

2060, WORD JOINER
FEFF, ZERO WIDTH NO-BREAK SPACE

the zero width space

200B, ZERO WIDTH SPACE

invisible math operators

2061..2064    ; Other_Math # Cf   [4] FUNCTION APPLICATION..INVISIBLE PLUS

Jamo filler characters and variation selectors

I have no idea what these are. Sorry. But the example of a Jamo filler character is included in the following…

Apart from the list in the FAQ, PropList.txt also includes an Other_Default_Ignorable_Code_Point property. I’m guessing these can be stripped too:

034F          ; Other_Default_Ignorable_Code_Point # Mn       COMBINING GRAPHEME JOINER
115F..1160    ; Other_Default_Ignorable_Code_Point # Lo   [2] HANGUL CHOSEONG FILLER..HANGUL JUNGSEONG FILLER
17B4..17B5    ; Other_Default_Ignorable_Code_Point # Mn   [2] KHMER VOWEL INHERENT AQ..KHMER VOWEL INHERENT AA
2065..2069    ; Other_Default_Ignorable_Code_Point # Cn   [5] <reserved-2065>..<reserved-2069>
3164          ; Other_Default_Ignorable_Code_Point # Lo       HANGUL FILLER
FFA0          ; Other_Default_Ignorable_Code_Point # Lo       HALFWIDTH HANGUL FILLER
FFF0..FFF8    ; Other_Default_Ignorable_Code_Point # Cn   [9] <reserved-FFF0>..<reserved-FFF8>
E0000         ; Other_Default_Ignorable_Code_Point # Cn       <reserved-E0000>
E0002..E001F  ; Other_Default_Ignorable_Code_Point # Cn  [30] <reserved-E0002>..<reserved-E001F>
E0080..E00FF  ; Other_Default_Ignorable_Code_Point # Cn [128] <reserved-E0080>..<reserved-E00FF>
E01F0..E0FFF  ; Other_Default_Ignorable_Code_Point # Cn [3600] <reserved-E01F0>..<reserved-E0FFF>

This will probably get us somewhere in terms of protecting against name spoofing. There might be more invisible characters, but you have to start somewhere.
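The removal step could be sketched roughly as follows, again in Python for illustration only. The table contains just the code points listed above (variation selectors are left out because no code points are given for them), and `strip_ignorable` is a hypothetical name:

```python
# Inclusive code point ranges copied from the lists above.
# A single code point is written as (x, x).
REMOVABLE_RANGES = [
    (0x00AD, 0x00AD),    # soft hyphen
    (0x034F, 0x034F),    # combining grapheme joiner
    (0x115F, 0x1160),    # Hangul choseong / jungseong fillers
    (0x17B4, 0x17B5),    # Khmer inherent vowels
    (0x200B, 0x200B),    # zero width space
    (0x200C, 0x200D),    # cursive joiners
    (0x200E, 0x200F),    # left-to-right / right-to-left marks
    (0x2060, 0x2060),    # word joiner
    (0x2061, 0x2064),    # invisible math operators
    (0x2065, 0x2069),    # reserved
    (0x3164, 0x3164),    # Hangul filler
    (0xFEFF, 0xFEFF),    # zero width no-break space
    (0xFFA0, 0xFFA0),    # halfwidth Hangul filler
    (0xFFF0, 0xFFF8),    # reserved
    (0xE0000, 0xE0000),  # reserved
    (0xE0002, 0xE001F),  # reserved
    (0xE0080, 0xE00FF),  # reserved
    (0xE01F0, 0xE0FFF),  # reserved
]

# str.translate deletes every code point that maps to None.
REMOVAL_TABLE = {
    cp: None
    for first, last in REMOVABLE_RANGES
    for cp in range(first, last + 1)
}


def strip_ignorable(name: str) -> str:
    """Remove every listed invisible code point from anywhere in the name."""
    return name.translate(REMOVAL_TABLE)
```

Keeping the ranges as data makes it easy to take an entry out again if we decide to specifically allow it. The cursive joiners are the likely candidates, since some scripts use them legitimately.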

Kroc commented 11 years ago

I think this means we will also have to validate the UTF-8 bytes so as to remove invalid code points and byte-ranges. One could insert invalid bytes that string parsers normally skip over.

Edit: Going to add that as another bug since it will apply to all strings.
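For illustration, a strict check could look something like this in Python. The helper name `decode_strict_utf8` is hypothetical, and rejecting the whole string (instead of removing the bad bytes) is an assumption made to keep the sketch short:

```python
from typing import Optional


def decode_strict_utf8(raw: bytes) -> Optional[str]:
    """Return the decoded text, or None if the bytes are not well-formed UTF-8.

    Python's strict decoder rejects stray bytes, truncated sequences,
    overlong encodings and encoded surrogates.
    """
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
```

Passing `errors="ignore"` to `decode` instead would silently drop the invalid bytes, which is closer to "removing" them but hides the fact that someone tried it.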

Zegnat commented 11 years ago

While I think we will not use it, I am posting this for completeness' sake:

Previously we discussed following the example of IDN blacklisting. Mozilla compiled a blacklist of sorts, with 107 characters that should not be allowed: network.IDN.blacklist_chars.

Kroc commented 11 years ago

Looking at the property list, there are so many ways to spoof by means other than spaces that, if we are to allow Unicode user names, normalising spaces alone is essentially pissing into the wind. I can see the sense in normalising spaces regardless, but perhaps something much broader is needed as well.

We could perhaps try and limit allowed username characters to the character class, normalise the spaces and strip anything remaining so that, whilst letter-spoofing is still possible with different character sets, we can at least guarantee that there's no errant punctuation/space/unusual bytes in there.
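A rough sketch of that allow-list approach, in Python for illustration. Which character class to allow is not pinned down above, so the choice of Unicode general categories L (letters), M (marks) and N (numbers) here is purely an assumption, as is the name `sanitise_username`:

```python
import unicodedata

# Assumed allow-list: major general categories for letters, marks, numbers.
ALLOWED_MAJOR_CATEGORIES = {"L", "M", "N"}


def sanitise_username(name: str) -> str:
    """Keep allowed characters, turn whitespace into spaces, drop the rest."""
    kept = []
    for ch in name:
        if ch.isspace():
            # Anything Python treats as whitespace becomes a plain space.
            kept.append(" ")
        elif unicodedata.category(ch)[0] in ALLOWED_MAJOR_CATEGORIES:
            kept.append(ch)
        # Punctuation, symbols, controls and format characters are dropped.
    # Collapse runs of spaces and trim both ends.
    return " ".join("".join(kept).split())
```

Note that an allow-list like this is not sufficient by itself: the Hangul fillers are category Lo and the combining grapheme joiner is Mn, so they pass the category test and would still need to be stripped separately.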

Kroc / NoNonsenseForum

Normalising space and invisible characters in names. #175
