Open Zegnat opened 11 years ago
I think this means we will also have to validate the UTF-8 bytes so as to remove invalid code points and byte-ranges. One could insert invalid bytes that string parsers normally skip over.
Edit: Going to add that as another bug since it will apply to all strings.
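For reference, a minimal sketch of what that byte-level check could look like, in Python purely for illustration. The function names are hypothetical and not from our code base. It relies on Python's strict UTF-8 decoder, which rejects stray bytes, overlong encodings and encoded surrogates.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True only if raw is well-formed UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def drop_invalid_utf8(raw: bytes) -> str:
    """Decode raw, silently discarding bytes that are not valid UTF-8.

    Assumption: discarding is acceptable here. Rejecting the input
    outright is the stricter alternative.
    """
    return raw.decode("utf-8", errors="ignore")
```

Rejecting with `is_valid_utf8` is probably the safer default, since silently dropping bytes can turn one name into another.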
While I think we will not use it, I am posting this for completeness' sake:
Previously we discussed taking IDN blacklisting as an example. Mozilla compiled a blacklist of sorts with 107 characters that should not be allowed: network.IDN.blacklist_chars.
Looking at the property list, there are so many ways to spoof other than with spaces that, if we are to allow Unicode user names, normalising spaces alone is essentially pissing into the wind. I can still see the sense in normalising spaces, but perhaps something much broader is needed as well.
We could perhaps try and limit allowed username characters to the character class, normalise the spaces and strip anything remaining so that, whilst letter-spoofing is still possible with different character sets, we can at least guarantee that there's no errant punctuation/space/unusual bytes in there.
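A rough sketch of that whitelist idea, in Python for illustration only. Which character class to allow is not settled above, so this assumes letters, combining marks and numbers (Unicode general categories L, M and N). Both that choice and the function name are assumptions.

```python
import unicodedata

# Assumption: allow letters (L*), combining marks (M*) and numbers (N*).
ALLOWED_MAJOR_CATEGORIES = {"L", "M", "N"}


def restrict_username(name: str) -> str:
    """Keep allowed characters, turn whitespace into spaces, drop the rest."""
    kept = []
    for ch in name:
        if ch.isspace():
            kept.append(" ")
        elif unicodedata.category(ch)[0] in ALLOWED_MAJOR_CATEGORIES:
            kept.append(ch)
        # Anything else (punctuation, format characters, symbols) is dropped.
    # Collapse runs of spaces and trim both ends.
    return " ".join("".join(kept).split())
```

As noted, this does nothing against look-alike letters from other scripts. It only guarantees that no punctuation, odd spacing or invisible format characters survive.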
Related to trimming Unicode from start and end.
Currently most invisible characters are kept at bay by the trim update, but this only affects whitespace and invisible characters in front of and behind a name. It does not keep people from inserting invisible characters within a name to impersonate other users.
What we need is a way to remove weird characters from within a name without hindering people from using whatever they need in their own languages.
The Unicode FAQ on displaying unsupported characters gives us some tips on what characters may be replaced with just a space or should not be replaced with anything.
I propose we try to list these characters and then replace them with the optimal solution.
Characters that can be replaced
According to the above-mentioned FAQ, any character with the White_Space property set can be replaced with just a space. This would be the following list (taken from PropList.txt):
All of these are correctly stripped by the trimming function but for higher security we can exchange them for a single space character within names.
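A minimal sketch of that replacement, in Python for illustration. The code points are the White_Space entries as I understand current PropList.txt to list them. They are transcribed by hand, so treat them as an assumption and check them against the file. Collapsing a run of such characters into one space is also my assumption, and the function name is hypothetical.

```python
import re

# White_Space code points, hand-transcribed from PropList.txt (verify!):
# 0009..000D, 0020, 0085, 00A0, 1680, 2000..200A, 2028, 2029, 202F, 205F, 3000
_WHITE_SPACE_RUN = re.compile(
    r"[\u0009-\u000d\u0020\u0085\u00a0\u1680"
    r"\u2000-\u200a\u2028\u2029\u202f\u205f\u3000]+"
)


def normalise_spaces(name: str) -> str:
    """Replace every run of White_Space characters with one U+0020 space."""
    return _WHITE_SPACE_RUN.sub(" ", name)
```

With this, `normalise_spaces("John\u00a0Smith")` and `normalise_spaces("John Smith")` give the same string, so a no-break space can no longer be used to register a visually identical name.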
Characters that can be removed
According to the FAQ, all ‘default-ignorable characters’ can be removed unless we choose to specifically allow them. The problem is, I haven’t been able to find a complete list yet. The FAQ gives us the following:
2060, WORD JOINER
FEFF, ZERO WIDTH NO-BREAK SPACE
200B, ZERO WIDTH SPACE
Jamo filler characters and variation selectors
I have no idea what these are. Sorry. But the example of a Jamo filler character is included in the following…
Apart from this list in the FAQ, PropList.txt also includes an Other_Default_Ignorable_Code_Point property. I’m guessing these can be stripped:
This will probably get us somewhere in terms of protecting against name spoofing. There might be more invisible characters, but you have to start somewhere.
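As a starting point, stripping could be sketched like this, in Python for illustration. The table covers only the characters the FAQ names, plus the Hangul Jamo fillers and variation selector ranges as I understand them. It is deliberately partial, the ranges are an assumption to be checked against the Unicode data files, and the names are hypothetical.

```python
# Partial list, NOT the complete default-ignorable set (assumption: verify
# each range against PropList.txt / DerivedCoreProperties.txt).
REMOVABLE_RANGES = [
    (0x200B, 0x200B),    # ZERO WIDTH SPACE
    (0x2060, 0x2060),    # WORD JOINER
    (0xFEFF, 0xFEFF),    # ZERO WIDTH NO-BREAK SPACE
    (0x115F, 0x1160),    # HANGUL CHOSEONG FILLER, HANGUL JUNGSEONG FILLER
    (0xFE00, 0xFE0F),    # VARIATION SELECTOR-1 .. VARIATION SELECTOR-16
    (0xE0100, 0xE01EF),  # VARIATION SELECTOR-17 .. VARIATION SELECTOR-256
]

# Mapping a code point to None makes str.translate() delete it.
_DELETE_TABLE = {
    cp: None for lo, hi in REMOVABLE_RANGES for cp in range(lo, hi + 1)
}


def strip_ignorables(name: str) -> str:
    """Remove the listed invisible characters anywhere in the name."""
    return name.translate(_DELETE_TABLE)
```

ZERO WIDTH JOINER and ZERO WIDTH NON-JOINER are left out on purpose, since some scripts need them for correct rendering. Whether to strip those is a separate decision.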