Closed sikanhe closed 8 years ago
Upon some investigation, this happens with a bunch of UTF8 codepoints that should be whitespace as far as I understand:
0x00A0
0x202F
0xFEFF
iex(14)> String.rstrip(<<"foo", 0x00A0 :: utf8>>) == "foo"
false
iex(15)> String.rstrip(<<"foo", 0x202F :: utf8>>) == "foo"
false
iex(16)> String.rstrip(<<"foo", 0xFEFF :: utf8>>) == "foo"
false
The first two have a bidi of "CS"
, which is not considered whitespace (here's where Elixir looks at the bidi to determine what is whitespace); not sure what it means, but indeed not all codepoints with a bidi of CS
are whitespace (just search for ;CS;
in the UnicodeData.txt file).
The last one (0xFEFF
) has a bidi of Cf
, so still not considered whitespace. Again, another bunch of non-whitespace characters have this bidi but also a bunch of whitespace codepoints with this bidi, so not sure what to do :expressionless:.
Yup, we only use the ones specified by Unicode in Bidi. If the consortium provides a separate file with whitespace characters, I will be glad to use them, but it seems are correct according to Unicode.
well this is annoying because sometimes user inputs will contain trailing whitespace that is not the normal space.
I would think that there is such no instance someone would want to only trim certain kind of white space not all other kinds. Or maybe if we can add a String.trim method that does more than String.strip?
@sikanhe so use your own stripping function that strips whatever you want. The specification of String.strip is very clear on its purpose. I also don't understand the fixation in stripping those particular characters, there are hundreds of other unicode characters in user inputs that are not going to show as expected.
so how about letting String.strip/2 take a list of codepoints?
I have also been surprised by strip/1
not removing U+00A0
(the no-break space). The documentation for strip
states that it removes «all leading and trailing Unicode whitespaces» … and U+00A0
is unicode whitespace. As an anecdote, similar functions in ruby also leave the no-break whitespace, while in nodejs it is removed.
So it seems that we either need to update the docs or the way Elixir recognises unicode whitespace. For example, see how Java documents the isWhitespace
method, it is quite exhaustive.
I would be willing to spend some time writing a patch to really recognise unicode whitespace, if this is is something interesting for Elixir, or to reword the documentation for strip/1
/split/1
so that it is less surprising when it does not remove some unicode whitespace.
If writing a patch, the consortium does provide PropList.txt
which lists all characters with certain properties (the first being the White_Space
property), so I would base the PR on parsing that file.
(note also that the same list of whitespace chars is used by split/1
and it might be surprising to split in a no-break whitespace, so we whould probably needto keep different lists for strip/1
and split/1
)
For reference, this is a comparison of what Unicode documents as whitespace and what Elixir believes to be whitespace:
unicode | elixir
----------------
0009 | 0009
000A | 000A
000B | 000B
000C | 000C
000D | 000D
| 001C
| 001D
| 001E
| 001F
0020 | 0020
0085 | 0085
00A0 |
1680 | 1680
2000 | 2000
2001 | 2001
2002 | 2002
2003 | 2003
2004 | 2004
2005 | 2005
2006 | 2006
2007 | 2007
2008 | 2008
2009 | 2009
200A | 200A
2028 | 2028
2029 | 2029
202F |
205F | 205F
3000 | 3000
So in Elixir we are missing 2 unicode whitespace characters and adding other 4 that are not unicode whitespace.
Thank you @amuino. This is really complicated because it is coupled to what we consider to be whitespace. Unfortunately, that's not a static definition because it depends on the context. The property list you linked defines whitespace as follows:
Spaces, separator characters and other control characters which should be treated by programming languages as "white space" for the purpose of parsing elements.
Source: http://unicode.org/reports/tr44/#White_Space
I would argue that's not necessarily the context strip and split are being applied on. So I am personally more inclined in documenting on what we exactly consider as whitespace. Does it make sense?
I'm slightly biased towards using the unicode whitespace property for consistency, since it seems to be honoured by the Regex
module when in unicode mode (in my head, if something is whitespace according to a language built-in, it should also for others).
iex(2)> Regex.match? ~r/\s/, <<160::utf8>>
false
iex(3)> Regex.match? ~r/\s/u, <<160::utf8>>
true
But there seems to be no consensus between programming languages, so updating the docs might suffice and is much simpler (no code, no backwards incompatibilities). I'll work on that 🙂.
@amuino good point about the Regex. If we change the strip to use the proplist implementation, which version should we use for split?
Didn't really think it much… I would go for "unicode whitespace except no-break chars", which in practice means "except U+00A0
" but if we want to generate no-break chars from a unicode table, I have not researched if one exists.
Oh, seems that "breakability" is present in UnicodeData.txt
in the decomposition
column we are currently ignoring, marked with the value <noBreak>
.
@amuino should we therefore strip on all whitespace UNLESS they have the nobreak property?
The last sentence was an affirmation. :) Let's strip on all whitespaces unless they have the nobreak property. Can you please send a pull request? Btw, I don't think we need the whole PropData.txt file, only the whitespace parts.
Just to confirm: you mean strip
and split
both will work on unicode whitespace except no-break whitespace? (I was hopping for strip
to work on all whitespace, since the no-break semantic only seems relevant when splitting)
strip
on all whitespace and split
on all non-breaking whitespace. :)
strip on all whitespace and split on all non-breaking whitespace. :)
I think that is a good solution
PR submitted on #4389! Probably this issue can be closed now?
@amuino let's wait to close the issue until #4389 gets merged :)
@whatyouhide 🙂merged!
:clap:
To reproduce:
String.strip "dfdf" <> <<160::utf8>>