elixir-lang / elixir

Elixir is a dynamic, functional language for building scalable and maintainable applications
https://elixir-lang.org/
Apache License 2.0
24.28k stars 3.34k forks source link

String.strip does not strip non-breaking space #4300

Closed sikanhe closed 8 years ago

sikanhe commented 8 years ago

To reproduce:

String.strip "dfdf" <> <<160::utf8>>

whatyouhide commented 8 years ago

Upon some investigation, this happens with a bunch of UTF8 codepoints that should be whitespace as far as I understand:

iex(14)> String.rstrip(<<"foo", 0x00A0 :: utf8>>) == "foo"
false
iex(15)> String.rstrip(<<"foo", 0x202F :: utf8>>) == "foo"
false
iex(16)> String.rstrip(<<"foo", 0xFEFF :: utf8>>) == "foo"
false

The first two have a bidi of "CS", which is not considered whitespace (here's where Elixir looks at the bidi to determine what is whitespace); not sure what it means, but indeed not all codepoints with a bidi of CS are whitespace (just search for ;CS; in the UnicodeData.txt file).

The last one (0xFEFF) has a bidi of Cf, so still not considered whitespace. Again, another bunch of non-whitespace characters have this bidi but also a bunch of whitespace codepoints with this bidi, so not sure what to do :expressionless:.

josevalim commented 8 years ago

Yup, we only use the ones specified by Unicode in Bidi. If the consortium provides a separate file with whitespace characters, I will be glad to use them, but it seems are correct according to Unicode.

sikanhe commented 8 years ago

well this is annoying because sometimes user inputs will contain trailing whitespace that is not the normal space.

I would think that there is such no instance someone would want to only trim certain kind of white space not all other kinds. Or maybe if we can add a String.trim method that does more than String.strip?

josevalim commented 8 years ago

@sikanhe so use your own stripping function that strips whatever you want. The specification of String.strip is very clear on its purpose. I also don't understand the fixation in stripping those particular characters, there are hundreds of other unicode characters in user inputs that are not going to show as expected.

eksperimental commented 8 years ago

so how about letting String.strip/2 take a list of codepoints?

amuino commented 8 years ago

I have also been surprised by strip/1 not removing U+00A0 (the no-break space). The documentation for strip states that it removes «all leading and trailing Unicode whitespaces» … and U+00A0 is unicode whitespace. As an anecdote, similar functions in ruby also leave the no-break whitespace, while in nodejs it is removed.

So it seems that we either need to update the docs or the way Elixir recognises unicode whitespace. For example, see how Java documents the isWhitespace method, it is quite exhaustive.

I would be willing to spend some time writing a patch to really recognise unicode whitespace, if this is is something interesting for Elixir, or to reword the documentation for strip/1/split/1 so that it is less surprising when it does not remove some unicode whitespace.

If writing a patch, the consortium does provide PropList.txt which lists all characters with certain properties (the first being the White_Space property), so I would base the PR on parsing that file.

(note also that the same list of whitespace chars is used by split/1 and it might be surprising to split in a no-break whitespace, so we whould probably needto keep different lists for strip/1 and split/1)

For reference, this is a comparison of what Unicode documents as whitespace and what Elixir believes to be whitespace:

unicode | elixir
----------------
0009    | 0009 
000A    | 000A
000B    | 000B
000C    | 000C
000D    | 000D
        | 001C
        | 001D
        | 001E
        | 001F
0020    | 0020
0085    | 0085
00A0    |
1680    | 1680
2000    | 2000
2001    | 2001
2002    | 2002
2003    | 2003
2004    | 2004
2005    | 2005
2006    | 2006
2007    | 2007
2008    | 2008
2009    | 2009
200A    | 200A
2028    | 2028
2029    | 2029
202F    | 
205F    | 205F
3000    | 3000

So in Elixir we are missing 2 unicode whitespace characters and adding other 4 that are not unicode whitespace.

josevalim commented 8 years ago

Thank you @amuino. This is really complicated because it is coupled to what we consider to be whitespace. Unfortunately, that's not a static definition because it depends on the context. The property list you linked defines whitespace as follows:

Spaces, separator characters and other control characters which should be treated by programming languages as "white space" for the purpose of parsing elements.

Source: http://unicode.org/reports/tr44/#White_Space

I would argue that's not necessarily the context strip and split are being applied on. So I am personally more inclined in documenting on what we exactly consider as whitespace. Does it make sense?

amuino commented 8 years ago

I'm slightly biased towards using the unicode whitespace property for consistency, since it seems to be honoured by the Regex module when in unicode mode (in my head, if something is whitespace according to a language built-in, it should also for others).

iex(2)> Regex.match? ~r/\s/, <<160::utf8>>
false
iex(3)> Regex.match? ~r/\s/u, <<160::utf8>>
true

But there seems to be no consensus between programming languages, so updating the docs might suffice and is much simpler (no code, no backwards incompatibilities). I'll work on that 🙂.

josevalim commented 8 years ago

@amuino good point about the Regex. If we change the strip to use the proplist implementation, which version should we use for split?

amuino commented 8 years ago

Didn't really think it much… I would go for "unicode whitespace except no-break chars", which in practice means "except U+00A0" but if we want to generate no-break chars from a unicode table, I have not researched if one exists.

amuino commented 8 years ago

Oh, seems that "breakability" is present in UnicodeData.txt in the decomposition column we are currently ignoring, marked with the value <noBreak>.

josevalim commented 8 years ago

@amuino should we therefore strip on all whitespace UNLESS they have the nobreak property?

josevalim commented 8 years ago

The last sentence was an affirmation. :) Let's strip on all whitespaces unless they have the nobreak property. Can you please send a pull request? Btw, I don't think we need the whole PropData.txt file, only the whitespace parts.

amuino commented 8 years ago

Just to confirm: you mean strip and split both will work on unicode whitespace except no-break whitespace? (I was hopping for strip to work on all whitespace, since the no-break semantic only seems relevant when splitting)

josevalim commented 8 years ago

strip on all whitespace and split on all non-breaking whitespace. :)

sikanhe commented 8 years ago

strip on all whitespace and split on all non-breaking whitespace. :)

I think that is a good solution

amuino commented 8 years ago

PR submitted on #4389! Probably this issue can be closed now?

whatyouhide commented 8 years ago

@amuino let's wait to close the issue until #4389 gets merged :)

amuino commented 8 years ago

@whatyouhide 🙂merged!

whatyouhide commented 8 years ago

:clap: