elm / core

Elm's core libraries
http://package.elm-lang.org/packages/elm/core/latest
BSD 3-Clause "New" or "Revised" License
2.79k stars 359 forks source link

Invalid or incomplete documentation in many String functions (characters vs UTF-16 code units) #1061

Open malaire opened 4 years ago

malaire commented 4 years ago

Documentation of these functions says that they work with characters, but actually they work with UTF-16 code units:

These functions work with characters, but this is not documented:

These functions work with UTF-16 code units, but this is not documented:

These functions work with characters and this is documented (no problem here, except that it's inconsistent for some functions to work with characters while others work with UTF-16 code units):

Related: #977

CSDUMMI commented 4 years ago

Is there any change to the output, that a function using UTF-16 code units would produce and a function using characters, which would do essentially the same thing?

CSDUMMI commented 4 years ago

Especially with String.length, is the length not the number of characters?

malaire commented 4 years ago

@CSDUMMI String.length is not the number of characters in all cases.

Only Unicode characters from "Basic Multilingual Plane" (Plane 0) are encoded as single UTF-16 code unit. All other Unicode characters are encoded as two UTF-16 code units, so if string contains any of these characters then number of UTF-16 code units is not the same as number of characters.

For example many emojis are from Plane 1, and so e.g. String.length "👩" returns 2 even through that is a single character.

More information about Unicode Planes: https://en.wikipedia.org/wiki/Plane_(Unicode)

CSDUMMI commented 4 years ago

Then I think you should open a PR with more or less this information added to the docs. This might be very relevant to some apps. Could a text editor, that counts the number of characters in a file not use this function?

malaire commented 4 years ago

@CSDUMMI Elm only accepts PRs from core contributors, so it would be useless for me to make a PR. It would never be accepted.

rlefevre commented 4 years ago

@malaire There are 96 contributors to this library. I would like Elm to have 96 core contributors but this is not the case. Some contributors PRs have been merged just a few commits ago. It's particularly true for typo fixes, but not only, look at at closed and merged pull requests.

That said, there is no guarantee, it can take time, and it's true that PRs are more often used as inspiration for the actual fixes than merged as is. That does not mean that there are not useful. For bug fixes, a complete and rigorous analysis might be more useful than the actual PR though.

malaire commented 4 years ago

I just got an idea of how to fix this issue. Someone could create a package, called for example InsaneString, which mirrors String exacly and documents properly just how insane this mix of Unicode Characters and Unicode CodePoints is.

Then if Elm project values outside contributions, documentation could be just copied from InsaneString to String. And if Elm project spits on outside contributions as I've read, then documentation would still be available for those who want it.

I'll consider creating this myself in time, but for now I have other projects going on, for example SafeInt which fixes Elm Int.

evancz commented 3 years ago

Fixing this would definitely require serious breaking changes, so it can only be done in a MAJOR release. I understand that the current behavior is confusing, but I also want to be careful not to have frequent MAJOR changes since that has cascading effects across the whole package ecosystem.

This issue was opened after 0.19.1 came out, and I plan revisit these problems is when a 0.19.2 or 0.20.0 is slated to release. A patch could potentially be made to document this better though. Is there a PR for that?