avh4 / elm-format

elm-format formats Elm source code according to a standard set of rules based on the official Elm Style Guide
BSD 3-Clause "New" or "Revised" License

Should zero-width spaces (\x200B) be handled differently? #452

Open scottwillmoore opened 6 years ago

scottwillmoore commented 6 years ago

maintainer edit: original title: សួស្តី​ពិភពលោក formated as សួស្តី\x200Bពិភពលោក

Hi,

I was working on a small Elm app, which prints "Hello World!" in various languages. I am not a language expert, but it appears that the Khmer and Lao strings (examples below) are formatted with escape codes.

For clarity - I believe it would be better if the formatter preserved the original Unicode characters (the escapes now make the code ugly and harder to read - not that I am able to read Khmer or Lao!!).

Here is the code before elm-format was run.

, Greeting "Khmer" "សួស្តី​ពិភពលោក!"
, Greeting "Kyrgyz" "Салам дүйнө!"
, Greeting "Lao" "ສະ​ບາຍ​ດີ​ຊາວ​ໂລກ!"

Here is the code after elm-format was run.

, Greeting "Khmer" "សួស្តី\x200Bពិភពលោក!"
, Greeting "Kyrgyz" "Салам дүйнө!"
, Greeting "Lao" "ສະ\x200Bບາຍ\x200Bດີ\x200Bຊາວ\x200Bໂລກ!"

In case you're interested in the source code for context, you can find it at scottwillmoore/elm-hello.

I am using elm-format installed using yarn global add elm-format. The version I am using appears to be elm-format-0.18 0.6.1-alpha. I am formatting through the VSCode plugin vscode-elm.

avh4 commented 6 years ago

\x200B is the zero-width space character (http://www.fileformat.info/info/unicode/char/200B/index.htm), so in your example, the original text you pasted in apparently contained those spaces between words. elm-format currently escapes all whitespace characters except normal spaces to help avoid confusion and to make unusual characters apparent.
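
The escaping rule described above can be sketched roughly as follows. This is an illustrative Python sketch, not elm-format's actual implementation, which is not shown in this thread. The function name `escape_unusual_whitespace` and the small `ESCAPED` set are assumptions made for the example; the `\x200B` output form mirrors the formatted strings quoted earlier.

```python
# Hypothetical set of "unusual" whitespace-like characters to escape.
# Assumption: the real rule in elm-format may cover more (or different) characters.
ESCAPED = {
    "\u200b",  # zero-width space
    "\u00a0",  # no-break space
}


def escape_unusual_whitespace(text: str) -> str:
    """Replace each character in ESCAPED with a visible \\xHHHH escape.

    Ordinary spaces (U+0020) and all other characters are left untouched.
    """
    parts = []
    for ch in text:
        if ch in ESCAPED:
            # Format the code point as four uppercase hex digits, e.g. \x200B.
            parts.append("\\x{:04X}".format(ord(ch)))
        else:
            parts.append(ch)
    return "".join(parts)


khmer = "សួស្តី\u200bពិភពលោក!"
print(escape_unusual_whitespace(khmer))  # prints សួស្តី\x200Bពិភពលោក!
```

A string without any of the listed characters, such as the Kyrgyz greeting, passes through unchanged.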

If someone has more details about how zero-width spaces are typically used in languages that need them, I'd be happy to consider a better solution, but as it currently stands, I think the current behavior is desired so that unusual whitespace in strings is apparent to people reading the code.

What do you think? Are your strings meant to have zero-width spaces in them? If so, how do you see whether or not you have them in the right place?

scottwillmoore commented 6 years ago

Ah of course - I should've looked up the Unicode character earlier.

The various "Hello World!" greetings are sourced from this codegolf page on StackExchange. As for the purpose of the zero-width space, I was also unsure at first, as I am not fluent in these languages. However, after a little bit of Googling I was able to dig up the following.

The Wikipedia article on the Zero-width space states:

The zero-width space (ZWSP) is a non-printing character used in computerized typesetting to indicate word boundaries to text processing systems when using scripts that do not use explicit spacing, or after characters (such as the slash) that are not followed by a visible space but after which there may nevertheless be a line break. Normally, it is not a visible separation, but it may expand in passages that are fully justified.

Further research yielded the following explanation from this bug at The Document Foundation:

Since the Lao written language does not word wrap properly, scripts have been developed for windows and MS word to automatically insert a Zero Width Space (U+200B) between each word. These are indivisible in MS Word, but show up as small forward slashes with a dark grey background in Libreoffice. (I have observed this over several years on Mac, Windows and Linux. Currently I am on a Mac with Libreoffice 4.0.1.2 (wish there were automatic updates).

TL;DR: for some languages, the zero-width space is important. It acts as a marker that separates words, so that word-wrapping can still occur.

Unless someone wants to argue otherwise, I believe that the behaviour of the formatter is correct, since this is a zero-width space. Regardless of the context, the zero-width space is invisible within most text editors and should hence be explicitly typed to avoid confusion.
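
Because the character is invisible in most editors, one way to check where it occurs in a string is to list its positions programmatically. Below is a minimal Python sketch; the helpers `zwsp_positions` and `show_zwsp` are hypothetical names made up for this example and are not part of elm-format or any tool mentioned in this thread.

```python
from typing import List

ZWSP = "\u200b"  # zero-width space, U+200B


def zwsp_positions(text: str) -> List[int]:
    """Return the code-point indexes at which a zero-width space occurs."""
    return [i for i, ch in enumerate(text) if ch == ZWSP]


def show_zwsp(text: str, marker: str = "|") -> str:
    """Return a copy of text with each zero-width space replaced by a visible marker."""
    return text.replace(ZWSP, marker)


lao = "ສະ\u200bບາຍ\u200bດີ\u200bຊາວ\u200bໂລກ!"
print(zwsp_positions(lao))  # indexes of the four zero-width spaces
print(show_zwsp(lao))       # the same greeting with "|" at each word boundary
```

Replacing the character with a visible marker makes it possible to eyeball whether the word boundaries are where a fluent reader would expect them.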