decaporg / decap-cms

A Git-based CMS for Static Site Generators
https://decapcms.org
MIT License

Emoji in front-matter fields should not get converted to Unicode surrogate pairs #577

Closed: AnthoUCAYA closed this issue 7 years ago

AnthoUCAYA commented 7 years ago

- Do you want to request a feature or report a bug? Report a bug

- What is the current behavior? In the CMS, if there is an emoji in an md file, the emoji is replaced by Unicode escape sequences on save, and the Netlify deploy then fails.

- If the current behavior is a bug, please provide the steps to reproduce.
  1. Go to Admin.
  2. Select a page.
  3. Add an emoji in a field of your page.
  4. Save your page.

When Netlify tries to deploy the site:

1:44:17 PM: ERROR: 2017/09/03 11:44:17 page.go:1309: Error parsing page meta data for mypage.md
1:44:17 PM: ERROR: 2017/09/03 11:44:17 page.go:1310: yaml: line 1: found invalid Unicode character escape code
1:44:17 PM: ERROR: 2017/09/03 11:44:17 page.go:683: yaml: line 1: found invalid Unicode character escape code
1:44:17 PM: Error: Error building site: Errors reading pages: Error:yaml: line 1: found invalid Unicode character escape code for mypage.md

- What is the expected behavior? Don’t replace emoji and deploy without errors

- Please mention your node.js, and operating system version. Windows 10 Enterprise

tech4him1 commented 7 years ago

@AnthoUCAYA What specific emoji did you use? This seems to be working for me. Also, are you using Hugo or a different site generator?

Edit: I was able to reproduce the emoji problem with Hugo.

RomuxX commented 7 years ago

Hi @tech4him1, I'm a coworker of @AnthoUCAYA. Yes, we are using Hugo.

I think it comes from the yaml-js library.

tech4him1 commented 7 years ago

The deploy error that you are getting looks like it comes directly from Hugo, not from Netlify. For me, Hugo seems to have errors with some of the newer emoji, but not all of them. The CMS does seem to be outputting valid Unicode escape sequences, though, so I'm thinking this is a problem with Hugo itself. You might try opening an issue there.

If you still think it is a problem with the CMS, can you give me the exact emoji that you are using, and the unicode string that the CMS is outputting?

AnthoUCAYA commented 7 years ago

Hi. For example, this emoji 😊 was converted to these characters: \uD83D\uDE0A. The emoji was in a field of the md file, not in the body.

Tell me if you want further tests.

AnthoUCAYA commented 7 years ago

Just to clarify: the emoji was in a field of the md file, not in the body.

tech4him1 commented 7 years ago

I don't believe that this is really a bug with the CMS, because it actually is valid Unicode to output 😊 as \uD83D\uDE0A (reference: https://mathiasbynens.be/notes/javascript-unicode and http://www.yaml.org/spec/1.2/spec.html#id2770814). YAML parsers are supposed to support UTF-8 and UTF-16 (see http://www.yaml.org/spec/1.2/spec.html#id2771184), but the underlying Hugo library, go-yaml, does not support UTF-16 surrogate pairs, so I have filed an issue here to hopefully get that resolved: https://github.com/go-yaml/yaml/issues/279.

It does seem to be something that we could change in the CMS to provide greater interoperability, though, so I am going to leave this issue open for now to that effect. If you think I have misunderstood this problem, though, please let me know.
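For readers following along, the relationship between the pair \uD83D\uDE0A and the single code point U+1F60A can be checked with a few lines of JavaScript. This is only an illustrative sketch: `surrogatePairToCodePoint` is a hypothetical helper name, not something from the CMS or from js-yaml, and it applies the standard UTF-16 decoding arithmetic.

```javascript
// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
// high must be in 0xD800..0xDBFF and low in 0xDC00..0xDFFF.
function surrogatePairToCodePoint(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

const emoji = "\uD83D\uDE0A"; // 😊 as a JavaScript string stores it
const high = emoji.charCodeAt(0); // first UTF-16 code unit (0xD83D)
const low = emoji.charCodeAt(1); // second UTF-16 code unit (0xDE0A)
const codePoint = surrogatePairToCodePoint(high, low);

console.log(emoji.length); // 2, because length counts code units
console.log(codePoint.toString(16).toUpperCase()); // 1F60A
console.log(codePoint === emoji.codePointAt(0)); // true
```

The two 4-digit escapes and the one 8-digit escape therefore name the same character; the difference is only in which form a given YAML parser accepts.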

tech4him1 commented 7 years ago

@erquhart I'm wondering if we should try to output emoji characters directly, or with 8-digit escaped Unicode sequences instead of 4-digit ones (\U0001F60A instead of \uD83D\uDE0A), just to make it more interoperable. We would have to work with js-yaml upstream, though, since before ES6 you couldn't actually get anything from a string except the surrogate pairs, so that is all they had to work with in that library.

Surrogate Pairs Information: https://mathiasbynens.be/notes/javascript-unicode
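As a rough illustration of the 8-digit option, here is a minimal sketch that walks a string by code point (possible since ES6) and emits \U escapes for anything outside the Basic Multilingual Plane. The function name `escapeAstral` is made up for this example, and this is not how js-yaml implements its dumper; it only shows the idea under the assumption that all other characters can pass through unchanged.

```javascript
// Hypothetical sketch: escape astral (non-BMP) characters as 8-digit \U
// sequences. Not js-yaml's implementation.
function escapeAstral(text) {
  let out = "";
  // for...of iterates a string by code point, so a surrogate pair
  // arrives as one two-unit string instead of two separate halves.
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp > 0xFFFF) {
      out += "\\U" + cp.toString(16).toUpperCase().padStart(8, "0");
    } else {
      out += ch; // BMP characters are left as they are
    }
  }
  return out;
}

console.log(escapeAstral("title \uD83D\uDE0A")); // title \U0001F60A
```

In YAML, such an escape is only interpreted inside a double-quoted scalar, so a real dumper would also have to choose the quoting style accordingly.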

tech4him1 commented 7 years ago

Upstream js-yaml issue about encoding astral characters as something other than surrogate pairs: nodeca/js-yaml#368

tech4him1 commented 7 years ago

@AnthoUCAYA We are working with js-yaml to get these encoded in a more standard way. Here is the PR if you want to test it: nodeca/js-yaml#369.

tech4him1 commented 7 years ago

Here is an explanation from Leon Timmermans on the YAML-core mailing list:

Javascript (and a few other languages with UTF-16 implementation details leaking out) has a tendency to treat such characters as two surrogates ("\uD83D\uDCA9"), instead of as a single character ("\U0001F4A9"). Quite frankly I think this is unhelpful and wrong, but JSON actually made it a standard -_-.

The YAML spec explicitly bans literal surrogate pairs, but is silent on escaped surrogates. Nothing in it suggests they are supported, except the suggestion of JSON compatibility. \U on the other hand is required to be supported. I don't think putting a literal astral printable character is erroneous, but quoted is probably safer whenever possible.
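To make the difference between the two escape forms concrete, the sketch below rewrites escaped surrogate pairs in already-generated YAML text into the \U form described in the quote. It is a hedged illustration only: `rewriteEscapedSurrogates` is a hypothetical name, the regular expression is an assumption about what the escaped output looks like, and it does not handle an escaped backslash placed directly before a \u sequence.

```javascript
// Hypothetical post-processing sketch: turn "\uD83D\uDE0A"-style escaped
// surrogate pairs in YAML text into a single "\U0001F60A"-style escape.
// Caveat: a literal "\\" immediately before "\u" is not detected here.
function rewriteEscapedSurrogates(yamlText) {
  // High surrogate D800..DBFF followed by low surrogate DC00..DFFF.
  const pair = /\\u(D[89AB][0-9A-F]{2})\\u(D[C-F][0-9A-F]{2})/gi;
  return yamlText.replace(pair, (match, hi, lo) => {
    const cp =
      (parseInt(hi, 16) - 0xD800) * 0x400 +
      (parseInt(lo, 16) - 0xDC00) +
      0x10000;
    return "\\U" + cp.toString(16).toUpperCase().padStart(8, "0");
  });
}

const before = 'title: "Hello \\uD83D\\uDE0A"';
console.log(rewriteEscapedSurrogates(before)); // title: "Hello \U0001F60A"
```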

AnthoUCAYA commented 7 years ago

@tech4him1,

Thanks for this fix, but when I changed the version of netlify-cms to "0.5.0-beta.10" in my package.json, I still had the same problem. Is that the right version to use?

tech4him1 commented 7 years ago

@AnthoUCAYA No, it hasn't been released yet. It will be in the next beta, or in the full release if we are ready (0.5.0-beta.11 or 0.5.0).

AnthoUCAYA commented 7 years ago

@tech4him1, OK, thanks!

AnthoUCAYA commented 7 years ago

@tech4him1,

I just tested version 0.5.0 and it works! Thanks.