Rosey / markdown-draft-js

A tool to convert content created in DraftJS to markdown and vice versa.
https://rosey.github.io/markdown-draft-js/
MIT License
318 stars 69 forks source link

Fix handling of emoji #47

Closed benjie closed 6 years ago

benjie commented 6 years ago

Hey, great project!

I discovered an issue where when you perform markdownToDraftJS on markdown that contains emoji, all the DraftJS inline styles become offset, e.g.:

Testing 👍 _italic_ words words words **bold** words words words

(Testing \ud83d\udc4d _italic_ words words words **bold** words words words)

becomes:

screen shot 2018-01-22 at 12 44 39

I've submitted this PR to fix this issue, when reviewing I recommend reviewing the commits one at a time - note that the second commit is just a direct copy from fbjs's UnicodeUtils.js which I then tweak in the next commit to make any changes explicit.

I include below a longer explanation of the issue, mostly so I remember exactly how I figured all this out.


The reason for this issue is that DraftJS treats length of the 👍 as 1 (it is 1 character) whereas JavaScript strings treat it has having length 2 due to the surrogate pair ( "\ud83d\udc4d".length === 2 / "👍".length === 2).

To see this in their tests, check out the test for decodeInlineStyleRanges:

    text: 'Take a \uD83D\uDCF7 #selfie',
    inlineStyleRanges: [
      {offset: 4, length: 4, style: 'BOLD'},
      {offset: 6, length: 8, style: 'ITALIC'},
    ],

and the resulting snapshot:

Array [
  Array [],
  Array [],
  Array [],
  Array [],
  Array [
    "BOLD",
  ],
  Array [
    "BOLD",
  ],
  Array [
    "BOLD",
    "ITALIC",
  ],
  Array [
    "BOLD",
    "ITALIC",
  ],
  Array [
    "BOLD",
    "ITALIC",
  ],
  Array [
    "ITALIC",
  ],
  Array [
    "ITALIC",
  ],
  Array [
    "ITALIC",
  ],
  Array [
    "ITALIC",
  ],
  Array [
    "ITALIC",
  ],
  Array [
    "ITALIC",
  ],
  Array [],
  Array [],
]

Here you can see that because BOLD and ITALIC both overlap the emoji, even though their lengths are 4 and 8 respectively the each affect 5 and 9 elements in the resulting array respectively.

This is because in decodeInlineStyleRanges, they use substr from UnicodeUtils to parse the ranges.

To solve this, we need to use strlen(content) from UnicodeUtils to calculate the string lengths rather than using content.length. Frustratingly UnicodeUtils is in the fbjs package which explicitly says to not depend on it because it may break. For this reason (and since it's MIT licensed so shouldn't be an issue) I've copied over the one file UnicodeUtils.js into this project and extremely lightly edited it so it doesn't add any dependencies. I've also added a failing test to reproduce the issue which I've then made pass using UnicodeUtils.strlen.

Rosey commented 6 years ago

Awesome thank you! I've dealt with this gotcha in the past and I think there's a solution that shouldn't require adding Unicode.js - I'm just on my phone right now so I can't test to confirm but using Array.from(string).length ? (Documented at https://mathiasbynens.be/notes/javascript-unicode)

benjie commented 6 years ago

I considered that from this post but it's an ES6 feature so I figured we'd have to add a substitute implementation anyway and I feared there may be unicode strings that differ in length in Array.from vs strlen. Ultimately I figured it'd be safer to use DraftJS's internal implementation. (There must be a reason they don't use Array.from? Maybe it's just that it requires polyfilling?)

I just tried a couple emoji that return length 2 from Array.from, and they also have length 2 with strlen:

    console.log(Array.from('❤️').length); // 2
    console.log(strlen('❤️')); // 2
    console.log(Array.from('👍🏻').length); // 2
    console.log(strlen('👍🏻')); // 2

This isn't an exhaustive test though, and I'm not sufficiently knowledgeable about Unicode to make this determination. Happy to make the change if you're confident though!

benjie commented 6 years ago

The example that post gives with length 8 / 7 also works fine with strlen / Array.from:

const emoji = '\ud83d\udc69\u200d\u2764\ufe0f\u200d\ud83d\udc8b\u200d\ud83d\udc69';
console.log(emoji); // 👩‍❤️‍💋‍👩
console.log(Array.from(emoji).length); // 8
console.log(strlen(emoji)); // 8
Rosey commented 6 years ago

Sorry @benjie I don't mean to leave you hanging, just have limited free time these days 🙃 This is my first time opening my laptop in weeks! And it will only last about 5 minutes...

but it's an ES6 feature so I figured we'd have to add a substitute implementation anyway

I think we're fine without that; the library already uses a few ES6 features and has a disclaimer in the readme about the need for a polyfill!

So yeah I'm happy to just use Array.from! Keeps things smaller and simpler and I'm pretty sure it should work in all cases 🤞 (I am also not a unicode expert but that's just based on my limited experience on other projects where I saw this come up)

benjie commented 6 years ago

Cool; I’ll get it done over the next week hopefully! 👍 No worries about delay; I saw your message in the readme 😄

benjie commented 6 years ago

Closing this so we can easily restore fbjs implementation if it turns out we need it. Opening a new PR with the simplified code: #48