distill-io / diff-match-patch.rs

diff-match-patch for rust

Decode and encode delta using different length units #10

Closed eshurakov closed 3 years ago

eshurakov commented 3 years ago

What is the length of "🅰"? Well, it depends...

Most diff-match-patch libraries use UTF-16 under the hood to represent strings, which creates a compatibility problem when a delta generated by one library is used in another.

For example, running a diff "🅰🅱" -> "🅱" in Python produces the delta -2\t=2, but this library operates on Unicode scalar values, which makes it impossible to interpret that delta. The delta produced by this library would be -1\t=1.
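To make the mismatch concrete, here is a minimal Rust sketch that measures the same text in different units and converts a UTF-16 length into a scalar-value length. The names `lengths` and `utf16_len_to_scalar_len` are hypothetical and exist only for illustration; they are not part of this library's API.

```rust
/// Length of `s` in three units:
/// (UTF-8 bytes, UTF-16 code units, Unicode scalar values).
fn lengths(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

/// Hypothetical helper (not part of the library): converts a length given in
/// UTF-16 code units, measured from the start of `text`, into a length in
/// Unicode scalar values. Returns `None` if the boundary would fall inside a
/// surrogate pair or lies past the end of `text`.
fn utf16_len_to_scalar_len(text: &str, utf16_len: usize) -> Option<usize> {
    let mut units = 0;
    let mut scalars = 0;
    for c in text.chars() {
        if units == utf16_len {
            return Some(scalars);
        }
        // 1 for characters in the Basic Multilingual Plane, 2 otherwise.
        units += c.len_utf16();
        scalars += 1;
        if units > utf16_len {
            // The requested boundary splits a surrogate pair.
            return None;
        }
    }
    if units == utf16_len {
        Some(scalars)
    } else {
        None
    }
}
```

With these definitions, `lengths("🅰🅱")` is `(8, 4, 2)`: a UTF-16 based implementation sees 4 units where this library sees 2 scalar values. `utf16_len_to_scalar_len("🅰🅱", 2)` gives `Some(1)`, which is how a `-2` in UTF-16 units maps to a `-1` in scalar values, while a length of 1 gives `None` because it lands in the middle of a surrogate pair.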

To make things more complex, libraries that diff UTF-16 encoded strings sometimes split surrogate pairs, which may result in an invalid string.

For example, take the diff "🅰" -> "🅱". In UTF-16 this is "0xD83C 0xDD70" -> "0xD83C 0xDD71"; only the second code unit differs, and the percent-encoded delta is =1\t-1\t+%ED%B5%B1. This splits the surrogate pair, and the halves of a surrogate pair are not valid Unicode scalar values on their own.
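The split can be reproduced with the standard library alone. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and the common-prefix scan stands in for what a UTF-16 based diff does; it is not the actual algorithm of any particular library.

```rust
/// Number of leading UTF-16 code units that `a` and `b` share.
/// A diff that works on UTF-16 code units would keep this prefix as "equal".
fn common_prefix_utf16(a: &str, b: &str) -> usize {
    a.encode_utf16()
        .zip(b.encode_utf16())
        .take_while(|(x, y)| x == y)
        .count()
}

/// True if `units` is well-formed UTF-16, i.e. contains no unpaired surrogate.
fn is_valid_utf16(units: &[u16]) -> bool {
    String::from_utf16(units).is_ok()
}

/// Percent-encodes a single code unit in the range 0x0800..=0xFFFF using the
/// three-byte UTF-8 bit layout. For a lone surrogate this yields bytes that
/// are not valid UTF-8, which is exactly the problem described above.
fn percent_encode_code_unit(unit: u16) -> String {
    let u = unit as u32;
    let bytes = [
        0xE0 | (u >> 12) as u8,
        0x80 | ((u >> 6) & 0x3F) as u8,
        0x80 | (u & 0x3F) as u8,
    ];
    bytes.iter().map(|b| format!("%{:02X}", b)).collect()
}
```

Here `common_prefix_utf16("🅰", "🅱")` is 1, so the shared high surrogate 0xD83C is kept and only the low surrogate is replaced. The inserted unit 0xDD71 fails `is_valid_utf16(&[0xDD71])`, and `percent_encode_code_unit(0xDD71)` gives `%ED%B5%B1`, matching the delta above.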

In this PR:

It would be awesome if other libraries also operated on Unicode scalar values like this one, but unfortunately they do not.

p.s. sorry for spamming you with PRs 🙇

ajitk commented 3 years ago

This looks great @eshurakov! Can you please resolve the conflict due to one of the other PRs?

eshurakov commented 3 years ago

Thank you for looking into this PR @ajitk! I've resolved the conflicts.

p.s. thank you for bringing diff-match-patch library to rust 🎉

ajitk commented 3 years ago

Thank you @eshurakov for helping us make this better. Cheers!