Decode and encode delta using different length units

eshurakov commented 3 years ago

What is the length of "🅰"? Well, it depends...

if counting in Unicode scalar values (`char` in Rust), it is 1
if counting in UTF-16 code units, it is 2 (it is encoded as a surrogate pair)
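A quick way to see both counts side by side, using only the standard library (the helper names `len_scalar` and `len_utf16` are made up for this illustration and are not part of the PR):

```rust
/// Length in Unicode scalar values (what Rust's `char` represents).
fn len_scalar(s: &str) -> usize {
    s.chars().count()
}

/// Length in UTF-16 code units (what UTF-16 based implementations count).
fn len_utf16(s: &str) -> usize {
    s.encode_utf16().count()
}

fn main() {
    let s = "🅰"; // U+1F170, outside the Basic Multilingual Plane
    println!("scalar values:     {}", len_scalar(s)); // 1
    println!("UTF-16 code units: {}", len_utf16(s)); // 2
}
```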

Most diff-match-patch libraries use UTF-16 under the hood to represent strings, which creates a compatibility problem when a delta generated by one library is consumed by another.

For example, running a diff "🅰🅱" -> "🅱" in Python will produce the following delta: -2\t=2. This library operates on scalar values, so it cannot interpret that delta correctly. The delta produced by this library would be -1\t=1.
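To make the mismatch concrete, here is a minimal sketch that serializes the same edit script with two different length units. The `Unit` enum and `Op` type are hypothetical stand-ins written for this comment; they are not the PR's `LengthUnit` or the library's diff types, and insertions are left out because they carry text rather than a length.

```rust
/// Hypothetical stand-in for the PR's `LengthUnit`; variant names are assumptions.
#[derive(Clone, Copy)]
enum Unit {
    ScalarValue,
    Utf16,
}

/// Simplified edit operations (insertions omitted for brevity).
enum Op<'a> {
    Delete(&'a str),
    Equal(&'a str),
}

/// Measure a piece of text in the requested unit.
fn measure(s: &str, unit: Unit) -> usize {
    match unit {
        Unit::ScalarValue => s.chars().count(),
        Unit::Utf16 => s.encode_utf16().count(),
    }
}

/// Serialize the operations as a tab-separated delta of lengths.
fn to_delta(ops: &[Op], unit: Unit) -> String {
    ops.iter()
        .map(|op| match op {
            Op::Delete(s) => format!("-{}", measure(s, unit)),
            Op::Equal(s) => format!("={}", measure(s, unit)),
        })
        .collect::<Vec<_>>()
        .join("\t")
}

fn main() {
    // Edit script for "🅰🅱" -> "🅱": delete the first symbol, keep the second.
    let ops = [Op::Delete("🅰"), Op::Equal("🅱")];
    println!("{:?}", to_delta(&ops, Unit::ScalarValue)); // "-1\t=1"
    println!("{:?}", to_delta(&ops, Unit::Utf16)); // "-2\t=2"
}
```

The edit script is identical in both cases; only the numbers written into the delta change, which is why the reader and writer have to agree on the unit.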

To make things more complex, libraries that diff UTF-16 encoded strings sometimes split surrogate pairs, which may result in an invalid string.

For example, consider the diff "🅰" -> "🅱". In UTF-16 it is "0xD83C 0xDD70" -> "0xD83C 0xDD71", so only the second code unit differs and the percent-encoded delta is =1\t-1\t+%ED%B5%B1. This splits the surrogate pair, and the halves of a surrogate pair are not valid Unicode scalar values on their own.
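The split can be reproduced with the standard library alone. This is a sketch of the effect, not the code any particular library runs: `common_prefix_units` and `percent_encode_unit` are hypothetical helpers, and the encoder simply applies the three-byte UTF-8 bit layout to a single code unit, which is how a lone surrogate ends up as %ED%B5%B1.

```rust
/// Number of leading UTF-16 code units shared by both inputs.
fn common_prefix_units(a: &[u16], b: &[u16]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Percent-encode one code unit in 0x0800..=0xFFFF with the three-byte
/// UTF-8 bit layout. For a surrogate the bytes are not valid UTF-8.
fn percent_encode_unit(u: u16) -> String {
    let bytes = [
        0xE0 | (u >> 12) as u8,
        0x80 | ((u >> 6) & 0x3F) as u8,
        0x80 | (u & 0x3F) as u8,
    ];
    bytes.iter().map(|b| format!("%{:02X}", b)).collect()
}

fn main() {
    let a: Vec<u16> = "🅰".encode_utf16().collect(); // [0xD83C, 0xDD70]
    let b: Vec<u16> = "🅱".encode_utf16().collect(); // [0xD83C, 0xDD71]

    let keep = common_prefix_units(&a, &b); // only the high surrogate matches
    let inserted = &b[keep..]; // [0xDD71], a lone low surrogate

    // A lone surrogate is neither a valid `char` nor valid UTF-16 text.
    assert!(char::from_u32(inserted[0] as u32).is_none());
    assert!(String::from_utf16(inserted).is_err());

    // Prints the three delta tokens separated by tabs: =1, -1, +%ED%B5%B1
    println!("={}\t-{}\t+{}", keep, a.len() - keep, percent_encode_unit(inserted[0]));
}
```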

In this PR:

the percent_decode_u16 method performs a custom, forgiving percent decoding that keeps split surrogate pairs. Percent decoding with the library fails because part of a surrogate pair is not a valid character.
there are new versions of diff_todelta and diff_from_delta that accept a LengthUnit and encode/decode the delta using the specified unit.
when diff_from_delta runs in UTF-16 mode and decoding a string fails, it makes an additional attempt to recover the destination string and, if that succeeds, runs a new diff.
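For readers who want to see the idea behind the first and third points, here is a rough, independent sketch. It is not the PR's `percent_decode_u16`; the function `forgiving_decode_u16` and its behaviour are assumptions made for illustration. It percent-decodes to bytes, then reads the UTF-8 bit layout leniently so that a three-byte encoded surrogate becomes a single UTF-16 code unit instead of an error. It does not reject overlong encodings.

```rust
/// Hypothetical forgiving decoder: percent-decode, then turn UTF-8-style
/// byte sequences into UTF-16 code units, keeping lone surrogates.
fn forgiving_decode_u16(input: &str) -> Option<Vec<u16>> {
    // Step 1: percent-decode into raw bytes.
    let src = input.as_bytes();
    let mut bytes = Vec::with_capacity(src.len());
    let mut i = 0;
    while i < src.len() {
        if src[i] == b'%' {
            let hex = src.get(i + 1..i + 3)?;
            if !hex.iter().all(u8::is_ascii_hexdigit) {
                return None;
            }
            let hex = std::str::from_utf8(hex).ok()?;
            bytes.push(u8::from_str_radix(hex, 16).ok()?);
            i += 3;
        } else {
            bytes.push(src[i]);
            i += 1;
        }
    }

    // Step 2: decode the byte sequences without rejecting surrogates.
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let b0 = bytes[i];
        let (len, init) = match b0 {
            0x00..=0x7F => (1, b0 as u32),
            0xC0..=0xDF => (2, (b0 & 0x1F) as u32),
            0xE0..=0xEF => (3, (b0 & 0x0F) as u32),
            0xF0..=0xF7 => (4, (b0 & 0x07) as u32),
            _ => return None, // stray continuation byte or invalid lead byte
        };
        let mut cp = init;
        for &b in bytes.get(i + 1..i + len)? {
            if (b & 0xC0) != 0x80 {
                return None;
            }
            cp = (cp << 6) | (b & 0x3F) as u32;
        }
        i += len;
        if cp < 0x10000 {
            out.push(cp as u16); // may be a lone surrogate; kept on purpose
        } else if cp <= 0x10FFFF {
            let v = cp - 0x10000;
            out.push(0xD800 + (v >> 10) as u16);
            out.push(0xDC00 + (v & 0x3FF) as u16);
        } else {
            return None;
        }
    }
    Some(out)
}

fn main() {
    let inserted = forgiving_decode_u16("%ED%B5%B1").expect("decodable");
    assert_eq!(inserted, vec![0xDD71]);

    // Apply "=1\t-1\t+%ED%B5%B1" to "🅰" by hand: keep one unit, drop one,
    // append the insert. The halves recombine into a valid string.
    let source: Vec<u16> = "🅰".encode_utf16().collect();
    let mut dest = source[..1].to_vec();
    dest.extend_from_slice(&inserted);
    assert_eq!(String::from_utf16(&dest).unwrap(), "🅱");
}
```

Once the destination string has been recovered this way, a fresh diff between source and destination in scalar values avoids carrying the split surrogate any further.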

It would be awesome if other libraries also operated on Unicode scalar values like this one, but unfortunately they do not.

p.s. sorry for spamming you with PRs 🙇

ajitk commented 3 years ago

This looks great @eshurakov ! Can you please resolve the conflict due to one of the other PRs?

eshurakov commented 3 years ago

Thank you for looking into this PR @ajitk! I've resolved the conflicts.

p.s. thank you for bringing diff-match-patch library to rust 🎉

ajitk commented 3 years ago

Thank you @eshurakov for helping us make this better. Cheers!

distill-io / diff-match-patch.rs

Decode and encode delta using different length units #10