Closed eshurakov closed 3 years ago
This looks great @eshurakov ! Can you please resolve the conflict due to one of the other PRs?
Thank you for looking into this PR @ajitk! I've resolved the conflicts.
p.s. thank you for bringing diff-match-patch library to rust 🎉
Thank you @eshurakov for helping us make this better. Cheers!
What is the length of "🅰"? Well, it depends...
- 1 when counted in Unicode scalar values
- 2 when counted in UTF-16 code units (it is a surrogate pair)

Most diff-match-patch libraries use UTF-16 under the hood to represent strings, which makes deltas generated by one library incompatible with another library.
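To make the two answers concrete, here is a small sketch using only the Rust standard library. The helper names `len_scalars` and `len_utf16` are made up for illustration and are not part of this library.

```rust
/// Length in Unicode scalar values (the unit this library counts in).
fn len_scalars(s: &str) -> usize {
    s.chars().count()
}

/// Length in UTF-16 code units (the unit most other ports count in).
fn len_utf16(s: &str) -> usize {
    s.encode_utf16().count()
}
```

For "🅰" (U+1F170), `len_scalars` gives 1 and `len_utf16` gives 2, because the character is stored in UTF-16 as the surrogate pair 0xD83C 0xDD70.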
For example, running a diff "🅰🅱" -> "🅱" in python will produce the following delta: `-2\t=2`. This library operates on scalar values, which makes it impossible to interpret this delta; the delta produced by this library would be `-1\t=1`.

To make things more complex, libraries that perform diff on UTF-16 encoded strings sometimes split surrogate pairs, which may result in an invalid string.
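The `-2\t=2` versus `-1\t=1` mismatch above comes down to translating a length between units. A rough sketch of that translation, assuming the length is measured from the start of the text, might look like this. `utf16_len_to_scalars` is a hypothetical helper, not a function from this PR.

```rust
/// Converts a length given in UTF-16 code units, measured from the start of
/// `text`, into a count of Unicode scalar values. Returns `None` if the
/// length ends inside a surrogate pair or runs past the end of `text`.
/// (Hypothetical helper for illustration only.)
fn utf16_len_to_scalars(text: &str, utf16_len: usize) -> Option<usize> {
    let mut units = 0;
    let mut scalars = 0;
    for c in text.chars() {
        if units == utf16_len {
            break;
        }
        units += c.len_utf16();
        scalars += 1;
        if units > utf16_len {
            // The boundary falls in the middle of a surrogate pair.
            return None;
        }
    }
    if units == utf16_len {
        Some(scalars)
    } else {
        None
    }
}
```

With "🅰🅱" as the source text, a UTF-16 length of 2 maps to 1 scalar value, while a length of 1 has no scalar equivalent because it cuts the surrogate pair in half. That second case is the one that needs special handling.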
For example, take the diff "🅰" -> "🅱". In UTF-16 it is "0xD83C 0xDD70" -> "0xD83C 0xDD71": only the second code unit differs, and the percent-encoded delta is `=1\t-1\t+%ED%B5%B1`. It splits the surrogate pair, and the halves of a surrogate pair are not valid Unicode scalar values on their own.

In this PR:
- The `percent_decode_u16` method performs a custom, forgiving percent decoding that keeps split surrogate pairs. Percent encoding from the library throws an error because part of a surrogate pair is not a valid character.
- `diff_todelta` and `diff_from_delta` that accept `LengthUnit` and try to encode/decode the delta using the specified unit.
- If `diff_from_delta` runs in UTF-16 mode and string decoding fails, it makes an additional attempt to recover the destination string and, if successful, runs a new diff.

It would be awesome if other libraries also operated on Unicode scalar values like this one, but unfortunately they do not.
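For readers who want to see the idea of forgiving decoding in isolation, here is a minimal standalone sketch. It is not the `percent_decode_u16` implementation from this PR: the function names are hypothetical, it assumes the escaped bytes are UTF-8-like with lone surrogates written as three-byte sequences (as in `%ED%B5%B1`), and it skips overlong-encoding checks for brevity.

```rust
/// Percent-decodes `input` into raw bytes; `None` on a malformed escape.
/// (Hypothetical helper, not the PR's code.)
fn percent_decode_bytes(input: &str) -> Option<Vec<u8>> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' {
            // Expect exactly two hex digits after '%'.
            let hex = input.get(i + 1..i + 3)?;
            if !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
                return None;
            }
            out.push(u8::from_str_radix(hex, 16).ok()?);
            i += 3;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    Some(out)
}

/// Decodes UTF-8-like bytes into UTF-16 code units, keeping lone surrogates
/// (three-byte sequences for U+D800..=U+DFFF) instead of rejecting them.
/// Sketch only: overlong encodings are not rejected.
fn forgiving_decode_utf16(bytes: &[u8]) -> Option<Vec<u16>> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let b0 = bytes[i];
        let (len, init) = match b0 {
            0x00..=0x7F => (1, b0 as u32),
            0xC0..=0xDF => (2, (b0 & 0x1F) as u32),
            0xE0..=0xEF => (3, (b0 & 0x0F) as u32),
            0xF0..=0xF7 => (4, (b0 & 0x07) as u32),
            _ => return None, // stray continuation byte or invalid lead byte
        };
        let tail = bytes.get(i + 1..i + len)?;
        let mut cp = init;
        for &b in tail {
            if b & 0xC0 != 0x80 {
                return None; // not a continuation byte
            }
            cp = (cp << 6) | (b & 0x3F) as u32;
        }
        match cp {
            // BMP value, including a lone surrogate: keep it as one unit.
            0..=0xFFFF => out.push(cp as u16),
            // Supplementary plane: emit a proper surrogate pair.
            0x1_0000..=0x10_FFFF => {
                let v = cp - 0x1_0000;
                out.push(0xD800 + (v >> 10) as u16);
                out.push(0xDC00 + (v & 0x3FF) as u16);
            }
            _ => return None,
        }
        i += len;
    }
    Some(out)
}

/// Forgiving percent decoding straight to UTF-16 code units.
fn percent_decode_utf16(input: &str) -> Option<Vec<u16>> {
    forgiving_decode_utf16(&percent_decode_bytes(input)?)
}
```

Decoding `%ED%B5%B1` this way yields the single code unit 0xDD71, which `String::from_utf16` rejects on its own. Appending it to the kept unit 0xD83C from "🅰" gives 0xD83C 0xDD71, which decodes cleanly to "🅱". That is the kind of recovery the last bullet describes.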
p.s. sorry for spamming you with PRs 🙇