bastianbin / google-diff-match-patch

Automatically exported from code.google.com/p/google-diff-match-patch
Apache License 2.0

UnicodeDecodeError in Python version when using Cyrillic characters. #9

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
I want to use Cyrillic characters with diff_match_patch (Python version,
release), but I get errors like:
"UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 0:
unexpected end of data"

Appending ".decode("utf-8").encode("utf-8")" to strings in some places
seems to solve the problem, but I guess not 100%.
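
For reference, one way this exact message can arise is decoding a byte string that was cut in the middle of a two-byte Cyrillic character. The sketch below is written for Python 3 (an assumption; the report predates it) and uses only built-in string methods. The sample word is illustrative and not taken from the report, and this is not necessarily the code path that fails inside diff_match_patch.

```python
# Each Cyrillic letter occupies two bytes in UTF-8.
data = "Привет".encode("utf-8")

# Slicing bytes instead of characters can split a letter in half.
first_byte = data[:1]  # only the lead byte of the first letter

try:
    first_byte.decode("utf-8")
    reason = None
except UnicodeDecodeError as err:
    # The decoder reports that the sequence ended too early.
    reason = err.reason

# Decoding the complete byte string and re-encoding it is lossless.
round_trip = data.decode("utf-8").encode("utf-8")
```

Slicing the decoded text rather than the encoded bytes avoids splitting a character.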

See the attached patch (and, just in case, the new file).

Alexandr.

Original issue reported on code.google.com by sashul...@gmail.com on 11 May 2008 at 1:12

GoogleCodeExporter commented 8 years ago
Thank you Alexandr for the bug report and the patches.  Sorry for the delay.  I have
fixed the Unicode issues in diff_fromDelta and patch_fromText.  In both cases I added:
    if type(text) == unicode:
      text = text.encode("ascii")
These are two functions which are expecting a subset of ASCII characters.
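
As a rough Python 3 analogue of that check (the original targets Python 2, where `unicode` is a separate type), the sketch below coerces text to ASCII bytes. The helper name `to_ascii_bytes` is hypothetical and not part of diff_match_patch; the point is only that percent-encoded text passes the conversion while raw Cyrillic does not.

```python
from urllib.parse import quote


def to_ascii_bytes(text):
    """Hypothetical helper: return ASCII bytes for str input.

    Raises UnicodeEncodeError if the text contains non-ASCII characters.
    Bytes input is returned unchanged.
    """
    if isinstance(text, str):
        return text.encode("ascii")
    return text


# Percent-encoding the UTF-8 bytes first keeps the text inside ASCII.
encoded = to_ascii_bytes(quote("я"))

# Raw Cyrillic is rejected by the ASCII codec.
try:
    to_ascii_bytes("я")
    rejected = False
except UnicodeEncodeError:
    rejected = True
```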

However, your patch also made changes to diff_text1, diff_text2, patch_apply
and patch_obj.__str__.  Despite many tests, I am unable to find scenarios where
the existing code fails when passed Unicode.  An example test case would be most
appreciated.

In the meantime, I've pushed out a new version which includes the Unicode fixes for
diff_fromDelta and patch_fromText in the Python version, as well as a new unit test
in all three versions which verifies the behaviour of invalid Unicode sequences
(e.g. %c3%xy).
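
A minimal sketch of what rejecting such a sequence could look like in Python 3, using the standard `urllib.parse.unquote_to_bytes` followed by a strict UTF-8 decode. The helper `strict_unquote` is hypothetical; it is not the library's own unit test or implementation.

```python
from urllib.parse import unquote_to_bytes


def strict_unquote(text):
    """Hypothetical helper: percent-decode text, rejecting invalid UTF-8."""
    # Invalid escapes such as "%xy" are left in place as literal bytes.
    raw = unquote_to_bytes(text)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        raise ValueError("invalid UTF-8 in percent-encoded text: %r" % text) from err


# A well-formed two-byte sequence decodes to one Cyrillic letter.
valid = strict_unquote("%d1%8f")

# "%c3" starts a two-byte sequence, but "%xy" is not a continuation byte.
try:
    strict_unquote("%c3%xy")
    invalid_rejected = False
except ValueError:
    invalid_rejected = True
```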

Original comment by neil.fra...@gmail.com on 14 May 2008 at 7:47