ftilmann / latexdiff

Compares two latex files and marks up significant differences between them. Releases on www.ctan.org and mirrors
GNU General Public License v3.0

latexdiff crashes with "utf8 "\xF6" does not map to Unicode at /usr/local/texlive/2016/bin/x86_64-linux/latexdiff line 1409, <FILE> chunk 1." #78

Open joha1 opened 7 years ago

joha1 commented 7 years ago

I have a fresh install of TeX Live 2016 (installed as described on https://www.tug.org/texlive/quickinstall.html, not via my distro's (CentOS 7) package manager), and when I try to compare two files, latexdiff crashes, starting with the above error:

$ latexdiff final1.tex difftest.tex > diffoutput.tex
utf8 "\xF6" does not map to Unicode at /usr/local/texlive/2016/bin/x86_64-linux/latexdiff line 1409, <FILE> chunk 1.
utf8 "\xF6" does not map to Unicode at /usr/local/texlive/2016/bin/x86_64-linux/latexdiff line 1409, <FILE> chunk 1.
Malformed UTF-8 character (unexpected non-continuation byte 0x6e, immediately after start byte 0xf6) in substitution iterator at /usr/local/texlive/2016/bin/x86_64-linux/latexdiff line 2331.
Malformed UTF-8 character (unexpected non-continuation byte 0x62, immediately after start byte 0xfc) in substitution iterator at /usr/local/texlive/2016/bin/x86_64-linux/latexdiff line 2331.
Malformed UTF-8 character (unexpected non-continuation byte 0x68, immediately after start byte 0xf6) in substitution iterator at /usr/local/texlive/2016/bin/x86_64-linux/latexdiff line 2331.
Malformed UTF-8 character (unexpected non-continuation byte 0x62, immediately after start byte 0xfc) in substitution iterator at /usr/local/texlive/2016/bin/x86_64-linux/latexdiff line 2331.
Malformed UTF-8 character (unexpected non-continuation byte 0x68, immediately after start byte 0xe4) in substitution iterator at /usr/local/texlive/2016/bin/x86_64-linux/latexdiff line 2331.

…and so on. The pattern 0x6e, 0x62, 0x68, 0x62, 0x68, 0x62, 0x68, 0x6e repeats a few dozen times, then the messages switch to line 2332 and repeat the same pattern for a while, then continue with some other lines, until the run stops with

Malformed UTF-8 character (unexpected continuation byte 0x85, with no preceding start byte) in pattern match (m//) at /usr/local/texlive/2016/bin/x86_64-linux/latexdiff line 2274.
Malformed UTF-8 character (unexpected continuation byte 0x85, with no preceding start byte) in pattern match (m//) at /usr/local/texlive/2016/bin/x86_64-linux/latexdiff line 2274.
Malformed UTF-8 character (unexpected non-continuation byte 0x6e, immediately after start byte 0xf6) in pattern match (m//) at /usr/local/texlive/2016/bin/x86_64-linux/latexdiff line 1752.
Malformed UTF-8 character (fatal) at /usr/local/texlive/2016/bin/x86_64-linux/latexdiff line 1752.

Looking at the latexdiff source file, I don't see anything unusual in line 1409 or in line 2331 and the few lines after it. When I compare the copy installed by TeX Live with the one available here, the lines in question are identical (there are just a few more comments in the GitHub file around my line 2331, which is GitHub line 2152).

What's happening here, and how do I get latexdiff to run?

ftilmann commented 7 years ago

Can you make your input files available (I must be able to download the exact bytes), ideally reduced to a minimal set that still reproduces the error? Copying and pasting into GitHub will most likely leave me unable to reproduce the error. Also, please run latexdiff --version and post the output. I suspect that the file uses a different character encoding than latexdiff (or rather Perl) assumes. If you know the encoding, you can specify it with the --encoding option. The guessing algorithm in latexdiff is extremely simple so is not that reliable.

joha1 commented 7 years ago

I can't reproduce the exact error above since I lost the system in a total crash, but on my Fedora system I get something similar.

I don't know why this did not occur to me earlier, but I just remembered that \xF6 is the hex code for an umlaut (ö in this case). I only use umlauts in comments (I'm a German speaker): in the main text I write \"o instead, since umlauts often break things there, but in comments I typed them directly for convenience.
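The byte values in the log are consistent with that reading. The following is a minimal Python sketch (not part of latexdiff, which is a Perl script) that decodes the byte pairs named in the error messages, assuming a Latin-1-family encoding, and checks whether each pair is valid UTF-8. The helper name describe is hypothetical.

```python
def describe(raw):
    """Return (Latin-1 reading of raw, whether raw is valid UTF-8)."""
    try:
        raw.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    # Latin-1 maps every byte to exactly one character, so this never fails.
    return raw.decode("latin-1"), valid_utf8


# "Start byte" followed by "non-continuation byte", as listed in the log.
for pair in (b"\xf6n", b"\xfcb", b"\xf6h", b"\xe4h"):
    text, valid_utf8 = describe(pair)
    print(f"{pair!r}: Latin-1 reading {text!r}, valid UTF-8: {valid_utf8}")
```

Read as Latin-1, the bytes 0xF6, 0xFC and 0xE4 are ö, ü and ä, each followed by an ordinary ASCII letter. Read as UTF-8, the same bytes look like the start of a multi-byte sequence that is never completed, which matches the "unexpected non-continuation byte" wording.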

If I remove any Umlaute from the comments, it works.

However, I don't quite understand why latexdiff does not point me to the offending line in my tex file. That would have made it obvious what was happening and how to fix it on my side.
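Until latexdiff reports this itself, the offending lines can be located outside latexdiff. The sketch below is a standalone Python helper, not a latexdiff feature; the function name find_non_utf8_lines is hypothetical, and it assumes the file is meant to be UTF-8. It reads the file as raw bytes and reports every line that fails to decode.

```python
import os
import tempfile


def find_non_utf8_lines(path):
    """Yield (line_number, column, raw_line) for lines that are not valid UTF-8."""
    with open(path, "rb") as handle:  # binary mode: nothing is decoded yet
        for number, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                # err.start is the 0-based offset of the first bad byte in the line.
                yield number, err.start + 1, raw.rstrip(b"\r\n")


# Demo on a throwaway file: line 2 has Latin-1 umlauts in a comment.
with tempfile.NamedTemporaryFile("wb", suffix=".tex", delete=False) as tmp:
    tmp.write(b"\\section{Intro}\n% k\xf6nnen wir das k\xfcrzen?\nText.\n")

for number, column, raw in find_non_utf8_lines(tmp.name):
    # Show the line as Latin-1 so the offending character is readable.
    print(f"line {number}, column {column}: {raw.decode('latin-1')}")

os.remove(tmp.name)
```

Only the first bad byte per line is reported, which is usually enough to find the character and either replace it or re-save the file in the intended encoding.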

It could make sense to add a warning about the fact that "the guessing algorithm in latexdiff is extremely simple so is not that reliable."

ftilmann commented 7 years ago

If you find out your encoding (for example latin9) and put \usepackage[latin9]{inputenc} in the first 20 lines of your preamble (obviously replacing latin9 as appropriate), then it should work (to be sure, don't use Umlaute in the first 20 lines), and you could also use Umlaute in the text instead of \"o. The problem is that the error is triggered by Perl's regular expression scanning, so to deal with this in latexdiff I would have to add proper error handling (e.g. with the Carp module) and somehow find out which character causes the problem (this all happens within the Perl regex matching). Given this effort, the fact that the fix on the user side is fairly straightforward, and the fact that the improvement from such error handling would be only moderate (a better error message), I will not do this, but I will leave the issue open for now to give other people the chance to comment.
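To check whether a file already carries such a declaration early enough, something like the following Python sketch could be used. It is an illustration only and not latexdiff's actual detection code: the 20-line window comes from the comment above, inputenc is the standard LaTeX package for declaring the input encoding, and the regular expression and the function name declared_encoding are assumptions.

```python
import re

# Matches e.g. \usepackage[latin9]{inputenc} and captures the option.
INPUTENC = re.compile(r"\\usepackage\[([^\]]+)\]\{inputenc\}")


def declared_encoding(path, max_lines=20):
    """Return the inputenc option found in the first max_lines lines, or None."""
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            if number > max_lines:
                break
            # Decoding as Latin-1 cannot fail, whatever the real encoding is.
            line = raw.decode("latin-1")
            # Crude comment stripping: ignore everything after the first "%".
            code = line.split("%", 1)[0]
            match = INPUTENC.search(code)
            if match:
                return match.group(1)
    return None
```

For example, declared_encoding("final1.tex") would return "latin9" for a preamble that declares it within the first 20 lines, and None if the declaration is missing or appears later.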

ftilmann commented 7 years ago

Maybe you can try --encoding=ascii option to latexdiff (or the encoding your text uses, if you know it). If the offending characters are just in the comments I expect this to work (no guarantee, though) (don't combine this with the previous solution, though)
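Whether the offending characters really are confined to comments can be checked before relying on this. The Python sketch below is a standalone helper, not part of latexdiff; the name non_ascii_outside_comments is hypothetical, and it assumes that a comment starts at the first unescaped % on a line (it does not understand verbatim environments).

```python
import re

# A "%" starts a comment unless a single backslash escapes it;
# an even run of backslashes (e.g. "\\%") leaves the "%" unescaped.
COMMENT_START = re.compile(rb"(?<!\\)(?:\\\\)*%")


def non_ascii_outside_comments(path):
    """Return the line numbers whose non-comment part contains a non-ASCII byte."""
    hits = []
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            match = COMMENT_START.search(raw)
            # Keep only the part before the "%" that opens the comment.
            code = raw[: match.end() - 1] if match else raw
            if any(byte > 0x7F for byte in code):
                hits.append(number)
    return hits
```

An empty list means every non-ASCII byte sits inside a comment, which is the case described above. Any listed line has non-ASCII bytes in the text itself, so the file's real encoding is probably the better value for --encoding.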

linzgood commented 6 years ago

D:\Program Files (x86)\CTEX\MiKTeX\scripts\latexdiff> latexdiff original.tex revised.tex>diff.tex

utf8 "\xA1" does not map to Unicode at D:\Program Files (x86)\CTEX\UserData\scripts\latexdiff\latexdiff line 1107, chunk 1.

Malformed UTF-8 character (fatal) at D:\Program Files (x86)\CTEX\UserData\scripts\latexdiff\latexdiff line 1387.

dcnicholls commented 6 years ago

> Maybe you can try --encoding=ascii option to latexdiff (or the encoding your text uses, if you know it). If the offending characters are just in the comments I expect this to work (no guarantee, though) (don't combine this with the previous solution, though)

That worked for me for the same error.

peteflorence commented 5 years ago

+1 for --encoding=ascii solving my problem

by the way, thank you for developing latexdiff!