ftilmann / latexdiff

Compares two latex files and marks up significant differences between them. Releases on www.ctan.org and mirrors
GNU General Public License v3.0

“encoding” Problem #195

Closed Key033 closed 1 year ago

Key033 commented 4 years ago

The version of latexdiff is:

This is LATEXDIFF 1.3.0 (Algorithm::Diff 1.15 so, Perl v5.28.1)
(c) 2004-2018 F J Tilmann
Preamble Internal Type UNDERLINE
Preamble Internal Type SAFE
Preamble Internal Type FLOATSAFE

I am working on Windows 10, version 1909.

When I run latexdiff with a command such as "latexdiff old.tex new.tex > diff.tex" or "latexdiff --encoding=utf8 old.tex new.tex > diff.tex", the resulting "diff.tex" is encoded in UTF-16 LE, even though "old.tex" and "new.tex" are encoded in UTF-8. UTF-8 characters such as Chinese and Japanese text come out garbled.

For example, "old.tex"

\documentclass{article}
\usepackage[UTF8]{ctex}
\begin{document}
你好,这是一个测试文档。
\end{document}

"new.tex"

\documentclass{article}
\usepackage[UTF8]{ctex}
\begin{document}
你好,这是一个新的测试文档。
\end{document}

"diff.tex"

\documentclass{article}
%DIF LATEXDIFF DIFFERENCE FILE
%DIF DEL old.tex   Sat Apr  4 22:12:08 2020
%DIF ADD new.tex   Sat Apr  4 22:12:03 2020
\usepackage[UTF8]{ctex}
%DIF PREAMBLE EXTENSION ADDED BY LATEXDIFF
%DIF UNDERLINE PREAMBLE %DIF PREAMBLE
\RequirePackage[normalem]{ulem} %DIF PREAMBLE
\RequirePackage{color}\definecolor{RED}{rgb}{1,0,0}\definecolor{BLUE}{rgb}{0,0,1} %DIF PREAMBLE
\providecommand{\DIFadd}[1]{{\protect\color{blue}\uwave{#1}}} %DIF PREAMBLE
\providecommand{\DIFdel}[1]{{\protect\color{red}\sout{#1}}}                      %DIF PREAMBLE
%DIF SAFE PREAMBLE %DIF PREAMBLE
\providecommand{\DIFaddbegin}{} %DIF PREAMBLE
\providecommand{\DIFaddend}{} %DIF PREAMBLE
\providecommand{\DIFdelbegin}{} %DIF PREAMBLE
\providecommand{\DIFdelend}{} %DIF PREAMBLE
\providecommand{\DIFmodbegin}{} %DIF PREAMBLE
\providecommand{\DIFmodend}{} %DIF PREAMBLE
%DIF FLOATSAFE PREAMBLE %DIF PREAMBLE
\providecommand{\DIFaddFL}[1]{\DIFadd{#1}} %DIF PREAMBLE
\providecommand{\DIFdelFL}[1]{\DIFdel{#1}} %DIF PREAMBLE
\providecommand{\DIFaddbeginFL}{} %DIF PREAMBLE
\providecommand{\DIFaddendFL}{} %DIF PREAMBLE
\providecommand{\DIFdelbeginFL}{} %DIF PREAMBLE
\providecommand{\DIFdelendFL}{} %DIF PREAMBLE
%DIF LISTINGS PREAMBLE %DIF PREAMBLE
\RequirePackage{listings} %DIF PREAMBLE
\RequirePackage{color} %DIF PREAMBLE
\lstdefinelanguage{DIFcode}{ %DIF PREAMBLE
%DIF DIFCODE_UNDERLINE %DIF PREAMBLE
  moredelim=[il][\color{red}\sout]{\%DIF\ <\ }, %DIF PREAMBLE
  moredelim=[il][\color{blue}\uwave]{\%DIF\ >\ } %DIF PREAMBLE
} %DIF PREAMBLE
\lstdefinestyle{DIFverbatimstyle}{ %DIF PREAMBLE
    language=DIFcode, %DIF PREAMBLE
    basicstyle=\ttfamily, %DIF PREAMBLE
    columns=fullflexible, %DIF PREAMBLE
    keepspaces=true %DIF PREAMBLE
} %DIF PREAMBLE
\lstnewenvironment{DIFverbatim}{\lstset{style=DIFverbatimstyle}}{} %DIF PREAMBLE
\lstnewenvironment{DIFverbatim*}{\lstset{style=DIFverbatimstyle,showspaces=true}}{} %DIF PREAMBLE
%DIF END PREAMBLE EXTENSION ADDED BY LATEXDIFF

\begin{document}
浣犲ソ锛孿DIFdelbegin \DIFdel{杩欐槸涓€涓祴璇曟枃妗c€?
 }\DIFdelend \DIFaddbegin \DIFadd{杩欐槸涓€涓柊鐨勬祴璇曟枃妗c€?
 }\DIFaddend\end{document}
Key033 commented 4 years ago

I found that if old.tex and new.tex are encoded as UTF-8 with BOM, diff.tex contains the correct characters. It is still encoded as UTF-16, but it can easily be re-encoded to UTF-8.

ftilmann commented 4 years ago

So is it solved? What is BOM?

Key033 commented 4 years ago

So is it solved? What is BOM?

The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess that a file is encoded in UTF-8. Ref: https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom
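
As an illustration of the byte sequence described above, here is a minimal Python sketch that checks for a UTF-8 BOM and prepends one if it is missing. The helper names has_utf8_bom and add_utf8_bom are made up for this example; they are not part of latexdiff or of any tool mentioned in this thread.

```python
import codecs

# codecs.BOM_UTF8 is the three-byte sequence 0xEF, 0xBB, 0xBF.


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 BOM unless it is already present (hypothetical helper)."""
    return data if has_utf8_bom(data) else codecs.BOM_UTF8 + data


sample = "你好".encode("utf-8")            # plain UTF-8, no BOM
with_bom = add_utf8_bom(sample)

print(has_utf8_bom(sample))                # no BOM on the plain bytes
print(has_utf8_bom(with_bom))              # BOM present after the helper ran
```

To apply this to a file, one could read it with open(path, "rb"), pass the bytes through add_utf8_bom, and write them back in binary mode; this is roughly what an editor's "save as UTF-8 with BOM" option does.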

I re-encoded the files using VS Code's "Save with Encoding" function.

I suspect there is something wrong with the variable $encoding, but I don't know Perl.

ftilmann commented 4 years ago

Thanks for this report. The encoding is mostly dealt with by Perl and (as you could see from my question) I have no real insight into encodings. So I will not tackle this anytime soon, but I will leave the issue open in case anyone has an insight.

henrysky commented 2 years ago

I have just encountered this issue. You should use the good old CMD on Windows, or PowerShell 6.2+, because the default PowerShell in Windows 10/11 writes the output file in UTF-16 when you use >. Sometimes it is not as simple as re-encoding to UTF-8: a character like é in a .tex file turns into gibberish (├⌐) when latexdiff is run in PowerShell <6.2, and it cannot be recovered even by re-encoding to UTF-8. I would say nothing is wrong with latexdiff or Perl.
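
The effect described above can be reproduced in a short Python sketch. It rests on two assumptions that this thread does not confirm: that the console code page is 437, and that the redirect decodes the program's UTF-8 bytes with that code page before writing UTF-16 LE. Under those assumptions it shows why simply re-encoding the file to UTF-8 does not bring é back. The last lines apply the same idea with code page 936 (GBK), which appears to match the garbled Chinese in the original report.

```python
# Simulation only: no PowerShell is involved, the steps are modelled on bytes.
original = "é"
utf8_bytes = original.encode("utf-8")           # what latexdiff writes: b'\xc3\xa9'

# Assumed step 1: each byte is misread as one character of code page 437.
misread = utf8_bytes.decode("cp437")

# Assumed step 2: the misread text is written to disk as UTF-16 LE.
utf16_file = misread.encode("utf-16-le")

# Re-encoding the file to UTF-8 afterwards keeps the wrong characters.
reencoded = utf16_file.decode("utf-16-le").encode("utf-8")

print(misread)
print(reencoded.decode("utf-8") == original)

# Same misreading with code page 936 (GBK), as on a Chinese-language system.
misread_gbk = "你好".encode("utf-8").decode("gbk")
print(misread_gbk)
```

The damage happens before the file is written, which is consistent with the advice to change the shell rather than latexdiff.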

jonschz commented 1 year ago

Edit: The command below works, but it also breaks UTF-8 characters. I will stick with cmd and consider adding this to the FAQ.

You can use the following in PowerShell to get a UTF-8 output file, but it will still break when there are non-ASCII characters in the .tex files.

latexdiff a.tex b.tex | Out-File output.tex -Encoding utf8
jonschz commented 1 year ago

Edit

The bigger issue seems to be that PowerShell does not use Unicode when piping the output of one command into another, see https://markw.dev/unicode_powershell/. I was able to get latexdiff to work in PowerShell using the following:

> [Console]::OutputEncoding = [System.Text.Encoding]::UTF8
> latexdiff .\latex_test_files\utf8_a.tex .\latex_test_files\utf8_b.tex | Out-File -Encoding utf8 out.tex

I would still recommend using cmd instead, and I will work on the pull request now.

Original text

Addendum: It appears that this is a known problem with Perl in general under Windows.

See e.g. https://stackoverflow.com/a/66281302 and https://github.com/StrawberryPerl/Perl-Dist-Strawberry/issues/18.

See also https://stackoverflow.com/q/4942305; many other languages like Python and Node.js have since solved this issue.

I experimented a bit in Perl and tried a few things, but there seems to be no working pure-Perl solution. It also seems that the Perl developers cannot easily change this, as it would break legacy code.

Solution for now

It seems best to just use cmd under Windows. Maybe I'll create a pull request to update the documentation.

Future

I have two ideas for how one could mitigate this problem:

  1. One could implement direct output to files, like latexdiff --outfile=out.tex a.tex b.tex. I suspect this will be quite a bit of work to implement, though.
  2. Another (hypothetical) possibility is to modify the latexdiff.exe wrapper to fix the output. Not sure how complicated that will be.
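
As a rough illustration of the wrapper idea, here is a Python sketch that runs a command and writes its raw standard output straight to a file, so the shell's redirection never touches the bytes. This is not the latexdiff.exe wrapper and not an existing latexdiff feature. The function name run_to_file and the script name in the usage comment are invented for this example, and it assumes a latexdiff executable can be found on the PATH.

```python
import subprocess
import sys


def run_to_file(cmd, out_path):
    """Run cmd and write its stdout bytes, unmodified, to out_path.

    The file is opened in binary mode and handed to the child process,
    so no shell or console re-encoding is applied to the output.
    """
    with open(out_path, "wb") as out:
        result = subprocess.run(cmd, stdout=out)
    return result.returncode


if __name__ == "__main__":
    # Hypothetical usage: python run_latexdiff.py old.tex new.tex diff.tex
    if len(sys.argv) == 4:
        old, new, target = sys.argv[1:4]
        sys.exit(run_to_file(["latexdiff", old, new], target))
```

Because the child process writes directly into the open file handle, the output should keep whatever encoding latexdiff produced, regardless of which shell starts the script.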
sgbaird commented 1 year ago

xref: https://tex.stackexchange.com/questions/542161/error-in-texstudio-when-using-latexdiff-on-windows-10#comment1652779_542161