ftilmann / latexdiff

Compares two latex files and marks up significant differences between them. Releases on www.ctan.org and mirrors
GNU General Public License v3.0

Better general support for CJK languages #229

Closed: xlucn closed this 2 years ago

xlucn commented 3 years ago

The issue is that CJK characters enclosed in \DIFadd{} or \DIFdel{} do not wrap at the end of lines. This is the first time I have used latexdiff, so I am not sure if I am missing anything. I am using the ctex package to provide the CJK environment.

I am using the texlive-core package, version 2020.57066, on Arch Linux, so the latexdiff here is not the latest git version. Here is a minimal working example.

The generated PDF file, showing the issue: templ-zh-diff.pdf. This is the TeX file generated by latexdiff; the only change I made between the two inputs is that the line got longer by repeating the same phrase multiple times:

%! Tex program = xelatex
%DIF LATEXDIFF DIFFERENCE FILE
%DIF DEL templ-zh.tex       Sun Apr 25 16:36:07 2021
%DIF ADD templ-zh-new.tex   Sun Apr 25 16:37:20 2021
\documentclass{article}
\usepackage[fontset=none]{ctex}
\setCJKmainfont{Noto Serif CJK SC}
%DIF PREAMBLE EXTENSION ADDED BY LATEXDIFF
%DIF UNDERLINE PREAMBLE %DIF PREAMBLE
\RequirePackage[normalem]{ulem} %DIF PREAMBLE
\RequirePackage{color}\definecolor{RED}{rgb}{1,0,0}\definecolor{BLUE}{rgb}{0,0,1} %DIF PREAMBLE
\providecommand{\DIFadd}[1]{{\protect\color{blue}\uwave{#1}}} %DIF PREAMBLE
\providecommand{\DIFdel}[1]{{\protect\color{red}\sout{#1}}}                      %DIF PREAMBLE
%DIF SAFE PREAMBLE %DIF PREAMBLE
\providecommand{\DIFaddbegin}{} %DIF PREAMBLE
\providecommand{\DIFaddend}{} %DIF PREAMBLE
\providecommand{\DIFdelbegin}{} %DIF PREAMBLE
\providecommand{\DIFdelend}{} %DIF PREAMBLE
\providecommand{\DIFmodbegin}{} %DIF PREAMBLE
\providecommand{\DIFmodend}{} %DIF PREAMBLE
%DIF FLOATSAFE PREAMBLE %DIF PREAMBLE
\providecommand{\DIFaddFL}[1]{\DIFadd{#1}} %DIF PREAMBLE
\providecommand{\DIFdelFL}[1]{\DIFdel{#1}} %DIF PREAMBLE
\providecommand{\DIFaddbeginFL}{} %DIF PREAMBLE
\providecommand{\DIFaddendFL}{} %DIF PREAMBLE
\providecommand{\DIFdelbeginFL}{} %DIF PREAMBLE
\providecommand{\DIFdelendFL}{} %DIF PREAMBLE
%DIF LISTINGS PREAMBLE %DIF PREAMBLE
\RequirePackage{listings} %DIF PREAMBLE
\RequirePackage{color} %DIF PREAMBLE
\lstdefinelanguage{DIFcode}{ %DIF PREAMBLE
%DIF DIFCODE_UNDERLINE %DIF PREAMBLE
  moredelim=[il][\color{red}\sout]{\%DIF\ <\ }, %DIF PREAMBLE
  moredelim=[il][\color{blue}\uwave]{\%DIF\ >\ } %DIF PREAMBLE
} %DIF PREAMBLE
\lstdefinestyle{DIFverbatimstyle}{ %DIF PREAMBLE
    language=DIFcode, %DIF PREAMBLE
    basicstyle=\ttfamily, %DIF PREAMBLE
    columns=fullflexible, %DIF PREAMBLE
    keepspaces=true %DIF PREAMBLE
} %DIF PREAMBLE
\lstnewenvironment{DIFverbatim}{\lstset{style=DIFverbatimstyle}}{} %DIF PREAMBLE
\lstnewenvironment{DIFverbatim*}{\lstset{style=DIFverbatimstyle,showspaces=true}}{} %DIF PREAMBLE
%DIF END PREAMBLE EXTENSION ADDED BY LATEXDIFF

\begin{document}
  这是中文 \DIFaddbegin \DIFadd{这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文
 }\DIFaddend\end{document}

The new-version PDF, compiled without latexdiff: templ-zh-new.pdf. The new-version TeX file:

%! Tex program = xelatex
\documentclass{article}
\usepackage[fontset=none]{ctex}
\setCJKmainfont{Noto Serif CJK SC}
\begin{document}
  这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文 这是中文
\end{document}

I compiled them with

latexmk -xelatex <texfile>
xlucn commented 3 years ago

Actually, there are more issues with CJK content; latexdiff may be lacking overall support for CJK languages. Of course, again, maybe I missed something that would fix it.

I noticed that latexdiff does not recognize smaller blocks of differences. CJK sentences do not have spaces between words; perhaps because of this, latexdiff only marks differences at the granularity of a whole block containing no spaces or other non-CJK characters.

As an example, if I have two tex files that differ only in one sentence:

一一一一一一一一一一一一
一一一一一一二一一一一一

latexdiff will give me

\DIFdelbegin \DIFdel{一一一一一一一一一一一一
 }\DIFdelend \DIFaddbegin \DIFadd{一一一一一一二一一一一一
 }\DIFaddend

instead of something like

一一一一一一\DIFdelbegin \DIFdel{一}\DIFdelend \DIFaddbegin \DIFadd{二}\DIFaddend一一一一一
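The granularity difference described above can be sketched outside of Perl. The following is a minimal Python illustration, not latexdiff's actual algorithm; it uses the explicit range U+4E00-U+9FFF as a stand-in for Perl's `\p{Han}` and Python's difflib in place of latexdiff's diff engine:

```python
import difflib
import re

def tokens(text, per_char):
    """Tokenize roughly like latexdiff's word regex: CJK runs or single CJK chars."""
    cjk = r'[\u4e00-\u9fff]' + ('' if per_char else '+')
    return re.findall(cjk + r'|\S+', text)

old = "一" * 12
new = "一" * 6 + "二" + "一" * 5

# Run-based tokens: each line is a single token, so the whole line is replaced.
sm = difflib.SequenceMatcher(a=tokens(old, False), b=tokens(new, False), autojunk=False)
print(sm.get_opcodes())  # a single 'replace' covering everything

# Per-character tokens: only one character's worth of change is flagged.
sm = difflib.SequenceMatcher(a=tokens(old, True), b=tokens(new, True), autojunk=False)
print([op for op in sm.get_opcodes() if op[0] != 'equal'])
```

With run-based tokens, the two twelve-character lines are each one token, so the diff can only report a wholesale replacement; with per-character tokens, the change is localized to a single character.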
ftilmann commented 2 years ago

Thank you for highlighting this deficiency. You reported two issues; I will answer the second one first (about the granularity of marked differences). Actually, I cannot see the characters in your second example: everything is replaced by dashes; it seems support for this encoding is lacking in either GitHub or my browser. I can see the characters in the first post, though.

You are right: latexdiff was initially developed without CJK support in mind, as I don't have access to examples to develop against, and no understanding of the language conventions myself. But a similar issue arose for Japanese text, and with the help of another user I believe this is now treated more correctly. It should work for Han script as well. Are your symbols Han, or another script? If they are Han, then it is likely this is fixed in newer versions, and you just have to be patient. If they are not, I would need your help. Look for the following lines in the latexdiff source code:

  my $word_ja='\p{Han}+|\p{InHiragana}+|\p{InKatakana}+';
  my $word='(?:' . $word_ja . '|(?:(?:[-\w\d*]|\\\\[\"\'\`~^][A-Za-z\*])(?!(?:' . $word_ja . ')))+)';

(if you can't find them, then your version is definitely too old). You can add more symbols to the definition of $word_ja (which should really be called $word_cjk, I guess), but you would have to look up the Perl pattern-matching string that describes them.
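To see which of these classes a given character falls into, one can check its code point against the corresponding Unicode blocks. The ranges below are approximations for illustration; the real `\p{Han}` is a script property and also matches extension and compatibility blocks:

```python
# Approximate Python stand-ins for the Perl classes in latexdiff's $word_ja.
# The block ranges are assumptions for illustration, not the full properties.
CLASSES = {
    "Han": range(0x4E00, 0xA000),        # CJK Unified Ideographs
    "InHiragana": range(0x3040, 0x30A0),
    "InKatakana": range(0x30A0, 0x3100),
}

def classify(ch):
    """Names of the classes (if any) whose block contains ch."""
    return [name for name, r in CLASSES.items() if ord(ch) in r]

print(classify("一"), classify("二"))  # ['Han'] ['Han']  (Chinese text is Han)
print(classify("あ"), classify("ア"))  # ['InHiragana'] ['InKatakana']
```

Chinese characters such as 一 and 二 are indeed Han, so they are matched by the first alternative of $word_ja.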

If you have the lines above in your latexdiff, your script is Han, and it still does not work, then let me know. As copy/paste in a non-native encoding is tricky, could I ask you to make the MWE files available for download rather than copy/pasting them? Ideally, provide me with both the old and the new file, e.g. put them into a zip file and attach it here.

ftilmann commented 2 years ago

On the first reported issue (overlong lines): unfortunately, this is a limitation of the ulem package, which is used for underlining. You could try taking this to the ulem maintainers, but there is nothing I can do about it in latexdiff. A workaround is to choose another highlighting style with the -t option, e.g. CFONT.

xlucn commented 2 years ago

Thanks for the reply and effort! I will try the latest code later.

As for the character issue, there is nothing wrong with your browser or font at all. "一" is one and "二" is two in Chinese, respectively (which is also fairly obvious). I thought those characters would showcase my expectation better; apparently that has some pitfalls :)

xlucn commented 2 years ago

I tried the current code. It's not quite working for Chinese, though.

First, the code change you mentioned is six years old. I am using the texlive-core package on Arch Linux, which updates frequently to pick up new versions of its packages, so I was already using a more recent version of latexdiff.

As for the code, I think it is only a small improvement, and for Japanese only. Japanese uses three different character sets: Katakana, Hiragana, and Kanji, with Kanji being basically the same as Chinese characters (hence, I guess, \p{Han}). Thus, that change means a Japanese sentence will additionally be split into words at boundaries between the different character sets.

However, Chinese characters all fall into the same class, so for Chinese text the result is the same whether or not the change is made.


I followed the code and tried the following change:

-  my $word_ja='\p{Han}+|\p{InHiragana}+|\p{InKatakana}+';
+  my $word_ja='\p{Han}|\p{InHiragana}|\p{InKatakana}';

This should make word splitting character-based for CJ (not K). I tried it on my actual document, and it works quite well (I still have to work out why the deletion font is smaller, though (Edit: okay, that's just how CFONT works)):

[Screenshot attached: Screenshot_2022-03-06_22-41-34]
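The effect of dropping the `+` quantifiers can be sketched in Python; as before, explicit block ranges stand in for the Perl properties, and this is an illustration rather than latexdiff's actual code. Note that non-CJK words are unaffected, only CJK runs are split up:

```python
import re

# Assumed approximations of the Perl classes (the real \p{Han} covers more):
word_ja_old = r'[\u4e00-\u9fff]+|[\u3040-\u309f]+|[\u30a0-\u30ff]+'  # with +
word_ja_new = r'[\u4e00-\u9fff]|[\u3040-\u309f]|[\u30a0-\u30ff]'     # without +
latin = r'|[A-Za-z0-9]+'

text = "用 latexdiff 比较文件"
print(re.findall(word_ja_old + latin, text))  # ['用', 'latexdiff', '比较文件']
print(re.findall(word_ja_new + latin, text))  # ['用', 'latexdiff', '比', '较', '文', '件']
```

With the `+` removed, every CJK character becomes its own "word", so the diff can localize changes to single characters, while Latin identifiers like `latexdiff` still tokenize as whole words.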

xlucn commented 2 years ago

@ftilmann, apologies if I'm bombarding you with notifications/emails. I just found that I am not the first to come up with this issue; there is another one regarding Japanese: #145.

Thus, I propose that my character-based word-splitting strategy be made an option, whether default or not, instead of just changing the behavior for good. It would be better if the new behavior were the default, since CJK does not separate words with spaces at all; users could then choose the old behavior if they are not happy with it.

The matching regex needs improving, though; it currently matches part of the CJK range, but not all of it.
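As a rough illustration of "part but not all": common CJK punctuation, fullwidth forms, and Hangul fall outside the three classes in $word_ja, since they belong to other scripts or blocks. The examples below are illustrative, not an exhaustive inventory:

```python
# Hedged sketch: ranges approximating the classes latexdiff currently uses,
# versus some common CJK-context characters that fall outside all of them.
COVERED = [range(0x4E00, 0xA000),   # CJK Unified Ideographs (~ \p{Han} core)
           range(0x3040, 0x30A0),   # Hiragana
           range(0x30A0, 0x3100)]   # Katakana

MISSED_EXAMPLES = {
    "，": "fullwidth comma (U+FF0C, Halfwidth and Fullwidth Forms)",
    "。": "ideographic full stop (U+3002, CJK Symbols and Punctuation)",
    "한": "Hangul syllable (U+D55C, Hangul Syllables)",
}

def covered(ch):
    return any(ord(ch) in r for r in COVERED)

for ch, desc in MISSED_EXAMPLES.items():
    print(ch, desc, "covered:", covered(ch))  # covered: False for all three
```

A fuller pattern would need to decide how to treat such characters (as their own tokens, as word separators, or as part of neighboring words).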

ftilmann commented 2 years ago

OK, I have now implemented this change, i.e., character-based processing. From the two issues it seems to me that character-based processing is practically always going to be the right thing to do, and the closest equivalent to word-wise processing in phonetic languages (two-character "words" are an edge case, but I think that is unavoidable). So for now I have just changed the behaviour rather than introducing an option as you suggested; the reason is that, in the end, many options make the program and manual quite complex. If people complain and want the old behaviour back, I can still add the option.

Thank you for checking this out and also for reminding me of the old issue.
(Are you a seismologist? I noticed, with some help from Google Translate, that the earlier sample text you had shared (now edited out again) referred to P/S travel times and, I think, J-B tables.)

xlucn commented 2 years ago

Agreed.

Oh, the sample text is from https://github.com/ftilmann/latexdiff/commit/5ad40746e49a91fec7a9490a2ff485f347fa213f; I did not mean to remove the zip file along with the comment text, though. I am Chinese; the sample is just me figuring out what the aforementioned commit was doing :)