ftilmann / latexdiff

Compares two latex files and marks up significant differences between them. Releases on www.ctan.org and mirrors
GNU General Public License v3.0

Complex regular subexpression recursion limit (65534) exceeded #247

Closed ORippler closed 2 years ago

ORippler commented 2 years ago

When running latexdiff on large tikzpicture files (with lots of data), the regex recursion limit is apparently reached somewhere in the latexdiff script, leading to garbled / broken output.

MWE

tex-file 1:

\documentclass{article}

% PGFPlots is used for drawing some of the charts
\usepackage{pgfplots}
\usepgfplotslibrary{groupplots}
\usetikzlibrary{positioning}
\pgfplotsset{compat=newest}
\usepackage{array, tabularx}            
\pgfmathdeclarefunction{invgauss}{4}{%
  \pgfmathparse{(sqrt(-2*ln(#1))*cos(deg(2*pi*#2))*#4) + #3}%
}

\begin{document}
    \input{fig_new}
\end{document}

tex-file 2:

\documentclass{article}

% PGFPlots is used for drawing some of the charts
\usepackage{pgfplots}
\usepgfplotslibrary{groupplots}
\usetikzlibrary{positioning}
\pgfplotsset{compat=newest}
\usepackage{array, tabularx}            
\pgfmathdeclarefunction{invgauss}{4}{%
  \pgfmathparse{(sqrt(-2*ln(#1))*cos(deg(2*pi*#2))*#4) + #3}%
}

\begin{document}
    \input{fig_old}
\end{document}

Generating the diff between the two files via latexdiff --flatten $1 $2 > broken.tex leads to

Complex regular subexpression recursion limit (65534) exceeded at /usr/bin/latexdiff line 2220, <DATA> line 79167.
Complex regular subexpression recursion limit (65534) exceeded at /usr/bin/latexdiff line 2220, <DATA> line 79167.
Complex regular subexpression recursion limit (65534) exceeded at /usr/bin/latexdiff line 3297, <DATA> line 79167.

and a garbled output when compiling the generated tex file.

fig_new.txt fig_old.txt

Note that I had to append .txt to the figure files used in the MWE so that GitHub would accept the upload.

Possible solutions

According to the perldocs, we should refactor the affected regex to use a loop or other structures: https://perldoc.perl.org/perldiag#Complex-regular-subexpression-recursion-limit-(%25d)-exceeded

Also, since the figures consist largely of tables of numbers, it may be that a regex with nested quantifiers causes unnecessary recursion/overlaps and could be optimized (note that I haven't looked at the Perl code in question, since I am not familiar with Perl): https://stackoverflow.com/a/34200869

Versions

root@15be59c1532c:/workdir# latexdiff --version
This is LATEXDIFF 1.3.1.1 (Algorithm::Diff 1.15 so, Perl v5.32.1)
  (c) 2004-2020 F J Tilmann

For full reproducibility, here is also the Dockerfile I use for LaTeX:

# TL2017 was the TexLive version used before 2021-11-01, and can be specified via
# --build-arg BASE_IMAGE=registry.gitlab.com/islandoftex/images/texlive:TL2017-historic
ARG BASE_IMAGE=registry.gitlab.com/islandoftex/images/texlive:TL2021-2021-10-31-04-05
FROM $BASE_IMAGE
# need to specify BASE_IMAGE again, as it is outside the build stage if it appears before FROM
# https://docs.docker.com/engine/reference/builder/#understand-how-arg-and-from-interact
ARG BASE_IMAGE

# Also install mscorefonts & ghostscript (required for e.g. Arial and compatibility with .eps files), adapted from:
# https://github.com/captnswing/msttcorefonts/blob/master/Dockerfile
# https://www.reddit.com/r/LaTeX/comments/ok3n3t/getting_errors_when_including_graphics/
ENV DEBIAN_FRONTEND noninteractive

RUN apt-get update \
    && apt-get install -y --no-install-recommends software-properties-common ghostscript libunicode-linebreak-perl libfile-homedir-perl libyaml-tiny-perl \
    && apt-add-repository contrib \
    && apt-get update

# Increase memory limit of pdflatex to maximum to facilitate compilation of
# huge latex files
# Refer: https://tex.stackexchange.com/a/7954
RUN if [ "$BASE_IMAGE" = "registry.gitlab.com/islandoftex/images/texlive:TL2017-historic" ] ; \
    then echo "main_memory = 12435455" >> /usr/local/texlive/2017/texmf.cnf && fmtutil-sys --all ; \
    else echo "main_memory = 12435455" >> /usr/local/texlive/2021/texmf.cnf && fmtutil-sys --all ; \
    fi

# If you want to use Microsoft fonts in reports, you must install the fonts
# (Andale Mono, Arial Black, Arial, Comic Sans MS, Courier New, Georgia, Impact,
# Times New Roman, Trebuchet, Verdana, Webdings)
RUN echo "ttf-mscorefonts-installer msttcorefonts/accepted-mscorefonts-eula select true" | debconf-set-selections \
    && apt-get install -y --no-install-recommends fontconfig ttf-mscorefonts-installer
ADD localfonts.conf /etc/fonts/local.conf
RUN fc-cache -f -v

WORKDIR /workdir

which can be started via docker run --rm -it --volume "`pwd`:/workdir" $IMAGE:latest /bin/bash

ftilmann commented 2 years ago

Sorry for the much delayed reaction. Line 2220, where the error occurs, is where the input body text is split into tokens. As this part is fast for normal texts and takes only a small fraction of the overall time, I don't expect there is a major problem with the way the regexes are set up. The problem could be the sheer size of the files or, more likely, that they do not conform to what latexdiff expects of LaTeX outside picture environments. If you are already running into trouble when tokenizing your input files, you will certainly be in trouble when trying to run the diff algorithm.

I am not quite sure what you are trying to do here. Running latexdiff on image data does not make sense, as changes there cannot be marked up graphically the way text changes can. latexdiff is designed for highlighting changes in text, and tries to do a reasonable job for equations (though it does not always succeed). Skip the --flatten option and everything will probably work fine (you will just see the new version of the picture, of course), although obviously that does not make sense for your MWE. If you need the --flatten option because of context missing from the MWE, you can use \newcommand to define an alias for \input which is unknown to latexdiff and thus will not be flattened.

I will close this issue for now, as it's unclear what you want to achieve. Debugging the regex to avoid this error on this large tikzpicture would probably be a lot of work, and I cannot see the usefulness right now. If I misunderstood, please feel free to reopen and describe your use case better or, ideally, provide a smaller file that triggers the recursion limit.
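
The \input alias mentioned above could look roughly like the following minimal sketch, applied to the MWE from the report. The macro name \figinput is hypothetical (any name latexdiff does not know should do). The sketch assumes that --flatten only expands literal \input{...} and \include{...} calls, which is why the alias is defined without writing \input followed directly by a braced file name.

```latex
\documentclass{article}
\usepackage{pgfplots}
\pgfplotsset{compat=newest}

% Hypothetical alias: \figinput expands to \input when the document is
% compiled, but latexdiff --flatten is assumed not to recognise it.
\newcommand{\figinput}{\input}

\begin{document}
    % The large tikzpicture stays in its own file and is not flattened
    \figinput{fig_new}
\end{document}
```

The old revision would use \figinput{fig_old} in the same way. Under the same assumption, regular \input{...} calls elsewhere in the document would still be flattened, so only the large figure files are kept away from the tokenizer.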