cmhughes / latexindent.pl

Perl script to add indentation (leading horizontal space) to LaTeX files. It can modify line breaks before, during and after code blocks; it can perform text wrapping and paragraph line break removal. It can also perform string-based and regex-based substitutions/replacements. The script is customisable through its YAML interface.
GNU General Public License v3.0

Weird indentation and memory hogging #279

Closed petercorke closed 3 years ago

petercorke commented 3 years ago

Please provide the following when posting an issue:

original .tex code

It's a big file, a book manuscript that I can't post publicly but happy to share privately.

Here's a snippet, and the corresponding output is given further down. Even this snippet in a file by itself doesn't format right, so I'm guessing I'm doing something wrong.

\subsection{Images from Files}\label{sec:12.1.1}

We start with images stored in files since it is very likely that you already have lots of images stored on your computer. In this chapter we will work with some images provided with the Toolbox, but you can easily substitute your own images. We import an image into the MATLAB workspace using the Toolbox function \lstinline{iread}\FUNCTION{iread}
\begin{lstlisting}
>> street = iread('street.png');
\end{lstlisting}
which returns a matrix\FUNCTION{about}
\begin{lstlisting}
>> about(street)
street [uint8] : 851x1280 (1.1 MB)
\end{lstlisting}
that belongs to the class \FUNCTION{uint8}\lstinline{uint8} -- the elements of the matrix are unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as pixel values or \INDEX{grey value}grey values and are the gamma-encoded\SIDENOTE{Gamma encoding and decoding is discussed in \secref{sec:10.3.6}. Use the \lstinline{'gamma'} option for \lstinline{iread} to perform gamma decoding and obtain pixel values proportional to scene luminance.} luminance of that point in the original scene. For this 8-bit image the pixel values vary from 0 (darkest) to 255 (brightest). The image is shown in \figref{fig:12.1}{a}.

\begin{figure}[t!]
\includegraphics{M_4_012_001}
\caption{The \lstinline{idisp} image browsing window. The top right shows the coordinate and value of the last pixel clicked on the image. The \textit{buttons} at the top left allow the pixel values along a line to be plotted, a \INDEX{histogram}histogram to be displayed, or the image to be zoomed. \textbf{a}~Greyscale image; \textbf{b}~color image}
\label{fig:12.1}
\end{figure}

yaml settings

modifyLineBreaks: 
   preserveBlankLines: 1
   condenseMultipleBlankLinesInto: 1
   textWrapOptions: 
       columns: 100 

actual/given output

\subsection{Images from Files}\label{sec:12.1.1}

We start with images stored in files since it is very likely that you already have lots of images
stored on your computer. In this chapter we will work with some images provided with the Toolbox,
but you can easily substitute your own images. We import an image into the MATLAB workspace using
the Toolbox function \lstinline{iread}\FUNCTION{iread}
\begin{lstlisting}
>> street = iread('street.png');
\end{lstlisting}
which returns a matrix\FUNCTION{about}
\begin{lstlisting}
>> about(street)
street [uint8] : 851x1280 (1.1 MB)
\end{lstlisting}
that belongs to the class \FUNCTION{uint8}\lstinline{uint8} -- the elements of the matrix are unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as pixel values or \INDEX{grey value}grey values and are
the gamma-encoded\SIDENOTE{Gamma encoding and decoding is discussed in \secref{sec:10.3.6}. Use the
    \lstinline{'gamma'} option for \lstinline{iread} to perform gamma decoding and obtain pixel values proportional to
scene luminance.} luminance of that point in the original scene. For this 8-bit image the pixel
values vary from 0 (darkest) to 255 (brightest). The image is shown in \figref{fig:12.1}{a}

The line after the second \end{lstlisting} is very long in the input. On output it has been broken/formatted oddly: the first line is very long, the second is OK, the third is indented.

A similar thing happens with the caption.

desired or expected output

anything else

It's a big file that is included into a main file, so there is no preamble. I use custom commands/macros, but they are pretty simple: each just takes a single argument.

The whole file is ~1900 lines long and it just hangs. After a couple of minutes it's pushing 10GB of memory. Even with -tt nothing is written to the log file. If I take the first half of the file it works quite quickly, but the formatting is not what I was expecting, see above.

I had hoped that slicing/dicing the file would work but there's something in the file that is throwing it.

I have tried removing or translating Unicode chars to ASCII but that didn't help.

% latexindent -v
3.9.3, 2021-05-07

% perl -v

This is perl 5, version 34, subversion 0 (v5.34.0) built for darwin-thread-multi-2level

Copyright 1987-2021, Larry Wall

running on macOS Catalina.

cmhughes commented 3 years ago

Thanks for this. I'll take a proper look at this over the next few days.

In the meantime, how does the script behave on your big file without the -m switch active?

The -m switch is a known memory hog.

On Tue, 6 Jul 2021, 05:56 Peter Corke, @.***> wrote:

A similar thing happens with the caption.

desired or expected output

  • no lines longer than the requested 100 chars
  • no indents in the middle of blocks
  • no memory hogging


petercorke commented 3 years ago

It works fine, but the line splitting is really what I seek. The LaTeX has been spat out of an automatic translator and it's a mess...

cmhughes commented 3 years ago

I'll study it over the next few days, but in the meantime see my explanation at https://github.com/cmhughes/latexindent.pl/issues/228

petercorke commented 3 years ago

SEE THE NEXT COMMENT, IT'S CLEARER

That perCodeBlockBasis setting made only a very small difference, which isn't important. But using it

modifyLineBreaks: 
   preserveBlankLines: 1
   condenseMultipleBlankLinesInto: 1
   textWrapOptions: 
       columns: 100 
       perCodeBlockBasis: 1
       all: 1

I made an interesting discovery:

minimum problem text

that belongs to the class \FUNCTION{uint8}\lstinline{uint8} -- the elements of the matrix are unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as pixel values or \INDEX{grey value}grey values and are
the gamma-encoded\SIDENOTE{Gamma encoding and decoding is discussed in \secref{sec:10.3.6}. Use the
    \lstinline{'gamma'} option for \lstinline{iread} to perform gamma decoding and obtain pixel values
proportional to scene luminance.} luminance of that point in the original scene. For this 8-bit
image the pixel values vary from 0 (darkest) to 255 (brightest). The image is shown in
\figref{fig:12.1}{a}.

Just changing the first \lstinline command to \FUNCTION dramatically changed the result:

that belongs to the class \FUNCTION{uint8}\FUNCTION{uint8} -- the elements of the matrix
are unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as
pixel values or \INDEX{grey value}grey values and are the gamma-encoded\SIDENOTE{Gamma encoding and decoding is discussed in \secref{sec:10.3.6}. Use the
    \lstinline{'gamma'} option for \lstinline{iread} to
perform gamma decoding and obtain pixel values proportional to scene luminance.} luminance of that
point in the original scene. For this 8-bit image the pixel values vary from 0 (darkest) to 255
(brightest). The image is shown in \figref{fig:12.1}{a}.

Is the indented 4th line due to it being part of the very long \SIDENOTE command?

Changing all instances I get

that belongs to the class \FUNCTION{uint8}\FUNCTION{uint8} -- the elements of the matrix
are unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as
pixel values or \INDEX{grey value}grey values and are the gamma-encoded\SIDENOTE{Gamma encoding and decoding is discussed in \secref{sec:10.3.6}. Use the \FUNCTION{'gamma'}
    option for \FUNCTION{iread} to perform gamma decoding and obtain pixel values proportional to
    scene luminance.}
luminance of that point in the original scene. For this 8-bit image the pixel values vary from 0
(darkest) to 255 (brightest). The image is shown in \figref{fig:12.1}{a}.

which is looking pretty decent. I like what it's done with the SIDENOTE argument, but what does it have against lstinline?

petercorke commented 3 years ago

Here's a clearer example. The problem long line is:

that belongs to the class \FUNCTION{uint8}\lstinline{uint8} -- the elements of the matrix are unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as pixel values or \INDEX{grey value}grey values and are the gamma-encoded\SIDENOTE{Gamma encoding and decoding is discussed in \secref{sec:10.3.6}. Use the \lstinline{'gamma'} option for \lstinline{iread} to perform gamma decoding and obtain pixel values proportional to scene luminance.} luminance of that point in the original scene. For this 8-bit image the pixel values vary from 0 (darkest) to 255 (brightest). The image is shown in \figref{fig:12.1}{a}.

using settings

modifyLineBreaks: 
   textWrapOptions: 
       columns: 100 

it gets formatted as:

     1  that belongs to the class \FUNCTION{uint8}\lstinline{uint8} -- the elements of the matrix are unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as pixel values or \INDEX{grey value}grey values and are the
     2  gamma-encoded\SIDENOTE{Gamma encoding and decoding is discussed in \secref{sec:10.3.6}. Use the
     3      \lstinline{'gamma'} option for \lstinline{iread} to perform gamma decoding and obtain pixel values proportional to
     4  scene luminance.} luminance of that point in the original scene. For this 8-bit image the pixel
     5  values vary from 0 (darkest) to 255 (brightest). The image is shown in \figref{fig:12.1}{a}.

I ran it through cat -n to make the line breaks clearer. There are 5 lines; line 1 is > 200 chars long and line 3 is 115 chars long.

But if I just change \lstinline to \foobarbaz, a command with the same number of letters, I get instead:

     1  that belongs to the class \FUNCTION{uint8}\foobarbaz{uint8} -- the elements of the matrix are
     2  unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as pixel values or
     3  \INDEX{grey value}grey values and are the gamma-encoded\SIDENOTE{Gamma encoding and decoding is
     4      discussed in \secref{sec:10.3.6}. Use the \foobarbaz{'gamma'} option for \foobarbaz{iread} to
     5      perform gamma decoding and obtain pixel values proportional to scene luminance.} luminance of that
     6  point in the original scene. For this 8-bit image the pixel values vary from 0 (darkest) to 255
     7  (brightest). The image is shown in \figref{fig:12.1}{a}.

which is what I was hoping for. I grepped the source files for latexindent and saw no mention of lstinline so there's no special case going awry.
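For anyone wanting to repeat the line-length check without counting by hand, here is a minimal sketch, assuming Python 3 is available. The function name `overlong_lines` is made up for this illustration and is not part of latexindent.pl; it simply reports every line of some text that is longer than the requested number of columns.

```python
from typing import List, Tuple


def overlong_lines(text: str, columns: int = 100) -> List[Tuple[int, int]]:
    """Return (line number, length) for every line longer than `columns`.

    Line numbers start at 1, matching the numbering printed by `cat -n`.
    """
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > columns
    ]


# Possible usage on a wrapped file (the file name is only an example):
#   with open("output.tex", encoding="utf-8") as handle:
#       for number, length in overlong_lines(handle.read(), columns=100):
#           print(f"line {number}: {length} chars")
```

Note that this counts characters, not display width, so it is only a rough check for text containing wide or combining characters.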

cmhughes commented 3 years ago

Hello, thanks for your patience.

performance (memory hogging)

On performance: as of https://github.com/cmhughes/latexindent.pl/commit/b7790e8707eb261fbe74f1afbc5e44f7e17a3871, the -m switch routine is a lot more efficient. Benchmark details are given in https://github.com/cmhughes/latexindent.pl/issues/268; it'll be part of the next release, but you're welcome to pull from develop in the meantime.

textWrap

For your example, the relevant setting is the verbatimCommands field: https://latexindentpl.readthedocs.io/en/latest/sec-default-user-local.html#lst-verbatimcommands

So, if we use:

modifyLineBreaks: 
   textWrapOptions: 
       columns: 100 
verbatimCommands:
    lstinline: 0

then we receive

that belongs to the class \FUNCTION{uint8}\lstinline{uint8} -- the elements of the matrix are
unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as pixel values or
\INDEX{grey value}grey values and are the gamma-encoded\SIDENOTE{Gamma encoding and decoding is
    discussed in \secref{sec:10.3.6}. Use the \lstinline{'gamma'} option for \lstinline{iread} to
    perform gamma decoding and obtain pixel values proportional to scene luminance.} luminance of that
point in the original scene. For this 8-bit image the pixel values vary from 0 (darkest) to 255
(brightest). The image is shown in \figref{fig:12.1}{a}.

general textWrap review

As in https://github.com/cmhughes/latexindent.pl/issues/228 I think that the textWrap routine could use some review. It'll be one of the next priorities.

Does this make sense?

petercorke commented 3 years ago

Thanks for that, I'm not sure I do understand, but I do get that it gives the result I want :)

You've declared lstinline as a verbatim field here and I'm guessing that 0 is a poly-switch (the doco at the reference you give doesn't explicitly say) that is turning off line breaking for this command. That makes it equivalent to foobarbaz which it didn't know about. So somewhere inside latexindent it does in fact know about lstinline.

The pieces of verbatim text are quite small, so for the problem paragraph I still don't understand why this option had such an outsized effect on the line breaks; there is no verbatim text near the split point.

Happy news is it works for the whole chapter: 1900 output lines, and it took ~40 seconds, which is fine for such one-off conversions. I did try to run the dev version you mentioned, but I didn't see a speedup. Big caveat though, I'm not a perl person and I might have screwed up the module paths, using some stuff from the Mac brew installation and some from the GH clone nailed together in an understanding-free manner.

Thanks for your time in answering, and for all the effort you've put into this tool. It is awesome.

cmhughes commented 3 years ago

Thanks for the follow-up.

about lstinline

By default, lstinline is considered a verbatimCommand; this means that latexindent.pl replaces it with a 'verbatim token', and the body of your text looks like the following:

that belongs to the class LTXIN-TK-COMMAND1-ENDLTXIN-TK-VERBATIM1-ENDgrey value}grey values and are the\ngamma-encodedLTXIN-TK-COMMAND4-END to perform gamma decoding and obtain pixel values proportional to\nscene luminance.} luminance of that point in the original scene. For this 8-bit image the pixel\nvalues vary from 0 (darkest) to 255 (brightest). The image is shown in LTXIN-TK-COMMAND3-END.\n

You can check this for yourself, if you like, by using the -tt switch and then examining the log file, indent.log. I don't recommend using the -tt switch for anything other than debugging/exploration, it slows things down a lot and makes indent.log very big.

Note that the above text is not ideal: the lstinline routine looks for \lstinline immediately followed by 'something', which in your case is {, and then takes the body as anything up to the next occurrence of {. The documentation needs to be updated to make this clearer.
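The matching behaviour described above can be reproduced with a small sketch. The pattern below is an assumption written for illustration in Python, not the regex latexindent.pl actually uses: it treats the character right after \lstinline as a delimiter and takes everything up to the next occurrence of that same character, which is how \lstinline!code! is normally delimited.

```python
import re

# Hypothetical pattern for illustration only; NOT copied from latexindent.pl.
# Group 1 is the delimiter character, group 2 the (lazily matched) body,
# and \1 requires the same delimiter character to close the match.
VERBATIM_LIKE = re.compile(r"\\lstinline(.)(.*?)\1", re.DOTALL)


def tokenize_verbatim(text: str, token: str = "<VERBATIM>") -> str:
    """Replace the first verbatim-like match with a placeholder token."""
    return VERBATIM_LIKE.sub(lambda _match: token, text, count=1)


sample = r"\FUNCTION{uint8}\lstinline{uint8} -- pixel values or \INDEX{grey value}grey values"
print(tokenize_verbatim(sample))
# prints: \FUNCTION{uint8}<VERBATIM>grey value}grey values
```

With { as the delimiter, the match runs on to the { of \INDEX{, which is consistent with the token text shown above, where the verbatim token is followed directly by "grey value}".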

By using

verbatimCommands:
    lstinline: 0

we tell latexindent.pl not to treat lstinline as a verbatim command and instead, the body of text looks like

that belongs to the class LTXIN-TK-COMMAND1-ENDLTXIN-TK-COMMAND2-END -- the elements of the matrix are\nunsigned 8-bit integers in the interval LTXIN-TK-SPECIAL1-END. The elements are referred to as pixel values or\nLTXIN-TK-COMMAND3-ENDgrey values and are the gamma-encodedLTXIN-TK-COMMAND8-END luminance of that\npoint in the original scene. For this 8-bit image the pixel values vary from 0 (darkest) to 255\n(brightest). The image is shown in LTXIN-TK-COMMAND7-END.\n
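The effect on line lengths can be sketched as follows. This is a simplified model in Python, using the standard `textwrap` module as a stand-in for the real wrapping routine; the function `wrap_with_tokens` and the token format are invented for this illustration. The text is wrapped while short tokens stand in for protected spans, and the spans are restored afterwards, so a line that fitted within the column limit during wrapping can end up much longer.

```python
import textwrap
from typing import Dict, Sequence


def wrap_with_tokens(text: str, protected_spans: Sequence[str], columns: int = 100) -> str:
    """Wrap `text`, treating each protected span as one short, unbreakable token."""
    tokens: Dict[str, str] = {}
    for index, span in enumerate(protected_spans, start=1):
        token = f"TK-{index}-END"  # illustrative token format, an assumption
        tokens[token] = span
        text = text.replace(span, token, 1)

    # Wrapping only ever sees the short tokens, never the spans they hide.
    wrapped = textwrap.fill(
        text, width=columns, break_long_words=False, break_on_hyphens=False
    )

    # Restoring the spans can push a line well past `columns`.
    for token, span in tokens.items():
        wrapped = wrapped.replace(token, span, 1)
    return wrapped


span = " ".join(["verbatim"] * 12)  # 107 characters of protected text
text = "alpha beta " + span + " gamma delta"

unprotected = wrap_with_tokens(text, [], columns=40)
protected = wrap_with_tokens(text, [span], columns=40)
print(max(len(line) for line in unprotected.splitlines()))
print(max(len(line) for line in protected.splitlines()))
```

In this model the unprotected text stays within 40 columns, while the protected version comes back as a single line far longer than 40, because the whole span was counted as one short token when the breaks were chosen.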

about performance

If you have the develop version on your machine, then I would expect you to see a significant improvement in performance. You can check which version has been used for your file by checking indent.log; the first few lines are the most relevant to this:

INFO:  latexindent.pl version 3.10, 2021-06-19, a script to indent .tex files
       latexindent.pl lives here: /home/cmhughes/projects/latexindent/
       Sun Jul 11 07:40:53 2021
       Filename: issue-279.tex
INFO:  Processing switches:
       -l|--localSettings: Read localSettings YAML file
       -m|--modifylinebreaks: modify line breaks
INFO:  Directory for backup files and indent.log: .
INFO:  Perl modules are being loaded from the following directories:
       /home/cmhughes/perl5/perlbrew/perls/perl-5.30.3/lib/5.30.3/FindBin.pm
       /home/cmhughes/perl5/perlbrew/perls/perl-5.30.3/lib/site_perl/5.30.3/YAML/Tiny.pm
       /home/cmhughes/perl5/perlbrew/perls/perl-5.30.3/lib/5.30.3/File/Copy.pm
       /home/cmhughes/perl5/perlbrew/perls/perl-5.30.3/lib/5.30.3/File/Basename.pm
       /home/cmhughes/perl5/perlbrew/perls/perl-5.30.3/lib/5.30.3/Getopt/Long.pm
       /home/cmhughes/perl5/perlbrew/perls/perl-5.30.3/lib/site_perl/5.30.3/File/HomeDir.pm
       /home/cmhughes/perl5/perlbrew/perls/perl-5.30.3/lib/site_perl/5.30.3/x86_64-linux/Unicode/GCString.pm
INFO:  LatexIndent perl modules are being loaded from, for example:
       /home/cmhughes/projects/latexindent/LatexIndent/Document.pm
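If checking the log by eye is inconvenient, the version line can also be picked out with a short script. This is a minimal sketch in Python that assumes the log's first line has the same shape as the excerpt above; the function name `logged_version` is made up for this example, and other releases may word the line differently.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "latexindent.pl version 3.10, 2021-06-19, a script to ..."
# The format is inferred from the log excerpt only, so treat it as an assumption.
VERSION_LINE = re.compile(r"latexindent\.pl version (\S+), (\d{4}-\d{2}-\d{2})")


def logged_version(log_text: str) -> Optional[Tuple[str, str]]:
    """Return (version, release date) from indent.log contents, or None if absent."""
    match = VERSION_LINE.search(log_text)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Possible usage:
#   with open("indent.log", encoding="utf-8") as handle:
#       print(logged_version(handle.read()))
```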

I hope this helps :)

petercorke commented 3 years ago

Assuming that the verbatim extends to the next { makes it seem much longer than it actually is. Also if there was a { inside the verbatim text it would be truncated. Nested brackets are hard/impossible to do with regexps.
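Nested braces are usually handled by counting depth rather than with a plain regex. Here is a minimal sketch of that idea in Python; the function `braced_argument` is hypothetical and is not how latexindent.pl works. It also assumes the argument has balanced braces, which genuine verbatim content need not have.

```python
def braced_argument(text: str, start: int) -> str:
    """Return the contents of the {...} group that opens at text[start].

    Nested groups are honoured by tracking depth; a backslash skips the
    following character so that \\{ and \\} are not counted as braces.
    """
    if start >= len(text) or text[start] != "{":
        raise ValueError("expected '{' at position start")
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start + 1 : i]
        i += 1
    raise ValueError("unbalanced braces")


sample = r"\SIDENOTE{Use \lstinline{iread} here.} tail"
print(braced_argument(sample, sample.index("{")))
# prints: Use \lstinline{iread} here.
```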

Thanks for the tip about which versions of code are loaded, that's not urgent for me now, but I'll have a tinker.

cmhughes commented 3 years ago

Following the details above, explicitly https://github.com/cmhughes/latexindent.pl/commit/b7790e8707eb261fbe74f1afbc5e44f7e17a3871, I'm going to label this as implemented. It'll be noted in the next release; I'll leave this open until released (hopefully soon).

cmhughes commented 3 years ago

Released as of https://github.com/cmhughes/latexindent.pl/releases/tag/V3.10.1 and uploaded to CTAN; you should be able to get it using your TeX distribution manager within about 24 hours, CTAN allowing.