Closed petercorke closed 3 years ago
Thanks for this. I'll take a proper look at this over the next few days.
In the meantime, how does the script behave on your big file without the m switch active?
The m switch is a known memory hog.
On Tue, 6 Jul 2021, 05:56 Peter Corke, @.***> wrote:
Please provide the following when posting an issue: original .tex code
It's a big file, a book manuscript that I can't post publicly but happy to share privately.
Here's a snippet, and the corresponding output is given further down. Even this snippet in a file by itself doesn't format right, so I'm guessing I'm doing something wrong.
\subsection{Images from Files}\label{sec:12.1.1}
We start with images stored in files since it is very likely that you already have lots of images stored on your computer. In this chapter we will work with some images provided with the Toolbox, but you can easily substitute your own images. We import an image into the MATLAB workspace using the Toolbox function \lstinline{iread}\FUNCTION{iread}\begin{lstlisting}
street = iread('street.png');\end{lstlisting} which returns a matrix\FUNCTION{about}\begin{lstlisting} about(street) street [uint8] : 851x1280 (1.1 MB)\end{lstlisting} that belongs to the class \FUNCTION{uint8}\lstinline{uint8} -- the elements of the matrix are unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as pixel values or \INDEX{grey value}grey values and are the gamma-encoded\SIDENOTE{Gamma encoding and decoding is discussed in \secref{sec:10.3.6}. Use the \lstinline{'gamma'} option for \lstinline{iread} to perform gamma decoding and obtain pixel values proportional to scene luminance.} luminance of that point in the original scene. For this 8-bit image the pixel values vary from 0 (darkest) to 255 (brightest). The image is shown in \figref{fig:12.1}{a}. \begin{figure}[t!]\includegraphics{M_4_012_001}\caption{The \lstinline{idisp} image browsing window. The top right shows the coordinate and value of the last pixel clicked on the image. The \textit{buttons} at the top left allow the pixel values along a line to be plotted, a \INDEX{histogram}histogram to be displayed, or the image to be zoomed. \textbf{a}~Greyscale image; \textbf{b}~color image}\label{fig:12.1}\end{figure}
yaml settings
modifyLineBreaks:
    preserveBlankLines: 1
    condenseMultipleBlankLinesInto: 1
    textWrapOptions:
        columns: 100
actual/given output
\subsection{Images from Files}\label{sec:12.1.1}
We start with images stored in files since it is very likely that you already have lots of images stored on your computer. In this chapter we will work with some images provided with the Toolbox, but you can easily substitute your own images. We import an image into the MATLAB workspace using the Toolbox function \lstinline{iread}\FUNCTION{iread} \begin{lstlisting}
street = iread('street.png'); \end{lstlisting} which returns a matrix\FUNCTION{about} \begin{lstlisting} about(street) street [uint8] : 851x1280 (1.1 MB) \end{lstlisting} that belongs to the class \FUNCTION{uint8}\lstinline{uint8} -- the elements of the matrix are unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as pixel values or \INDEX{grey value}grey values and are the gamma-encoded\SIDENOTE{Gamma encoding and decoding is discussed in \secref{sec:10.3.6}. Use the \lstinline{'gamma'} option for \lstinline{iread} to perform gamma decoding and obtain pixel values proportional to scene luminance.} luminance of that point in the original scene. For this 8-bit image the pixel values vary from 0 (darkest) to 255 (brightest). The image is shown in \figref{fig:12.1}{a}
The line after the second \end{lstlisting} is very long in the input. On output it has been broken/formatted oddly: the first line is very long, the second is OK, and the third is indented.
A similar thing happens with the caption.
desired or expected output
- no lines longer than the requested 100 chars
- no indents in the middle of blocks
- no memory hogging
anything else
It's a big file, that is included into a main, so there is no preamble. I use custom commands/macros but they are pretty simple, just take a single argument.
The whole file is ~1900 lines long and it just hangs. After a couple of minutes it's pushing 10 GB of memory. Even with -tt nothing is written to the log file. If I take the first half of the file it works quite quickly but the formatting is not what I was expecting, see above.
I had hoped that slicing/dicing the file would work but there's something in the file that is throwing it.
I have tried removing or translating unicode chars to ASCII but that didn't help.
% latexindent -v 3.9.3, 2021-05-07
% perl -v
This is perl 5, version 34, subversion 0 (v5.34.0) built for darwin-thread-multi-2level
Copyright 1987-2021, Larry Wall
running on macOS Catalina.
It works fine, but the line splitting is really what I seek. The latex has been spat out of an automatic translator and it's a mess...
I'll study it over the next few days, but in the meantime see my explanation at https://github.com/cmhughes/latexindent.pl/issues/228
SEE THE NEXT COMMENT, IT'S CLEARER
That perCodeBlockBasis setting made very little difference, but that's not important. But using it with
modifyLineBreaks:
    preserveBlankLines: 1
    condenseMultipleBlankLinesInto: 1
    textWrapOptions:
        columns: 100
        perCodeBlockBasis: 1
        all: 1
I made an interesting discovery:
minimum problem text
that belongs to the class \FUNCTION{uint8}\lstinline{uint8} -- the elements of the matrix are unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as pixel values or \INDEX{grey value}grey values and are
the gamma-encoded\SIDENOTE{Gamma encoding and decoding is discussed in \secref{sec:10.3.6}. Use the
\lstinline{'gamma'} option for \lstinline{iread} to perform gamma decoding and obtain pixel values
proportional to scene luminance.} luminance of that point in the original scene. For this 8-bit
image the pixel values vary from 0 (darkest) to 255 (brightest). The image is shown in
\figref{fig:12.1}{a}.
Just changing the first lstinline command to FUNCTION dramatically changed the result:
that belongs to the class \FUNCTION{uint8}\FUNCTION{uint8} -- the elements of the matrix
are unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as
pixel values or \INDEX{grey value}grey values and are the gamma-encoded\SIDENOTE{Gamma encoding and decoding is discussed in \secref{sec:10.3.6}. Use the
\lstinline{'gamma'} option for \lstinline{iread} to
perform gamma decoding and obtain pixel values proportional to scene luminance.} luminance of that
point in the original scene. For this 8-bit image the pixel values vary from 0 (darkest) to 255
(brightest). The image is shown in \figref{fig:12.1}{a}.
Is the 4th line, indented, due to it being part of the very long \SIDENOTE command?
Changing all instances I get
that belongs to the class \FUNCTION{uint8}\FUNCTION{uint8} -- the elements of the matrix
are unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as
pixel values or \INDEX{grey value}grey values and are the gamma-encoded\SIDENOTE{Gamma encoding and decoding is discussed in \secref{sec:10.3.6}. Use the \FUNCTION{'gamma'}
option for \FUNCTION{iread} to perform gamma decoding and obtain pixel values proportional to
scene luminance.}
luminance of that point in the original scene. For this 8-bit image the pixel values vary from 0
(darkest) to 255 (brightest). The image is shown in \figref{fig:12.1}{a}.
which is looking pretty decent. I like what it's done with the SIDENOTE argument, but what does it have against lstinline?
Here's a clearer example. The problem long line is:
that belongs to the class \FUNCTION{uint8}\lstinline{uint8} -- the elements of the matrix are unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as pixel values or \INDEX{grey value}grey values and are the gamma-encoded\SIDENOTE{Gamma encoding and decoding is discussed in \secref{sec:10.3.6}. Use the \lstinline{'gamma'} option for \lstinline{iread} to perform gamma decoding and obtain pixel values proportional to scene luminance.} luminance of that point in the original scene. For this 8-bit image the pixel values vary from 0 (darkest) to 255 (brightest). The image is shown in \figref{fig:12.1}{a}.
using settings
modifyLineBreaks:
    textWrapOptions:
        columns: 100
it gets formatted as:
1 that belongs to the class \FUNCTION{uint8}\lstinline{uint8} -- the elements of the matrix are unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as pixel values or \INDEX{grey value}grey values and are the
2 gamma-encoded\SIDENOTE{Gamma encoding and decoding is discussed in \secref{sec:10.3.6}. Use the
3 \lstinline{'gamma'} option for \lstinline{iread} to perform gamma decoding and obtain pixel values proportional to
4 scene luminance.} luminance of that point in the original scene. For this 8-bit image the pixel
5 values vary from 0 (darkest) to 255 (brightest). The image is shown in \figref{fig:12.1}{a}.
I ran it through cat -n to make the line breaks clearer. There are 5 lines; line 1 is > 200 chars long and line 3 is 115 chars long.
But if I just change \lstinline to \foobarbaz, a command with the same number of letters, I get instead:
1 that belongs to the class \FUNCTION{uint8}\foobarbaz{uint8} -- the elements of the matrix are
2 unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as pixel values or
3 \INDEX{grey value}grey values and are the gamma-encoded\SIDENOTE{Gamma encoding and decoding is
4 discussed in \secref{sec:10.3.6}. Use the \foobarbaz{'gamma'} option for \foobarbaz{iread} to
5 perform gamma decoding and obtain pixel values proportional to scene luminance.} luminance of that
6 point in the original scene. For this 8-bit image the pixel values vary from 0 (darkest) to 255
7 (brightest). The image is shown in \figref{fig:12.1}{a}.
which is what I was hoping for. I grepped the source files for latexindent and saw no mention of lstinline, so there's no special case going awry.
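The line lengths quoted above were read off by eye from the cat -n output. A minimal Python sketch for checking them automatically is below; the function name `overlong_lines` and the sample text are made up for illustration and are not part of latexindent.

```python
def overlong_lines(text, limit=100):
    """Return (line_number, length) for every line longer than `limit` characters."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


# Illustrative sample: the second line is 115 characters long.
sample = "short line\n" + "x" * 115 + "\nanother short line"
print(overlong_lines(sample))
```

To check a formatted file, read it first (for example with `open(path, encoding="utf-8").read()`) and pass its contents to `overlong_lines`.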
Hello, thanks for your patience.
On the performance, as of https://github.com/cmhughes/latexindent.pl/commit/b7790e8707eb261fbe74f1afbc5e44f7e17a3871, the -m switch routine is a lot more efficient. There is a benchmark detail given in https://github.com/cmhughes/latexindent.pl/issues/268; it'll be part of the next release, but you're welcome to pull from develop in the meantime.
For your example, the relevant setting is the verbatimCommands field, described at https://latexindentpl.readthedocs.io/en/latest/sec-default-user-local.html#lst-verbatimcommands.
So, if we use:
modifyLineBreaks:
    textWrapOptions:
        columns: 100
verbatimCommands:
    lstinline: 0
then we receive
that belongs to the class \FUNCTION{uint8}\lstinline{uint8} -- the elements of the matrix are
unsigned 8-bit integers in the interval $[0,255]$. The elements are referred to as pixel values or
\INDEX{grey value}grey values and are the gamma-encoded\SIDENOTE{Gamma encoding and decoding is
discussed in \secref{sec:10.3.6}. Use the \lstinline{'gamma'} option for \lstinline{iread} to
perform gamma decoding and obtain pixel values proportional to scene luminance.} luminance of that
point in the original scene. For this 8-bit image the pixel values vary from 0 (darkest) to 255
(brightest). The image is shown in \figref{fig:12.1}{a}.
As in https://github.com/cmhughes/latexindent.pl/issues/228, I think that the textWrap routine could use some review. It'll be one of the next priorities.
Does this make sense?
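For anyone wanting to script this, here is a rough Python sketch of calling latexindent with the -m and -l switches and capturing the formatted text from standard output. The file names, the function names, and the assumption that the executable is on the PATH as `latexindent` are all illustrative; adjust them to your installation.

```python
import subprocess


def build_command(tex_file, settings_file, exe="latexindent"):
    """Build the argument list: -m modifies line breaks, -l names the local settings YAML."""
    return [exe, "-m", f"-l={settings_file}", tex_file]


def run_latexindent(tex_file, settings_file, exe="latexindent"):
    """Run latexindent and return the formatted text it writes to standard output."""
    completed = subprocess.run(
        build_command(tex_file, settings_file, exe),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


if __name__ == "__main__":
    # Only prints the command; call run_latexindent(...) to actually execute it.
    print(" ".join(build_command("issue-279.tex", "settings.yaml")))
```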
Thanks for that, I'm not sure I do understand, but I do get that it gives the result I want :)
You've declared lstinline as a verbatim field here and I'm guessing that 0 is a poly-switch (the doco at the reference you give doesn't explicitly say) that is turning off line breaking for this command. That makes it equivalent to foobarbaz, which it didn't know about. So somewhere inside latexindent it does in fact know about lstinline.
The pieces of verbatim text are quite small, so for the problem paragraph I still don't understand why this option had such an outsized effect on the line breaks; there is no verbatim text near the split point.
Happy news is it works for the whole chapter, 1900 output lines, and took ~40 seconds, which is fine for such one-off conversions. I did try to run the dev version you mentioned, but I didn't see a speedup. Big caveat though, I'm not a perl person and I might have screwed up the module paths, using some stuff from the Mac brew installation and some from the GH clone nailed together in an understanding-free manner.
Thanks for your time in answering, and for all the effort you've put into this tool. It is awesome.
Thanks for the follow-up.
By default, lstinline is considered a verbatimCommand; this means that latexindent.pl replaces it with a 'verbatim token', and the body of your text looks like the following:
that belongs to the class LTXIN-TK-COMMAND1-ENDLTXIN-TK-VERBATIM1-ENDgrey value}grey values and are the\ngamma-encodedLTXIN-TK-COMMAND4-END to perform gamma decoding and obtain pixel values proportional to\nscene luminance.} luminance of that point in the original scene. For this 8-bit image the pixel\nvalues vary from 0 (darkest) to 255 (brightest). The image is shown in LTXIN-TK-COMMAND3-END.\n
You can check this for yourself, if you like, by using the -tt switch and then examining the log file, indent.log. I don't recommend using the -tt switch for anything other than debugging/exploration; it slows things down a lot and makes indent.log very big.
Note that the above text is not ideal; the lstinline routine looks for \lstinline immediately followed by 'something', which in your case is {, and then it takes the body as anything up to the next occurrence of {. The documentation needs to be updated to make this clearer.
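The matching behaviour just described can be imitated with a small Python sketch. The regex here is a hypothetical approximation written for illustration, not the pattern latexindent.pl actually uses, and the token name `VERBATIM1` is shortened from the real token format.

```python
import re

# Hypothetical approximation, NOT latexindent.pl's actual regex:
# \lstinline, then one delimiter character, then the shortest body
# that ends at the next occurrence of that same character.
VERBATIM = re.compile(r"\\lstinline(.)(.*?)\1", re.DOTALL)


def tokenise(text):
    """Replace each match with a numbered placeholder token."""
    counter = 0

    def replace(match):
        nonlocal counter
        counter += 1
        return f"VERBATIM{counter}"

    return VERBATIM.sub(replace, text)


text = r"class \FUNCTION{uint8}\lstinline{uint8} -- pixel values or \INDEX{grey value}grey values"
result = tokenise(text)
print(result)
# → class \FUNCTION{uint8}VERBATIM1grey value}grey values
```

With { as the delimiter, the match runs on to the { of \INDEX, so a long stretch of ordinary prose is hidden inside a single token that the text wrapper cannot break. With a delimiter such as ! (as in \lstinline!uint8!), the same pattern stops where intended.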
By using
verbatimCommands:
    lstinline: 0
we tell latexindent.pl not to treat lstinline as a verbatim command and instead, the body of text looks like
that belongs to the class LTXIN-TK-COMMAND1-ENDLTXIN-TK-COMMAND2-END -- the elements of the matrix are\nunsigned 8-bit integers in the interval LTXIN-TK-SPECIAL1-END. The elements are referred to as pixel values or\nLTXIN-TK-COMMAND3-ENDgrey values and are the gamma-encodedLTXIN-TK-COMMAND8-END luminance of that\npoint in the original scene. For this 8-bit image the pixel values vary from 0 (darkest) to 255\n(brightest). The image is shown in LTXIN-TK-COMMAND7-END.\n
If you have the develop version on your machine, then I would expect you to see a significant improvement in performance. You can check which version has been used for your file by checking indent.log; the first few lines are the most relevant to this:
INFO: latexindent.pl version 3.10, 2021-06-19, a script to indent .tex files
latexindent.pl lives here: /home/cmhughes/projects/latexindent/
Sun Jul 11 07:40:53 2021
Filename: issue-279.tex
INFO: Processing switches:
-l|--localSettings: Read localSettings YAML file
-m|--modifylinebreaks: modify line breaks
INFO: Directory for backup files and indent.log: .
INFO: Perl modules are being loaded from the following directories:
/home/cmhughes/perl5/perlbrew/perls/perl-5.30.3/lib/5.30.3/FindBin.pm
/home/cmhughes/perl5/perlbrew/perls/perl-5.30.3/lib/site_perl/5.30.3/YAML/Tiny.pm
/home/cmhughes/perl5/perlbrew/perls/perl-5.30.3/lib/5.30.3/File/Copy.pm
/home/cmhughes/perl5/perlbrew/perls/perl-5.30.3/lib/5.30.3/File/Basename.pm
/home/cmhughes/perl5/perlbrew/perls/perl-5.30.3/lib/5.30.3/Getopt/Long.pm
/home/cmhughes/perl5/perlbrew/perls/perl-5.30.3/lib/site_perl/5.30.3/File/HomeDir.pm
/home/cmhughes/perl5/perlbrew/perls/perl-5.30.3/lib/site_perl/5.30.3/x86_64-linux/Unicode/GCString.pm
INFO: LatexIndent perl modules are being loaded from, for example:
/home/cmhughes/projects/latexindent/LatexIndent/Document.pm
I hope this helps :)
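If you would rather check the version programmatically, a small Python sketch is below. It assumes the first line of indent.log has the same shape as the sample shown above; the function name `latexindent_version` is made up for illustration.

```python
import re


def latexindent_version(log_text):
    """Return (version, date) from the log's version line, or None if it is absent."""
    match = re.search(
        r"latexindent\.pl version ([\d.]+), (\d{4}-\d{2}-\d{2})", log_text
    )
    return match.groups() if match else None


sample = "INFO: latexindent.pl version 3.10, 2021-06-19, a script to indent .tex files"
print(latexindent_version(sample))
# → ('3.10', '2021-06-19')
```

To use it on a real run, pass the contents of indent.log, for example `latexindent_version(open("indent.log", encoding="utf-8").read())`.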
Assuming that the verbatim extends to the next { makes it seem much longer than it actually is. Also if there was a { inside the verbatim text it would be truncated. Nested brackets are hard/impossible to do with regexps.
Thanks for the tip about which versions of code are loaded, that's not urgent for me now, but I'll have a tinker.
Following the details above, explicitly https://github.com/cmhughes/latexindent.pl/commit/b7790e8707eb261fbe74f1afbc5e44f7e17a3871, I'm going to label this as implemented. It'll be noted in the next release; I'll leave this open until released (hopefully soon).
Released as of https://github.com/cmhughes/latexindent.pl/releases/tag/V3.10.1, uploaded to ctan, you should be able to get it using your TeX distribution manager within about 24 hours, ctan allowing.