Open kmccurley opened 1 year ago
Make sure to also filter out "comments" produced by other means such as \iffalse blocks.
There is a Python library that does something like this, and I think we should use that. It can filter out things like \begin{comment} ... \end{comment}, which is provided by the LaTeX comment package.
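For illustration only, here is a minimal sketch of what filtering these two kinds of hidden blocks could look like. The helper name strip_hidden_blocks is hypothetical and is not part of arxiv_latex_cleaner. It assumes the blocks are not nested, and it does not handle an \if...\fi pair sitting inside an \iffalse block.

```python
import re

# Hypothetical helper, not an arxiv_latex_cleaner API.
# Non-greedy matches, so each block ends at the first closing delimiter.
COMMENT_ENV = re.compile(r'\\begin\{comment\}.*?\\end\{comment\}', re.DOTALL)
IFFALSE = re.compile(r'\\iffalse\b.*?\\fi\b', re.DOTALL)


def strip_hidden_blocks(text: str) -> str:
    """Remove comment environments and \\iffalse ... \\fi blocks (no nesting)."""
    text = COMMENT_ENV.sub('', text)
    return IFFALSE.sub('', text)


if __name__ == '__main__':
    print(strip_hidden_blocks('a\\iffalse hidden \\fi b'))
```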
Someone could have used the todonotes package with \usepackage[disable]{todonotes} and have their todo notes not show up. One easy way to overcome this is to remove everything like \todo{...} but I'm not sure if the arxiv_latex_cleaner handles optional arguments on a macro.
This issue is solved by downstream processing of the \jobname.abstract file. It turns out that \todo[inline]{this is a todo} will not be removed, because arxiv_latex_cleaner does not support this.
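If we end up handling \todo ourselves in the downstream step, something along these lines might work. This is only a sketch: strip_todos is a hypothetical helper, it assumes the optional argument contains no nested brackets, and on an unbalanced brace it drops everything to the end of the text.

```python
import re

# Matches \todo, an optional [..] argument, and the opening brace.
TODO_START = re.compile(r'\\todo\s*(\[[^\]]*\])?\s*\{')


def strip_todos(text: str) -> str:
    """Remove \\todo{...} and \\todo[opts]{...}, allowing nested braces."""
    out = []
    pos = 0
    while True:
        m = TODO_START.search(text, pos)
        if m is None:
            out.append(text[pos:])
            break
        out.append(text[pos:m.start()])
        depth = 1
        i = m.end()
        # Scan forward until the brace opened by \todo is closed.
        while i < len(text) and depth > 0:
            ch = text[i]
            if ch == '\\':
                i += 2  # skip an escaped character such as \{ or \}
                continue
            if ch == '{':
                depth += 1
            elif ch == '}':
                depth -= 1
            i += 1
        pos = i
    return ''.join(out)
```

Macros whose names merely start with "todo" (for example the package name in \usepackage[disable]{todonotes}) are left alone, because the pattern requires \todo to be followed directly by its arguments.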
This issue was moved because it's not easy to clean up abstracts within the LaTeX programming environment, so it's better left to the downstream processing. As it turns out, we now use arxiv-latex-cleaner to remove comments, and it works pretty well. Note, however, that a trailing % on a line is not removed by this module, because this is used in LaTeX to inhibit the introduction of undesired white space (particularly within macro definitions).
The purpose of extracting the abstract is to fulfill multiple purposes in downstream processing. The abstract will be used as follows:
In each case we require different kinds of escaping for special characters. For example, in XML or HTML, we need to escape a < character so that it isn't confused with the beginning of an XML tag. This can be accomplished with a CDATA section in XML, and use of &lt; in HTML. Note that the abstract parts of JATS and crossref expect to find <p> subtags in them to separate paragraphs. At the current time I split on '\n\n' and join them with <p> tags. In this case we should probably remove any trailing % symbol that is not preceded by \.
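A rough sketch of the split-and-join step with escaping, for the HTML case. The function name abstract_to_paragraphs is hypothetical, the paragraph tag is assumed to be <p>, and the blank-line pattern is slightly more tolerant than a literal '\n\n' because it also accepts blank lines that contain spaces.

```python
import html
import re


def abstract_to_paragraphs(abstract: str) -> str:
    """Split on blank lines, escape &, < and >, and wrap each paragraph in <p>."""
    paragraphs = [p.strip() for p in re.split(r'\n\s*\n', abstract) if p.strip()]
    # quote=False leaves quotation marks alone; only &, < and > are escaped.
    return '\n'.join(f'<p>{html.escape(p, quote=False)}</p>' for p in paragraphs)


if __name__ == '__main__':
    print(abstract_to_paragraphs('We show a < b.\n\nSecond paragraph.'))
```

The JATS and crossref outputs would need their own escaping and tag names, so this covers only one of the uses.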
Perhaps the most difficult part is to display the abstract in a sensible way in the web page for the paper. We typically use MathJax to render mathematics, so that simple things like $\zeta(s^2)$ would be rendered correctly. MathJax is not a full LaTeX parser, but is able to deal with quite a bit of LaTeX formatting. Unfortunately MathJax removed word-wrap in 3.0, so what shows up in HTML will include newline characters. I tried using the CSS of white-space:pre-wrap; but that obliterates paragraphs. I'll have to write some better tests for handling this, because abstracts need to be readable. Nobody is perfect on this: see what Google displays in Scholar. The vast majority of abstracts in arXiv have only a single paragraph, but some have multiple paragraphs, which they handle. They also do a pretty good job with display mathematics. They have interesting instructions on preparation of abstracts, saying that newlines are stripped unless they are followed by whitespace (e.g., indentation).
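To make the newline rule concrete, here is one possible reading of it as code. This is my interpretation of the instructions as paraphrased above, not arXiv's actual implementation. The name unwrap_lines is hypothetical, and replacing a stripped newline with a single space (rather than nothing) is an assumption so that words on adjacent lines do not run together.

```python
import re


def unwrap_lines(text: str) -> str:
    """Replace a newline with a space unless the next character is whitespace."""
    # The negative lookahead keeps newlines that precede indentation or a blank line.
    return re.sub(r'\n(?!\s)', ' ', text)
```

With this reading, a line break inside a paragraph is dropped, while a line break followed by an indented line survives.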
The current implementation of iacrcc.cls does a direct copy of the contents of the abstract environment into the \jobname.abstract file. As a result, it captures LaTeX comments, which is unfortunate. The implementation uses the \VerbatimOut{} macro from fancyvrb, which does a complete copy. The fancyvrb package has the commentchar argument to the Verbatim environment, but that doesn't work on the \VerbatimOut environment (it works on \VerbatimInput).
There are several possible solutions:
fancyvrb.
commentchar option to \VerbatimOut. This is risky and not advisable in case fancyvrb is updated.
% as starting a comment but not \%. We could use re.split(r'(^%|[^\\]%)', line) on all of the lines in the file, and just take the first part if it's nonempty.
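One caveat with the split pattern above: [^\\]% consumes the character in front of the %, so the first part loses that character (for 'abc% note' it yields 'ab'). A sketch that avoids this with a negative lookbehind is below. The name strip_comment is hypothetical, and the sketch still treats \\% (an escaped backslash followed by a real comment) as an escaped percent sign, which is wrong in that corner case.

```python
import re

# A % starts a comment unless the character right before it is a backslash.
COMMENT = re.compile(r'(?<!\\)%.*')


def strip_comment(line: str) -> str:
    """Drop an unescaped % and everything after it on the line."""
    return COMMENT.sub('', line)


def strip_comments(text: str) -> str:
    """Apply strip_comment to every line of a file's contents."""
    return '\n'.join(strip_comment(line) for line in text.split('\n'))
```

Note that this also removes a trailing % at the end of a line, which may or may not be what we want given that it is used to suppress white space.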