IACR / latex-submit

Web server to receive uploaded LaTeX and execute it in a docker container.
GNU Affero General Public License v3.0

Abstract should omit LaTeX comments #58

Open kmccurley opened 1 year ago

kmccurley commented 1 year ago

The current implementation of iacrcc.cls does a direct copy of the contents of the abstract environment into the \jobname.abstract file. As a result, it captures LaTeX comments, which is unfortunate. The implementation uses the VerbatimOut environment from fancyvrb, which performs a complete verbatim copy. The fancyvrb package has a commentchar option for the Verbatim environment, but that option does not work on VerbatimOut (it does work on \VerbatimInput).

There are several possible solutions:

  1. switch to something that strips comments out instead of fancyvrb.
  2. add a commentchar option to VerbatimOut ourselves. This is risky and not advisable, because the patch could break whenever fancyvrb is updated.
  3. eliminate the comments when we use python to parse the abstract. This is only a little complicated, because we need to treat % as starting a comment but not \%. We could use re.split(r'(^%|[^\\]%)', line) on each line of the file and keep only the first part if it is nonempty (see the sketch after this list).
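
A minimal sketch of option 3, assuming the abstract is read from \jobname.abstract line by line; the helper name strip_comment is illustrative, not the actual implementation:

    import re

    # First % that is not escaped as \%.  (Like the regex above, this treats
    # the rare \\% sequence as escaped, which is a known simplification.)
    COMMENT_RE = re.compile(r'(?<!\\)%')

    def strip_comment(line):
        """Return the line with any unescaped %-comment removed."""
        m = COMMENT_RE.search(line)
        return line if m is None else line[:m.start()]

    # strip_comment(r'We prove tight security.  % TODO: cite')
    #   -> 'We prove tight security.  '
    # strip_comment(r'a 50\% advantage')  -> 'a 50\% advantage'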
jwbos commented 1 year ago

Make sure to also filter out "comments" produced by other means, such as \iffalse ... \fi blocks.

kmccurley commented 1 year ago

There is a python library that does something like this, and I think we should use that. It can filter out things like \begin{comment} ... \end{comment} blocks provided by the LaTeX comment package.

kmccurley commented 1 year ago

Someone could have used the todonotes package with \usepackage[disable]{todonotes} so that their todo notes do not show up in the output. One easy way to overcome this is to remove everything like \todo{...}, but I'm not sure whether arxiv_latex_cleaner handles optional arguments on a macro.
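
If we had to strip them ourselves, a rough sketch (assuming \todo arguments never contain nested braces; the name strip_todos is hypothetical) might look like:

    import re

    # Remove \todo{...} and \todo[inline]{...}.  Nested braces inside the
    # arguments would defeat this regex and would need a real parser.
    TODO_RE = re.compile(r'\\todo(\[[^\]]*\])?\{[^{}]*\}')

    def strip_todos(text):
        return TODO_RE.sub('', text)

    # strip_todos(r'Done.\todo[inline]{check the constant} Next.')
    #   -> 'Done. Next.'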

kmccurley commented 1 year ago

This issue is solved by downstream processing of the \jobname.abstract file. It turns out that \todo[inline]{this is a todo} will not be removed, because arxiv_latex_cleaner does not support this.

kmccurley commented 9 months ago

This issue was moved because it's not easy to clean up abstracts within the LaTeX programming environment, so it's better left to downstream processing. As it turns out, we now use arxiv-latex-cleaner to remove comments, and it works pretty well. Note, however, that a trailing % on a line is not removed by this module, because a trailing % is used in LaTeX to inhibit the introduction of unwanted white space (particularly within macro definitions).
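
That remaining trailing-% case is easy to handle after the fact. A possible sketch, assuming the cleaned abstract is available as a single string (the helper name is hypothetical):

    import re

    # A bare % at the end of a line only suppresses white space in LaTeX,
    # so it can be dropped from the extracted abstract; an escaped \% stays.
    TRAILING_PERCENT = re.compile(r'(?<!\\)%\s*$')

    def drop_trailing_percent(abstract):
        return '\n'.join(TRAILING_PERCENT.sub('', line)
                         for line in abstract.splitlines())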

The extracted abstract serves multiple purposes in downstream processing. It will be used as follows:

  1. the web environment, to show the abstract as HTML.
  2. the web environment, to populate the DC:Description tag. This can be used by Google Scholar when it crawls the site, but it need not be faithful to paragraph boundaries or mathematics.
  3. the crossref reporting schema, which allows harvesters to extract the abstract.
  4. OAI-PMH harvesting.

In each case we require different kinds of escaping for special characters. For example, in XML or HTML we need to escape a < character so that it isn't confused with the beginning of a tag. This can be accomplished with a CDATA section in XML, or with the &lt; entity in HTML. Note that the abstract elements in JATS and crossref expect to find <p> subtags in them to separate paragraphs. At the current time I split on '\n\n' and join the pieces with <p> tags. In this case we should probably also remove any trailing % symbol that is not preceded by a backslash.
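
A minimal sketch of that escaping and paragraph handling, assuming the abstract has already been cleaned as above; the function name is hypothetical, and the tag is a parameter because the web page wants p while the crossref deposit presumably wants jats:p:

    import html

    def abstract_to_markup(abstract, tag='p'):
        """Escape <, >, & and wrap blank-line-separated paragraphs in <tag>."""
        pieces = []
        for par in abstract.split('\n\n'):
            par = par.strip()
            if par:
                pieces.append('<{0}>{1}</{0}>'.format(tag, html.escape(par)))
        return '\n'.join(pieces)

    # abstract_to_markup('First paragraph.\n\nSecond paragraph.')
    #   -> '<p>First paragraph.</p>\n<p>Second paragraph.</p>'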

Perhaps the most difficult part is to display the abstract in a sensible way on the web page for the paper. We typically use MathJax to render mathematics, so that simple things like $\zeta(s^2)$ are rendered correctly. MathJax is not a full LaTeX parser, but it can deal with quite a bit of LaTeX formatting. Unfortunately MathJax removed word-wrap in 3.0, so what shows up in HTML will include newline characters. I tried using the CSS white-space: pre-wrap;, but that obliterates paragraphs. I'll have to write some better tests for handling this, because abstracts need to be readable.

Nobody is perfect on this - see what Google displays in Scholar. The vast majority of abstracts in arXiv have only a single paragraph, but arXiv also handles the ones with multiple paragraphs, and they do a pretty good job with display mathematics. They have interesting instructions on the preparation of abstracts, saying that newlines are stripped unless they are followed by whitespace (e.g., indentation).
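
One way to tame the stray newlines before handing the abstract to MathJax, sketched under the assumption that blank lines are the only intended paragraph breaks (the function name is hypothetical):

    import re

    def reflow_paragraphs(abstract):
        """Collapse newlines and runs of spaces inside each paragraph to a
        single space, keeping blank lines as paragraph separators."""
        paragraphs = re.split(r'\n\s*\n', abstract)
        return '\n\n'.join(' '.join(p.split()) for p in paragraphs if p.strip())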