brucemiller / LaTeXML

LaTeXML: a TeX and LaTeX to XML/HTML/ePub/MathML translator.
http://dlmf.nist.gov/LaTeXML/

Deactivate underscores when expanding natbib's \bibitem[label] #2385

Open dginev opened 3 months ago

dginev commented 3 months ago

This is a minor change avoiding a needless error in natbib's \bibitem.

A minimal motivating example (that I could turn into a test) is:

\documentclass{article}
\usepackage{natbib}
\begin{document}

\begin{thebibliography}{1}

\bibitem[_xy(1899)]{_xyz_1899}
 A name of something. (accessed Nov 01, 1899).

\end{thebibliography}
\end{document}

Note the underscores in the \bibitem use, especially the one in the optional label [] argument. These pass through pdflatex without error and, as far as I can tell, are largely ignored, at least in the specific document I am studying that uses this.

With the current latexml master, this example produces two unfortunate errors of the kind:

Error:unexpected:_ Script _ can only appear in math mode at test.tex; line 7 col 0
Error:unexpected:_ Script _ can only appear in math mode at test.tex; line 7 col 0

The PR simply switches the offending argument to Semiverbatim in the natbib parser, deactivating the underscore's math behavior.

dginev commented 3 months ago

This idea may warrant some extra discussion... If we want math mode constructs to still expand in this argument, such as:

\bibitem[Ex$\ddot{a}$mple(1899)]{...}

then Semiverbatim is a poor fit: it deactivates the $, yet the \ddot will still be expanded by the natbib Expand($label) call.

Maybe I should invent a new parameter type, which only deactivates the underscore? Thoughts welcome. That would look a bit more along the lines of:

DefParameterType('NatbibSemiVerbatim', sub {
    # Read the argument, then replace each subscript token (T_SUB)
    # with an inactive underscore of catcode "other".
    my $arg      = $_[0]->readArg;
    my @inactive = map { Equals($_, T_SUB) ? T_OTHER("_") : $_ } $arg->unlist;
    return Tokens(@inactive); });

Edit: a slightly more direct version of a new parameter, which only deactivates underscore. A bit patchy possibly, but it is a little unclear which behavior natbib is aiming for exactly.

dginev commented 2 months ago

The general observation is that when a label like this is used in natbib's \bibitem[label], but its entry isn't cited, pdflatex won't emit an error. I believe this is because the data is written out via \NAT@wrout, which won't trigger expansion. Only after the written data is read back in (usually on the next pdflatex run) could issues with underscore activation come up, and only if \cite uses that entry.

So, for now, I have decided not to change the parameter types, but instead to guard LaTeXML's emulation, which uses an explicit Expand() call. Deactivating the underscores beforehand is sufficient.

I also added a test for this kind of tortured use case.

brucemiller commented 2 months ago

Your last observation almost gets it, I think. This label argument is expanded before it is written to the aux file, but it is not digested until later, and only if the bibitem is cited. That means undefined macros or # will cause immediate problems during latex's expansion of the label, but tokens that only affect digestion will pass through until they're cited, if ever! So not just _, but also ^, & or even a single $ (or really any sequence that can't be digested) would be ignored by latex if not cited, but (currently) cause problems for LaTeXML. Moreover, _ itself isn't the problem; it's fine inside of, say, \bibitem[foo$a_b$(1999)]{underscore}. Arguably these documents are in "Error", even if they don't cause errors, so I wonder how deep we should go. But if we were to try to fix it, I think we need to track where the label gets digested and use some kind of error-free digestion(?)
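
A minimal sketch of the cited/uncited distinction described above, offered as a possible test case rather than a verified one. The keys, labels and entry text are made up for illustration; the expectation that only the cited entry causes trouble (and only on a second pdflatex run, once the aux file is read back) follows from the reasoning in this thread and has not been checked here.

```latex
\documentclass{article}
\usepackage{natbib}
\begin{document}

% Only this entry is cited, so only its label should ever be digested.
Some text \citep{cited_1899}.

\begin{thebibliography}{2}

% Bare ^ outside math: expected to fail once the citation is typeset.
\bibitem[a^b(1899)]{cited_1899}
 A cited entry with a bare caret in its label.

% Same kind of label, but never cited: expected to pass silently in pdflatex.
\bibitem[c^d(1900)]{uncited_1900}
 An uncited entry with a bare caret in its label.

\end{thebibliography}
\end{document}
```

If the reasoning holds, an emulation that avoids false errors would have to defer digestion of the label in the same way, rather than special-casing the underscore.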

dginev commented 2 months ago

Good point, we should be approaching this even more generally. Having a dedicated parameter type that "postpones" the errors of certain Digest steps could be tricky... But maybe there is something there.

We have a natural place to anchor such a new parameter, at the DefConstructor for \NAT@@wrout. It may be worth playing around a bit with the example I had concocted. I'll investigate.