adsabs / bumblebee

🐝 Clever face for ADS
https://ui.adsabs.harvard.edu
GNU General Public License v2.0

LaTeX in titles is wrongly escaped #2135

Open michamos opened 3 years ago

michamos commented 3 years ago

While trying to improve the handling of LaTeX in arXiv titles in INSPIRE, I checked how you're doing things and noticed that you're escaping backslashes incorrectly, which produces invalid LaTeX. (arXiv titles are basically unstructured strings that may or may not contain LaTeX macros, which is a problem when you want to make sense of them in BibTeX/LaTeX citation snippets.)

Apologies if this is not the right place to report this bug; I know nothing about your architecture, and this repo seemed the most active.

Expected Behavior

The BibTeX output would contain something like

title = "{Measurement of the $\Sigma$ beam asymmetry for the $\omega$ photo-production off the proton and the neutron at GRAAL}"

in order to get a compilable title (Greek letters are not allowed outside of math mode by default). That's very hard to achieve as you'd need to somehow interpret the title. A valid fix to your current approach would be

title = "{Measurement of the \textbackslash{}Sigma\textbackslash{} beam asymmetry for the \textbackslash{}omega\textbackslash{} photo-production off the proton and the neutron at GRAAL}"

Note the addition of {} after the inserted macros to make sure they're not glued to the next word and spaces after the macro don't get swallowed.

Actual Behavior

The BibTeX output contains

title = "{Measurement of the \textbackslashSigma\textbackslash beam asymmetry for the \textbackslashomega\textbackslash photo-production off the proton and the neutron at GRAAL}"

but \textbackslashSigma is not a valid macro, and \textbackslash beam eats the space, producing \beam in the output.
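To make the difference concrete, here is a minimal Python sketch of the two escaping variants. The function name `escape_backslashes` and its `terminate` flag are hypothetical and are not taken from the ADS export code; the sketch only reproduces the two outputs quoted above.

```python
def escape_backslashes(text: str, terminate: bool = True) -> str:
    """Replace every literal backslash with a macro that prints a backslash.

    With terminate=True the macro is closed with {} so it cannot merge
    with the following letters or swallow the following space.
    """
    macro = r"\textbackslash{}" if terminate else r"\textbackslash"
    return text.replace("\\", macro)


title = r"the \Sigma\ beam asymmetry"

# Unterminated: yields the invalid macro name \textbackslashSigma.
print(escape_backslashes(title, terminate=False))
# the \textbackslashSigma\textbackslash beam asymmetry

# Terminated: every inserted macro is a complete token.
print(escape_backslashes(title))
# the \textbackslash{}Sigma\textbackslash{} beam asymmetry
```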

Steps to Reproduce

Go to https://ui.adsabs.harvard.edu/abs/2013arXiv1306.5943V/exportcitation and look at the title in the BibTeX snippet.

marblestation commented 3 years ago

Not sure if this user-reported problem is an export or data problem actually @aaccomazzi @golnazads

golnazads commented 3 years ago

@marblestation I am adding an author format to export today. I'll see if I can fix this. Good timing bringing this up.

michamos commented 3 years ago

Hi @marblestation,

(I'm not really a user, I'm part of the team running INSPIRE, and was looking at how you're doing things because we had similar issues). I think you're asking a very good question, and I think it's actually both a data and an export problem.

The export problem is that your LaTeX escaping is incorrect: it generates undefined macro names because the inserted macros aren't separated correctly from the next LaTeX token.

The root data problem comes from arXiv, where there is no guarantee about whether the titles (and other fields such as abstracts or comments) contain LaTeX macros outside of math mode. They officially support a limited number of escape sequences to compensate for their lack of Unicode support (such as \"o to write ö as in Schrödinger), but they actually support more (such as Greek letters, so \mu gets rendered as μ on the arXiv splash page, even though out-of-the-box LaTeX will refuse to compile that, as Greek macros are allowed only in math mode). On top of that, many records use TeX macros outside of math mode to convey some information, even if they don't render nicely on the arXiv side. That's not an issue there, but it becomes an issue for downstream services such as INSPIRE or ADS, which offer LaTeX-based export formats and need to know whether a title is valid LaTeX to decide whether to escape it (as you're trying to do) or simply pass it through.

FYI, the strategy we've adopted is two-fold. During harvesting, we decode LaTeX macros outside of math mode when that's possible without too much loss (we're using a suitably configured pylatexenc for that: https://github.com/inspirehep/hepcrawl/blob/2f2b0fb2251700a08ffe75394c0b19980267e8b3/hepcrawl/parsers/arxiv.py#L49-L91). The untranslated bits might be valid macros, but there's no way to know, so we use pylatexenc again to encode the whole thing when generating the LaTeX export formats (https://github.com/inspirehep/inspirehep/blob/9e8060d78172520614bf007bf3ba8291da5288dd/backend/inspirehep/records/marshmallow/literature/utils.py#L37-L66).

Let me know if you have questions, we've thought about this issue quite a lot and have gone through several iterations before landing on this solution, see https://github.com/inspirehep/hepcrawl/pull/299 for more info and test cases.
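The decode-then-encode strategy described above can be sketched roughly as follows. This is a standard-library toy, not pylatexenc and not INSPIRE's code: the two lookup tables and both function names are hypothetical, cover only a handful of characters, and ignore math-mode spans entirely. The point is only the shape of the pipeline, where recognised macros become Unicode at harvest time and everything left over is escaped at export time.

```python
import re

# Hypothetical, deliberately tiny tables; a real library covers far more.
DECODABLE = {r'\"o': "ö", r"\mu": "μ"}
ENCODABLE = {"ö": r'\"o', "μ": r"$\mu$"}

# An umlaut accent followed by a letter, or a backslash followed by letters.
MACRO = re.compile(r'\\"[A-Za-z]|\\[A-Za-z]+')


def decode_known_macros(text: str) -> str:
    """Harvest step: turn recognised macros into Unicode, keep the rest."""
    return MACRO.sub(lambda m: DECODABLE.get(m.group(0), m.group(0)), text)


def encode_for_latex(text: str) -> str:
    """Export step: escape leftover backslashes, then re-encode Unicode."""
    out = text.replace("\\", r"\textbackslash{}")
    for char, macro in ENCODABLE.items():
        out = out.replace(char, macro)
    return out


stored = decode_known_macros(r'Schr\"odinger and the \mu meson \foo')
print(stored)                    # Schrödinger and the μ meson \foo
print(encode_for_latex(stored))  # Schr\"odinger and the $\mu$ meson \textbackslash{}foo
```

Backslashes are escaped before the Unicode characters are re-encoded, so the macros produced by the second step are not escaped themselves.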

golnazads commented 3 years ago

Just checked this record in our database, and this is what I get for the title: "title":["Measurement of the \\Sigma\\ beam asymmetry for the \\omega\\ photo-production off the proton and the neutron at GRAAL"]. So basically, by the time export sees the title, there are double slashes instead of $. If it makes sense, I can replace the double slash with \textbackslash{}. Not sure if all double slashes could be mapped this way, though.

aaccomazzi commented 3 years ago

The actual text in the title field is this: Measurement of the \Sigma\ beam asymmetry for the \omega\ photo-production off the proton and the neutron at GRAAL. So these are single backslashes; they appear doubled within JSON because the first backslash is an escape character.
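The JSON point is easy to check with Python's standard `json` module. The snippet below uses a shortened, made-up document with the same shape as the record quoted earlier; only the `title` key is taken from the thread.

```python
import json

# Serialized form as it would appear on the wire: each backslash is doubled.
serialized = r'{"title": ["the \\Sigma\\ beam asymmetry"]}'

title = json.loads(serialized)["title"][0]

print(serialized.count("\\"))  # 4 backslashes in the JSON text
print(title.count("\\"))       # 2 backslashes in the decoded string
print(title)                   # the \Sigma\ beam asymmetry
```

So an exporter that works on the decoded string only ever sees single backslashes.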

Note that the data problem comes directly from arXiv, as there are no math-mode delimiters in the title surrounding the Greek letters.

But to start, we should do what Micha suggested: outside of math mode, replace a single backslash with \textbackslash{} rather than \textbackslash.
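A minimal sketch of that rule in Python, assuming math mode is delimited only by single `$...$` pairs. The function name `escape_outside_math` is hypothetical, and the sketch does not handle `\(...\)`, `$$...$$`, or an escaped `\$`.

```python
import re

# One capturing group, so re.split keeps the math spans in the result.
MATH_SPAN = re.compile(r"(\$[^$]+\$)")


def escape_outside_math(text: str) -> str:
    """Escape backslashes in text spans, leave $...$ spans untouched."""
    parts = MATH_SPAN.split(text)
    # re.split alternates: text, math, text, math, ... (odd indexes are math)
    return "".join(
        part if i % 2 else part.replace("\\", r"\textbackslash{}")
        for i, part in enumerate(parts)
    )


print(escape_outside_math(r"the $\Sigma$ beam and \omega\ photo-production"))
# the $\Sigma$ beam and \textbackslash{}omega\textbackslash{} photo-production
```

An unmatched `$` is treated as ordinary text here, so backslashes around it are still escaped.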