Open michamos opened 3 years ago
Not sure if this user-reported problem is an export or data problem actually @aaccomazzi @golnazads
@marblestation I am adding an author format to export today. I'll see whether I can fix this at the same time. Good timing bringing this up.
Hi @marblestation,
(I'm not really a user, I'm part of the team running INSPIRE, and was looking at how you're doing things because we had similar issues). I think you're asking a very good question, and I think it's actually both a data and an export problem.
The export problem is that your LaTeX escaping is incorrect, generating macro names that are not defined as you don't separate them from the next LaTeX token correctly.
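To illustrate why the separation matters, here is a minimal Python sketch of how TeX reads a control word: a backslash followed by the longest run of letters, with any spaces after it skipped. The `control_words` helper and its regular expression are hypothetical and only model this one tokenization rule; they are not taken from the ADS export code.

```python
import re

# Simplified model of TeX control-word tokenization: a backslash, then the
# longest run of ASCII letters, then any following spaces (which TeX skips).
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)( *)")


def control_words(latex):
    """Return (macro_name, swallowed_space) for each control word found."""
    return [(m.group(1), bool(m.group(2))) for m in CONTROL_WORD.finditer(latex)]


broken = r"\textbackslashSigma\textbackslash beam"
fixed = r"\textbackslash{}Sigma\textbackslash{} beam"

# The first macro name absorbs "Sigma"; the second swallows the space.
print(control_words(broken))
# With "{}" each macro name ends at the brace and no space is swallowed.
print(control_words(fixed))
```

In the broken string the first macro is read as `textbackslashSigma`, which is undefined, and the second one consumes the space before `beam`. With `{}` appended, both macros are read as `textbackslash` and the space survives.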
The root data problem is coming from arXiv, where there is no guarantee about whether the titles (and other fields such as abstracts or comments) contain LaTeX macros outside of math mode. They officially support a limited number of escape sequences to compensate for their lack of unicode support (such as `\"o` to write ö as in Schrödinger) but they actually support more (such as Greek letters, so `\mu` gets rendered as μ on the arXiv splash page, even if out-of-the-box LaTeX will refuse to compile that, as Greek macros are allowed only in math mode). On top of that, many records use TeX macros outside of math mode to convey some information, even if they don't render nicely on the arXiv side. That's not an issue there, but it becomes an issue for downstream services such as INSPIRE or ADS, which offer LaTeX-based export formats and need to know whether a title is valid LaTeX to decide whether to escape it (as you're trying to do), or simply pass it through.
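FYI, the strategy we've adopted is two-fold: we try to decode LaTeX macros outside of math mode when it's possible to do so without too much loss during harvesting (we're using a suitably configured `pylatexenc` for that, see https://github.com/inspirehep/hepcrawl/blob/2f2b0fb2251700a08ffe75394c0b19980267e8b3/hepcrawl/parsers/arxiv.py#L49-L91). The untranslated bits might be valid macros but there's no way to know, so we use `pylatexenc` again to encode the whole thing when generating the LaTeX export formats (see https://github.com/inspirehep/inspirehep/blob/9e8060d78172520614bf007bf3ba8291da5288dd/backend/inspirehep/records/marshmallow/literature/utils.py#L37-L66).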
Let me know if you have questions, we've thought about this issue quite a lot and have gone through several iterations before landing on this solution, see https://github.com/inspirehep/hepcrawl/pull/299 for more info and test cases.
Just checked this record in our database and this is what I get for the title: "title":["Measurement of the \\Sigma\\ beam asymmetry for the \\omega\\ photo-production off the proton and the neutron at GRAAL"]. So basically, by the time export sees the title there are double slashes instead of $. If it makes sense, I can replace each double slash with \textbackslash{}. Not sure if all double slashes could be mapped this way though.
The actual text in the title field is this: `Measurement of the \Sigma\ beam asymmetry for the \omega\ photo-production off the proton and the neutron at GRAAL`, so these are single slashes (they appear double within JSON since the first backslash is an escape).
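A quick way to check this, sketched in Python with the standard `json` module (the shortened record below is made up for illustration; only the `title` key mirrors the snippet quoted above):

```python
import json

# Raw JSON text as it would appear in a dump: each backslash is written twice.
raw = r'{"title": ["Measurement of the \\Sigma\\ beam asymmetry"]}'

# After parsing, the string holds single backslashes.
title = json.loads(raw)["title"][0]
print(title)

# Serializing again doubles them, which is what the database dump shows.
print(json.dumps(title))
```

So the doubled backslashes are purely an artifact of the JSON serialization, and the export code receives a title with one backslash before and after each Greek letter name.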
Note that the data problem comes directly from arXiv, as there are no math-mode delimiters in the title surrounding the Greek letters.
But to start, we should do what Micha suggested: outside of math mode, replace a single backslash with `\textbackslash{}` rather than `\textbackslash`.
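A minimal sketch of that replacement in Python follows. It is not the ADS export code: `escape_backslashes` is a hypothetical helper, and it assumes inline math is delimited by single unescaped `$` signs (it does not handle `\(...\)`, `$$...$$`, or an escaped `\$` in text).

```python
import re

# Assumption: math segments look like $...$ with no nested dollar signs.
MATH = re.compile(r"(\$[^$]*\$)")


def escape_backslashes(title):
    """Replace backslashes outside $...$ with \\textbackslash{}."""
    parts = MATH.split(title)
    # re.split with one capturing group puts math segments at odd indices.
    return "".join(
        part if i % 2 else part.replace("\\", r"\textbackslash{}")
        for i, part in enumerate(parts)
    )


print(escape_backslashes(r"the \Sigma\ beam asymmetry"))
print(escape_backslashes(r"the $\Sigma$ beam asymmetry"))
```

The first call escapes both backslashes and keeps the space before `beam`; the second leaves the math segment untouched. This only makes the output compile and show the title verbatim; it does not turn `\Sigma` into a Greek letter.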
While trying to improve the handling of LaTeX in arXiv titles in INSPIRE, I checked how you're doing things and noticed that you're wrongly escaping backslashes, producing invalid LaTeX. (These titles are basically unstructured strings that might or might not contain LaTeX macros, which is problematic when trying to use them in BibTeX/LaTeX citation snippets.)
Apologies if this is not the right place to report this bug, I know nothing about your architecture and this repo seemed the most active.
Expected Behavior
bibtex would contain something like
in order to get a compilable title (Greek letters are not allowed outside of math mode by default). That's very hard to achieve as you'd need to somehow interpret the title. A valid fix to your current approach would be
Note the addition of `{}` after the inserted macros to make sure they're not glued to the next word and spaces after the macro don't get swallowed.
Actual Behavior
bibtex output contains
but `\textbackslashSigma` is not a valid macro, and `\textbackslash beam` eats the space, producing `\beam` in the output.
Steps to Reproduce
Go to https://ui.adsabs.harvard.edu/abs/2013arXiv1306.5943V/exportcitation and look at the `title` in the bibtex snippet.