brucemiller / LaTeXML

LaTeXML: a TeX and LaTeX to XML/HTML/ePub/MathML translator.
http://dlmf.nist.gov/LaTeXML/

Bib-file parsing issues #917

Closed asmaier closed 6 years ago

asmaier commented 6 years ago

I encountered the following errors when running latexmlpost on one of my *.bib files. The original bibtex processed the file without problems, so I cannot say whether bibtex is too sloppy or latexmlpost is too strict. However, these errors can cause follow-up errors during parsing, and latexmlpost seems to have a hardcoded limit of 100 errors above which it stops processing the bibliography, so they can prevent a *.bib file from being converted at all. So I document them here:

  1. A % sign in any field other than url causes problems, e.g.:

    @ARTICLE{Bryan1997,
    author = {Bryan, Greg L. and Norman, Michael L.},
    title = {{A Hybrid AMR Application for Cosmology and Astrophysics}},
    year = {1997},
    eprint = {astro-ph/9710187},
    pdf = {Bryan1997.pdf},
    slaccitation = {%%CITATION = ASTRO-PH 9710187;%%},
    }
    @ARTICLE{Ensslin2006,
    author = {Enßlin, T.~A. and Vogt, C.},
    title = {{Magnetic turbulence in cool cores of galaxy clusters}},
    journal = {A\&A},
    year = {2006},
    volume = {453},
    pages = {447-458},
    month = jul,
    adsnote = {Provided by the SAO/NASA Astrophysics Data System},
    adsurl = {http://adsabs.harvard.edu/abs/2006A%26A...453..447E},
    doi = {10.1051/0004-6361:20053518},
    eprint = {arXiv:astro-ph/0505517},
    keywords = {galaxies: cluster: general, cooling flows, magnetic
     fields, turbulence, X-rays: galaxies: clusters, intergalactic medium},
    }

    Removing the field slaccitation and renaming the field adsurl to url fixed the errors.

  2. Be careful with the field month:

    @ARTICLE{Kiessling2003,
    author = {Kiessling, M.K.-H.},
    title = {{The ''Jeans swindle'' - A true story-mathematically speaking}},
    journal = {Advances in Applied Mathematics},
    year = {2003},
    volume = {31},
    pages = {132-149(18)},
    month = july,
    doi = {doi:10.1016/S0196-8858(02)00556-0 },
    pdf = {Kiessling2003.pdf},
    url = {http://www.ingentaconnect.com/content/els/01968858/2003/00000031/00000001/art00556},
    }
    @ARTICLE{Veynante2002,
    author = {Veynante, D. and Vervisch, L.},
    title = {{Turbulent combustion modeling}},
    journal = {Progress in Energy and Combustion Science},
    year = {2002},
    volume = {28},
    pages = {193-266(74)},
    month = March,
    doi = {doi:10.1016/S0360-1285(01)00017-X},
    pdf = {Veynante2002.pdf},
    url = {http://www.ingentaconnect.com/content/els/03601285/2002/00000028/00000003/art00017},
    }

    You must either use the correct macro for the month field, e.g. month = jul (and not july), or put the value in curly brackets, e.g. month = {March} (and not month = March) (see also https://tex.stackexchange.com/questions/70455/bibtex-month-format).

  3. Math symbols and operators must be put between $..$, e.g. the following will cause problems with latexmlpost:

    @INPROCEEDINGS{Norman1999,
    author = {Norman, M.~L. and Bryan G.~L.},
    title = {{Cosmological Adaptive Mesh Refinement^{CD}}},
    booktitle = {ASSL Vol. 240: Numerical Astrophysics},
    year = {1999},
    pages = {19-+},
    adsnote = {Provided by the NASA Astrophysics Data System},
    pdf = {Norman1999.pdf},
    url = {http://adsabs.harvard.edu/cgi-bin/nph-bib_query?bibcode=1999numa.conf...19N&db_ key=AST},
    }

    To fix this write Refinement$^{CD}$ in the title field.

  4. The ampersand symbol & can cause parsing errors:

    
    @ARTICLE{Shyy1997,
    author = {Shyy, W. and Krishnamurty, V.S.},
    title = {{Compressibility effects in modeling complex turbulent flows}},
    journal = {Progress in Aerospace Sciences},
    year = {1997},
    volume = {33},
    pages = {587-645(59)},
    abstract = {... In the present review, the
     compressibility effect is investigated in the context of engineering models
     needed for complex flow computations, particularly the k-&unknown;
     model. ...},
    doi = {doi:10.1016/S0376-0421(97)00005-5},
    pdf = {Shyy1997.pdf},
    url = {http://www.ingentaconnect.com/content/els/03760421/1997/00000033/00000009/art00005},
    }

Replacing `k-&unknown;` with the correct `k-$\epsilon$` in the `abstract` field was necessary to get rid of the parsing error. 
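
For reference, here is a sketch of how the six entries above might look with only the fixes from items 1 to 4 applied. Everything else is copied unchanged from the examples. The `@COMMENT` lines are annotations added here and are not part of the original file; that latexmlpost skips them the way bibtex does is an assumption, so drop them if they cause trouble.

```bibtex
@COMMENT{Item 1: the slaccitation field (bare percent signs) is removed.}
@ARTICLE{Bryan1997,
author = {Bryan, Greg L. and Norman, Michael L.},
title = {{A Hybrid AMR Application for Cosmology and Astrophysics}},
year = {1997},
eprint = {astro-ph/9710187},
pdf = {Bryan1997.pdf},
}

@COMMENT{Item 1: adsurl is renamed to url, a field the parser knows.}
@ARTICLE{Ensslin2006,
author = {Enßlin, T.~A. and Vogt, C.},
title = {{Magnetic turbulence in cool cores of galaxy clusters}},
journal = {A\&A},
year = {2006},
volume = {453},
pages = {447-458},
month = jul,
adsnote = {Provided by the SAO/NASA Astrophysics Data System},
url = {http://adsabs.harvard.edu/abs/2006A%26A...453..447E},
doi = {10.1051/0004-6361:20053518},
eprint = {arXiv:astro-ph/0505517},
keywords = {galaxies: cluster: general, cooling flows, magnetic
 fields, turbulence, X-rays: galaxies: clusters, intergalactic medium},
}

@COMMENT{Item 2: the predefined three-letter macro jul replaces july.}
@ARTICLE{Kiessling2003,
author = {Kiessling, M.K.-H.},
title = {{The ''Jeans swindle'' - A true story-mathematically speaking}},
journal = {Advances in Applied Mathematics},
year = {2003},
volume = {31},
pages = {132-149(18)},
month = jul,
doi = {doi:10.1016/S0196-8858(02)00556-0 },
pdf = {Kiessling2003.pdf},
url = {http://www.ingentaconnect.com/content/els/01968858/2003/00000031/00000001/art00556},
}

@COMMENT{Item 2: the month name is a braced string, not an undefined macro.}
@ARTICLE{Veynante2002,
author = {Veynante, D. and Vervisch, L.},
title = {{Turbulent combustion modeling}},
journal = {Progress in Energy and Combustion Science},
year = {2002},
volume = {28},
pages = {193-266(74)},
month = {March},
doi = {doi:10.1016/S0360-1285(01)00017-X},
pdf = {Veynante2002.pdf},
url = {http://www.ingentaconnect.com/content/els/03601285/2002/00000028/00000003/art00017},
}

@COMMENT{Item 3: the superscript in the title is wrapped in math mode.}
@INPROCEEDINGS{Norman1999,
author = {Norman, M.~L. and Bryan G.~L.},
title = {{Cosmological Adaptive Mesh Refinement$^{CD}$}},
booktitle = {ASSL Vol. 240: Numerical Astrophysics},
year = {1999},
pages = {19-+},
adsnote = {Provided by the NASA Astrophysics Data System},
pdf = {Norman1999.pdf},
url = {http://adsabs.harvard.edu/cgi-bin/nph-bib_query?bibcode=1999numa.conf...19N&db_ key=AST},
}

@COMMENT{Item 4: the stray entity in the abstract becomes a math-mode epsilon.}
@ARTICLE{Shyy1997,
author = {Shyy, W. and Krishnamurty, V.S.},
title = {{Compressibility effects in modeling complex turbulent flows}},
journal = {Progress in Aerospace Sciences},
year = {1997},
volume = {33},
pages = {587-645(59)},
abstract = {... In the present review, the
 compressibility effect is investigated in the context of engineering models
 needed for complex flow computations, particularly the k-$\epsilon$
 model. ...},
doi = {doi:10.1016/S0376-0421(97)00005-5},
pdf = {Shyy1997.pdf},
url = {http://www.ingentaconnect.com/content/els/03760421/1997/00000033/00000009/art00005},
}
```

Each entry changes only the single field named in its comment, so diffing against the originals shows which construct triggered each error.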

It would be nice if these issues could be fixed in `latexmlpost`, or if it at least gave a reasonable error message; see also #916.

brucemiller commented 6 years ago

Yeah, this is tricky. Firstly, bibtex is indeed forgiving in its design: it never actually processes any TeX/LaTeX. It only rearranges it, according to a bibliography style file (bst), in most cases dropping the data it's not interested in. LaTeXML attempts to process, convert and preserve all the data with the goal of producing an xml representation of the data which can (hopefully) be useful on its own.

Of course, this gets screwed up by not knowing enough about the type of data in each field: such as your adsurl being a form of url; and slaccitation being who-knows-what. LaTeXML's bibtex engine knows the types of the standard fields; perhaps it should process unknown ones as if they were verbatim? But that might be unexpected to some. We've done experiments reading the *.bst files, but even then the types are at best implicit in the style file. Hmm...

And of course, this is all made worse by the fact that LaTeXML rewrites the bib into a more TeX-like form before processing, and loses track of where the original source was, so the error messages become even more incomprehensible.

Undoubtedly this can all be improved, but needs some thought... and maybe some 'votes'...

brucemiller commented 6 years ago

To solve the 1st item, LaTeXML should probably treat any unknown fields as completely verbatim; I've made that patch. That avoids errors in this more common scenario, but will lead to less-than-optimal output if the user expected the field to be processed as markup. In the latter case, they can still declare the fields to fix it up.

I'm not sure what your point in items 2,3,4 is; LaTeXML does give errors for these cases, as latex would (after bibtex processing). I think this is just a dup of #916, that those errors are hidden if the bibliography is processed during postprocessing.