better error handling in bibmarkup.bibtex_to_html

kmccurley commented 8 months ago

Apparently some bibtex entries cause errors in bibmarkup.bibtex_to_html (@jwbos mentioned a reference "B97", but I can't find it). The problem is that BibTeX has no formally written grammar for bibtex files; the only definition of the format is the bibtex binary itself (which is written in WEB, and translated into even more unreadable C). As a result, the biber and bibtex binaries behave differently, and the situation for python parsers is even worse. We're using pybtex, and the first time it encounters an error it just gives up. Notably, this may happen even for legitimate bibtex files. The error reporting in pybtex is also weird, because it was designed to go to stdout, so you have to use

import pybtex.errors

with pybtex.errors.capture() as captured_errors:
    ...  # parse here; reported errors are collected in captured_errors

to capture the errors while it is running. As far as I can tell, it still quits at the first error it encounters.

One solution would be to do a pre-parse of the bibtex file and try to parse the entries one-by-one, but even that is fraught with peril because it's hard to recognize when you have hit the end or start of an entry.
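To make the pre-parse idea concrete, here is a rough sketch of splitting a file into entries by counting braces. It is only a heuristic under stated assumptions, not what bibmarkup does: the name split_entries is made up, it only handles entries delimited by braces (not parentheses), it ignores quoting and comments, and it treats @string, @comment and @preamble blocks like any other entry. When the braces of an entry never balance, it guesses that the entry stops at the next line that starts with @type{, which is exactly the kind of guess that makes this fraught.

```python
import re

# Start of an entry, such as "@article{" or "@inproceedings {".
ENTRY_START = re.compile(r"@\s*(\w+)\s*\{")
# A line that looks like the start of the next entry (used only as a fallback).
NEXT_ENTRY_AT_LINE_START = re.compile(r"^[ \t]*@\s*\w+\s*\{", re.MULTILINE)


def split_entries(text):
    """Return a list of (raw_entry, balanced) pairs found in text.

    balanced is False when the closing brace was never found and the end
    of the entry had to be guessed.
    """
    chunks = []
    pos = 0
    while True:
        match = ENTRY_START.search(text, pos)
        if match is None:
            return chunks
        start = match.start()
        depth = 0
        end = None
        # Count braces from the opening brace of the entry onwards.
        for i in range(match.end() - 1, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break
        if end is not None:
            chunks.append((text[start:end], True))
        else:
            # Braces never balanced: cut at the next line that starts an entry.
            nxt = NEXT_ENTRY_AT_LINE_START.search(text, match.end())
            end = nxt.start() if nxt else len(text)
            chunks.append((text[start:end], False))
        pos = end
```

Each chunk could then be handed to a parser on its own, so that a broken entry only costs us that entry instead of the whole file.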

We should try hard to improve bibmarkup_test.py and supply error cases. Perhaps we can try a project to process a large number of bibtex files looking for errors.

kmccurley commented 8 months ago

I found at least one example where bibtex and pybtex have different behavior:

@inproceedings{C:CLLZ21,
  authors    = {Andrea Coladangelo and
               Jiahui Liu and
               Qipeng Liu and
               Mark Zhandry},
  editor    = {Tal Malkin and
               Chris Peikert},
  title     = {Hidden Cosets and Applications to Unclonable Cryptography},
  booktitle = {Advances in Cryptology - {CRYPTO} 2021 - 41st Annual International
               Cryptology Conference, {CRYPTO} 2021, Virtual Event, August 16-20,
               2021, Proceedings, Part {I}},
  series    = {Lecture Notes in Computer Science},
  volume    = {12825},
  pages     = {556--584},
  publisher = {Springer},
  year      = {2021},
  url       = {https://doi.org/10.1007/978-3-030-84242-0\_20},
  doi       = {10.1007/978-3-030-84242-0\_20},
  timestamp = {Mon, 16 Aug 2021 09:08:14 +0200},
  biburl    = {https://dblp.org/rec/conf/crypto/ColadangeloLLZ21.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

This entry has authors = instead of author =. The bibtex binary produces a malformed entry without an author and issues the warning "Warning--empty author in C:CLLZ21", whereas we consider author to be a required field and complain about it. Other things, like a missing booktitle in @inproceedings, may produce the same problem. Note that we now have a bibtex log parser, so the warning message from bibtex shows up as a warning in our system. We could escalate these to errors, but it feels weird to have our system more restrictive than the bibtex binary itself. Also, Dan Bernstein pointed out that he was tempted to make everything a @misc entry to bypass any checking for fields. Some authors really don't give a damn about their references.
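As an illustration of what escalating would look like, here is a small sketch that picks "empty field" warnings out of a bibtex log and optionally promotes some of them to errors. This is not our actual log parser: the function name empty_field_warnings and the escalate parameter are made up for the example, and it only looks at lines of the form shown above.

```python
import re

# Matches lines such as "Warning--empty author in C:CLLZ21".
EMPTY_FIELD = re.compile(r"^Warning--empty (?P<field>.+?) in (?P<key>\S+)\s*$")


def empty_field_warnings(log_text, escalate=()):
    """Return (severity, key, field) tuples for "empty field" warnings.

    Fields listed in escalate are reported as "error"; all others stay
    "warning", which mirrors what the bibtex binary itself does.
    """
    findings = []
    for line in log_text.splitlines():
        match = EMPTY_FIELD.match(line)
        if match is None:
            continue
        field = match.group("field")
        severity = "error" if field in escalate else "warning"
        findings.append((severity, match.group("key"), field))
    return findings
```

With the default escalate=() nothing is promoted, so we would be no stricter than bibtex; passing escalate={"author"} would turn the case above into an error.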

kmccurley commented 8 months ago

I've experimented with several bibtex parsers using a set of 670 bibtex files that authors have uploaded.

pybtex has the advantage that it has satisfactory built-in HTML conversion. Unfortunately, when it encounters the first syntax error in the file, it crashes. It fails on 189 of the 670 bibtex files.
bibtexparser 1.4.1 (from pypi) has no HTML conversion, and it also crashes on syntax errors. It crashes on 90 of the files.
bibtexparser 2.0 (from github) is a new major version that does not crash. It is installed with python3 -m pip install bibtexparser --pre. It is able to parse all 670 files, but it omits several hundred blocks of the files when it encounters problems. Most of these seem to be duplicate keys or duplicate field keys. It has no HTML conversion.

It seems that we need to use one library for parsing and validating, and another for HTML generation. Overall it seems that the error handling from bibtexparser 2.0 is far superior, and we can use it to split the bibtex file into the different entries. We can then use pybtex to format those as HTML, and catch errors on individual entries.
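The "catch errors on individual entries" part could look roughly like the sketch below. It deliberately avoids calling either library directly: raw_entries stands in for the per-entry text we would get from the parser, and format_one is a hypothetical callable that would wrap the pybtex formatting of a single entry. Both names are assumptions for the example.

```python
def format_entries(raw_entries, format_one):
    """Format each raw entry on its own so one bad entry cannot abort the rest.

    format_one is any callable that takes the raw text of one entry and
    returns HTML, raising an exception on failure.
    """
    results = []
    for index, raw in enumerate(raw_entries):
        try:
            results.append({"index": index, "html": format_one(raw), "error": None})
        except Exception as exc:  # deliberately broad: parser errors vary
            message = f"{type(exc).__name__}: {exc}"
            results.append({"index": index, "html": None, "error": message})
    return results
```

The point of the design is that a failure is recorded against the entry that caused it, so the author can be shown which reference is broken while the remaining references still get formatted.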

jwbos commented 8 months ago

This seems like the proper way to proceed and will improve the user experience significantly!

kmccurley commented 8 months ago

It seems that the pybtex library for converting bibtex to html is riddled with bugs. Dan Bernstein has noted quite a few:

URLs with a tilde get the tilde destroyed. That can be solved in the bibtex by writing \%7E for ~, but authors should not have to.
The library seems to insert spaces in weird places in URLs. Things like https:// in the path of the URL (as archive.org has) don't work.
\url{} fails inside the note and howpublished fields.
\cite fails if used in a bibtex field, so you can't refer to another reference within a reference. This is rare in general, but cryptobib has a lot of these, such as annote = "Full version of \cite{FOCS:NaoRei95}",
There are a lot of fields that are ignored by the HTML converter. annote is a common one.

We are not the first to have encountered this problem. The pybtex formatting language is very poorly documented and as far as I can tell, nobody has built anything with it. Just trying to figure out how to change \url{https://foo.bar/} into <a href="https://foo.bar/">https://foo.bar/</a> was a challenge. Perhaps in the short term we should suppress the HTML formatting of references until we come up with a better solution.
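For reference, the \url{} conversion described above is easy to express outside of pybtex. The following is a minimal sketch using only the standard library; link_urls is a made-up name, and it assumes the URL contains no nested braces and no TeX escapes such as \_ or \%7E (those would need to be undone first).

```python
import html
import re

# \url{...} with no nested braces inside the argument.
URL_MACRO = re.compile(r"\\url\{([^{}]*)\}")


def link_urls(text):
    # Escape the text for HTML and turn each \url{...} into an <a> element.
    parts = []
    pos = 0
    for match in URL_MACRO.finditer(text):
        # Text between macros is escaped as ordinary HTML text.
        parts.append(html.escape(text[pos:match.start()]))
        url = html.escape(match.group(1).strip(), quote=True)
        parts.append(f'<a href="{url}">{url}</a>')
        pos = match.end()
    parts.append(html.escape(text[pos:]))
    return "".join(parts)
```

Because the URL is copied through untouched apart from HTML escaping, a tilde or an embedded https:// in the path survives as written.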

kmccurley commented 8 months ago

I found at least one problem that was orthogonal to pybtex, namely that bibexport will wrap a url field if it's too long, which was causing the archive URLs in the Bernstein paper to have spaces in them. By default bibexport uses a bibtex style called export.bst, and for some strange reason this wraps lines in the bibtex output. This is incredibly stupid, but TeX dates back to pre-web days, when URLs were only ever thought of as something to be formatted on paper - god forbid they should stick out into the margin. LaTeX has been dragged kicking and screaming into the web world.

Luckily, this can be fixed by using our own bst file, which bibexport accepts through the -b option:

bibexport -o main.bib -b wideexport.bst main.aux

(added the -b option to use a different bst file).

IACR / latex-submit

better error handling in bibmarkup.bibtex_to_html #63