Open kmccurley opened 8 months ago
I found at least one example where bibtex and pybtex have different behavior:
@inproceedings{C:CLLZ21,
authors = {Andrea Coladangelo and
Jiahui Liu and
Qipeng Liu and
Mark Zhandry},
editor = {Tal Malkin and
Chris Peikert},
title = {Hidden Cosets and Applications to Unclonable Cryptography},
booktitle = {Advances in Cryptology - {CRYPTO} 2021 - 41st Annual International
Cryptology Conference, {CRYPTO} 2021, Virtual Event, August 16-20,
2021, Proceedings, Part {I}},
series = {Lecture Notes in Computer Science},
volume = {12825},
pages = {556--584},
publisher = {Springer},
year = {2021},
url = {https://doi.org/10.1007/978-3-030-84242-0\_20},
doi = {10.1007/978-3-030-84242-0\_20},
timestamp = {Mon, 16 Aug 2021 09:08:14 +0200},
biburl = {https://dblp.org/rec/conf/crypto/ColadangeloLLZ21.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
This entry has `authors =` instead of `author =`. The `bibtex` binary produces a malformed entry without an author and issues a warning: `Warning--empty author in C:CLLZ21`, whereas we consider `author` to be a required field and complain about it. Other things like a missing `booktitle` in `@inproceedings` may produce the same problem. Note that we now have a bibtex log parser, so the warning message from `bibtex` shows up as a warning in our system. We could escalate these to errors, but it feels weird to have our system be more restrictive than the `bibtex` binary itself. Also, Dan Bernstein pointed out that he was tempted to make everything a `@misc` entry to bypass any checking for fields. Some authors really don't give a damn about their references.
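To make the required-field discussion concrete, here is a minimal sketch of the kind of check involved. The table covers only three entry types and follows the required fields of the standard BibTeX styles; the helper name `missing_required_fields` is made up for illustration and is not part of our code base.

```python
# Required fields per entry type, following the standard BibTeX styles.
# Only a few entry types are listed; this table is illustrative, not complete.
REQUIRED_FIELDS = {
    "article": ["author", "title", "journal", "year"],
    "inproceedings": ["author", "title", "booktitle", "year"],
    "misc": [],  # no required fields, which is why @misc bypasses every check
}


def missing_required_fields(entry_type, fields):
    """Return required field names that are absent or empty (hypothetical helper)."""
    required = REQUIRED_FIELDS.get(entry_type.lower(), [])
    present = {name.lower() for name, value in fields.items() if value.strip()}
    return [name for name in required if name not in present]


# A trimmed copy of the entry above, with the misspelled "authors" field.
entry_fields = {
    "authors": "Andrea Coladangelo and Jiahui Liu and Qipeng Liu and Mark Zhandry",
    "title": "Hidden Cosets and Applications to Unclonable Cryptography",
    "booktitle": "Advances in Cryptology - {CRYPTO} 2021",
    "year": "2021",
}
print(missing_required_fields("inproceedings", entry_fields))  # ['author']
```

Whether the result is reported as a warning or an error is a policy decision outside this function, which is exactly the question raised above.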
I've experimented with several bibtex parsers using a set of 670 bibtex files that authors have uploaded.
bibtexparser 2.0 (installed with `python3 -m pip install bibtexparser --pre`) is able to parse all 670 files, but it omits several hundred blocks of the files when it encounters problems. Most of these seem to be duplicate keys or duplicate field keys. It has no HTML conversion.
It seems that we need to use one library for parsing and validating, and another for HTML generation. Overall it seems that the error handling from bibtexparser 2.0 is far superior, and we can use it to split the bibtex file into separate entries. We can then use pybtex to format those as HTML, and catch errors on individual entries.
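The "catch errors on individual entries" part could look roughly like the sketch below. It deliberately does not call bibtexparser or pybtex: `format_entries` and `toy_formatter` are hypothetical names, and the formatter is passed in as a callable so that whatever HTML backend we end up with can be plugged in.

```python
def format_entries(entries, format_one):
    """Format entries one at a time, collecting failures instead of aborting.

    entries: iterable of (key, raw_bibtex_text) pairs from a splitting step.
    format_one: callable that turns one entry's text into HTML (an assumption;
                in practice it would wrap the real HTML formatter).
    """
    formatted, errors = {}, {}
    for key, raw in entries:
        try:
            formatted[key] = format_one(raw)
        except Exception as exc:  # one bad entry should not take down the rest
            errors[key] = f"{type(exc).__name__}: {exc}"
    return formatted, errors


def toy_formatter(raw):
    """Stand-in formatter that rejects entries without a title field."""
    if "title" not in raw:
        raise ValueError("missing title")
    return "<li>" + raw.strip() + "</li>"


entries = [("a", "@misc{a, title={X}}"), ("b", "@misc{b, year={2021}}")]
formatted, errors = format_entries(entries, toy_formatter)
print(sorted(formatted))  # ['a']
print(errors)             # {'b': 'ValueError: missing title'}
```

The point of the design is that the error dictionary is keyed by entry, so the author can be told which reference failed rather than just that the file failed.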
This seems like the proper way to proceed and will improve the user experience significantly!
It seems that the pybtex library for converting bibtex to html is riddled with bugs. Dan Bernstein has noted quite a few:
Problems with the `note` and `howpublished` fields.
Problems with fields such as `annote = "Full version of \cite{FOCS:NaoRei95}",` where `annote` is a common one.
`\cite` is not handled within a bibtex field, so you can't refer to another reference within a reference. There are a lot of those in cryptobib.
We are not the first to have encountered this problem. The pybtex formatting language is very poorly documented and as far as I can tell, nobody has built anything with it. Just trying to figure out how to change `\url{https://foo.bar/}` into `<a href="https://foo.bar/">https://foo.bar/</a>` was a challenge. Perhaps in the short term we should suppress the HTML formatting of references until we come up with a better solution.
I found at least one problem that was orthogonal to `pybtex`, namely that `bibexport` will wrap a url field if it's too long, which was causing the archive URLs in the Bernstein paper to have spaces in them. By default `bibexport` uses a bibtex style called `export.bst`, and for some strange reason this wraps lines in the bibtex output. This is incredibly stupid, but TeX dates back to pre-web days and URLs were always thought of as something to be formatted on paper - god forbid they should stick out into the margin. LaTeX has been dragged kicking and screaming into the web world.
Luckily, this can be fixed by using our own bst file. We can invoke it with
bibexport -o main.bib -b wideexport.bst main.aux
(added the `-b` option to use a different bst file).
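To confirm that an exported file no longer has wrapped URLs, a rough check is to look for whitespace inside `url` fields. The sketch below is an assumption about how such a check could be written, not something in our code base: the regex and the name `wrapped_urls` are made up, and values containing nested braces are not handled.

```python
import re

# Matches url = {...} or url = "..." (no nested braces). The \b keeps it
# from matching inside other field names such as biburl.
URL_FIELD = re.compile(r'\burl\s*=\s*(?:\{([^{}]*)\}|"([^"]*)")', re.IGNORECASE)


def wrapped_urls(bibtex_text):
    """Return url field values that contain whitespace, e.g. from line wrapping."""
    bad = []
    for match in URL_FIELD.finditer(bibtex_text):
        value = match.group(1) if match.group(1) is not None else match.group(2)
        if re.search(r"\s", value.strip()):
            bad.append(value)
    return bad


sample = "url = {https://example.org/a/very/long/\n    path},\nnote = {fine}"
print(len(wrapped_urls(sample)))  # 1
```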
Apparently some bibtex entries cause errors in `bibmarkup.bibtex_to_html` (@jwbos made reference to a reference "B97" but I can't find it). The problem here is that there is no formally written grammar for bibtex files; the only definition is the `bibtex` binary itself (which is written in WEB, and translated into even more unreadable C). It turns out that the `biber` and `bibtex` binaries have different behaviors because of this, and the situation for python parsers is even worse. We're using pybtex, and as far as I can tell the first time it encounters an error it just gives up. Notably, this may happen even for legitimate bibtex files. The error reporting in pybtex is also weird, because it was designed to go to stdout, so you have to capture the errors while it is running.
One solution would be to do a pre-parse of the bibtex file and try to parse the entries one-by-one, but even that is fraught with peril because it's hard to recognize when you have hit the end or start of an entry.
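A minimal sketch of such a pre-parse, counting braces to find where each entry ends. It is a heuristic with known gaps, all of them my own simplifications: entries delimited by parentheses are not recognized, `@string`, `@comment` and `@preamble` blocks are returned like any other entry, and an `@` followed by letters and `{` inside a field would be mistaken for an entry start. The name `split_entries` is made up.

```python
import re

ENTRY_START = re.compile(r"@\s*([A-Za-z]+)\s*\{")
LINE_START_AT = re.compile(r"^[ \t]*@", re.MULTILINE)


def split_entries(text):
    """Split BibTeX source into (entry_type, raw_text) chunks by brace depth.

    Returns (entries, leftovers); leftovers holds chunks whose braces never
    balanced, cut off at the next line that begins with '@'.
    """
    entries, leftovers = [], []
    pos = 0
    while True:
        match = ENTRY_START.search(text, pos)
        if match is None:
            break
        depth = 0
        end = None
        # Start counting at the opening brace that ENTRY_START consumed.
        for i in range(match.end() - 1, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break
        if end is not None:
            entries.append((match.group(1).lower(), text[match.start():end]))
            pos = end
        else:
            # Unbalanced entry: give up on it and resume at the next '@' line.
            nxt = LINE_START_AT.search(text, match.end())
            stop = nxt.start() if nxt else len(text)
            leftovers.append(text[match.start():stop])
            pos = stop
    return entries, leftovers


sample = "@misc{a, title={X}\n@misc{b, title={Y}}\n@article{c, note={uses {nested} braces}}\n"
entries, leftovers = split_entries(sample)
print([kind for kind, _ in entries])  # ['misc', 'article']
print(leftovers)                      # ['@misc{a, title={X}\n']
```

In the sample, entry `a` is missing its closing brace, so it lands in `leftovers` while `b` and `c` are still recovered. The "resume at the next line starting with `@`" rule is exactly the fragile part mentioned above.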
We should try hard to improve `bibmarkup_test.py` and supply error cases. Perhaps we can try a project to process a large number of bibtex files looking for errors.