We should do sanity checks on uploaded bibtex files.

IACR / latex-submit

Web server to receive uploaded LaTeX and execute it in a docker container.

GNU Affero General Public License v3.0

We should do sanity checks on uploaded bibtex files. #23

Closed: kmccurley closed this issue 11 months ago

kmccurley commented 1 year ago

If something is uploaded for iacrcc.cls, then there is a check that runs on the metadata of citations to make sure that everything has a DOI. For other document classes, we could run a bibtex parser and check the quality of references there. Since the bibtex file may contain unused references, we should extract things like \citation{galindo2021fully} from the main.aux file and check only those references.
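A minimal sketch of that extraction step, assuming main.aux contains plain `\citation{...}` lines as written by a bibtex run. The function name `cited_keys` is hypothetical and not taken from the repository:

```python
import re

# Matches \citation{key} and \citation{key1,key2} as written to the .aux file.
# \bibcite{...} lines are not matched because the pattern requires "\citation".
CITATION_RE = re.compile(r"\\citation\{([^}]*)\}")


def cited_keys(aux_text: str) -> list[str]:
    """Return the cited keys in first-seen order, without duplicates."""
    seen: dict[str, None] = {}
    for match in CITATION_RE.finditer(aux_text):
        for key in match.group(1).split(","):
            key = key.strip()
            # \nocite{*} shows up as \citation{*}; it names no single entry.
            if key and key != "*":
                seen.setdefault(key, None)
    return list(seen)
```

The resulting list could then be used to filter the parsed bibtex entries, so that unused references in the uploaded file are never checked.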

kmccurley commented 1 year ago

Every time I try to do something with BibTeX I am reminded of what it is like to work with stone tools. It turns out that there is no formal grammar for the BibTeX file format, and the only definition is in the code. Many people have tried to write bibtex parsers, with varying degrees of success.

bibpy is one attempt. It mangles some non-ASCII characters.
pybtex looks promising, and it's what cryptobib uses. It also has problems.
biblib (note: not the one from PyPI) claims to be the only one that implements the correct grammar defined in the bibtex binary. It no longer works in Python 3.10 and has not been updated in ten years.
bibtexparser is at least maintained, but it appears to also have problems.

This reminds me why we didn't try to parse bibtex directly. How do you parse a format that is described only by a binary?

kmccurley commented 1 year ago

Note: bibtexparser will not parse cryptobib because it fails with @string{acisped = ""}

kmccurley commented 1 year ago

The iacrcc.cls style has switched to using alphaurl bibliography style and iacrcc.bst is being dropped. As a result, we no longer generate bibliographic references in an ad-hoc format from the iacrcc.bst style. This means that the citations element of metadata/Compilation:Meta is no longer needed, and we can instead just capture the bibtex references that are being used. This was mentioned in this issue where it was suggested that we can use either bibexport or pybtex to extract the original bibtex entries.

Bibliographic entries need to be converted into other formats:

HTML for the web pages.
XML for crossref and/or JATS.
XML for XMP when we use an extended schema.

There will undoubtedly be issues in converting to these, but I think it's best to just store the original BibTeX entries and solve problems in converting them to other formats.

kmccurley commented 11 months ago

The code now uses a combination of bibexport and pybtex to extract and check the bibtex entries in webapp/metadata/meta_parse.py. It also uses pybtex to convert the references to both JATS and crossref format.

The extraction of bibtex entries is tricky when the author uses biblatex because bibexport only supports bibtex output. I fake it by parsing the main.bcf file (it's XML) and creating a fake main.aux to parse with bibexport.
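A rough sketch of the idea, not the code in meta_parse.py. It assumes the .bcf file lists cited keys in `citekey` elements and bibliography files in `datasource` elements, and it ignores the XML namespace so that it does not depend on the exact namespace URI. The function name `bcf_to_aux` and the default `alpha` style are hypothetical:

```python
import xml.etree.ElementTree as ET


def _local(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def bcf_to_aux(bcf_xml: str, bibstyle: str = "alpha") -> str:
    """Build the text of a minimal .aux file from the contents of a .bcf file."""
    root = ET.fromstring(bcf_xml)
    keys: list[str] = []
    sources: list[str] = []
    for elem in root.iter():
        text = (elem.text or "").strip()
        if not text:
            continue
        name = _local(elem.tag)
        if name == "citekey" and text not in keys:
            keys.append(text)  # assumed: one cited key per element
        elif name == "datasource" and text not in sources:
            sources.append(text)  # assumed: path of a .bib file
    lines = ["\\relax"]
    lines += [f"\\citation{{{key}}}" for key in keys]
    lines.append(f"\\bibstyle{{{bibstyle}}}")
    # \bibdata takes comma-separated file names without the .bib extension.
    bibdata = ",".join(s[:-4] if s.endswith(".bib") else s for s in sources)
    lines.append(f"\\bibdata{{{bibdata}}}")
    return "\n".join(lines) + "\n"
```

The returned text would be written to a fake main.aux and handed to bibexport. Special cases such as `\nocite{*}` or non-file data sources are not handled here.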