NIME-conference / NIME-bibliography

BibTeX files with information about all publications from the annual International Conference on New Interfaces for Musical Expression (NIME)
http://nime-conference.github.io/NIME-bibliography/
GNU General Public License v3.0

harmonise function incorrectly adds LaTeX escaping to BibTeX fields #74

Open cpmpercussion opened 1 week ago

cpmpercussion commented 1 week ago

As observed by @stefanofasciani:

All capital letters in the title are wrapped in {}, which is something I have not found in 2023.bib and earlier, and all non-ASCII characters are replaced with LaTeX code (also something I have not found in previous .bib files). The harmonise function also mangles the URL: for example, {http://nime.org/proceedings/2024/nime2024_11.pdf} becomes {http://nime.org/proceedings/2024/nime2024\_11.pdf}, which is deadly for the Zenodo upload tool.

This is incorrect behaviour because the .bib file, although it is in BibTeX format, is also used to create other text representations of the papers (e.g., the individual NIME paper webpages and the Zenodo entries). So we need the text in the BibTeX fields to be a "plain" UTF-8 representation that could go into an HTML document or an API call, not something tuned to show up correctly in a LaTeX document.
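To make the requirement concrete, here is a minimal sketch of a check that flags fields that still look LaTeX-encoded. The function name `find_latex_residue` and the pattern list are hypothetical, not part of this repository, and the patterns only cover the symptoms reported above (escaped special characters, brace-wrapped capitals, accent macros).

```python
import re

# Hypothetical, non-exhaustive patterns for LaTeX residue in a field value.
LATEX_RESIDUE = [
    (re.compile(r"\\[_&%#$]"), "escaped special character"),
    (re.compile(r"\{[A-Z]\}"), "brace-wrapped capital letter"),
    (re.compile(r"\\['`^\"~]"), "LaTeX accent macro"),
]


def find_latex_residue(record):
    """Return (field, description) pairs for fields that look LaTeX-encoded.

    `record` is assumed to be a dict mapping field names to string values.
    """
    problems = []
    for field, value in record.items():
        for pattern, description in LATEX_RESIDUE:
            if pattern.search(value):
                problems.append((field, description))
    return problems
```

Something like this could run over every entry before the webpage or Zenodo step, so that an escaped URL is caught before it reaches the upload tool.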

The todo here is:

Ultimately we may want to move away from .bib files as a storage system, but they have the advantage of ubiquity within academic publishing. If the processes here break down at some point, future maintainers could easily use the .bib files in a different ad hoc system.

stefanofasciani commented 1 week ago

It seems that the harmonise function is doing what we are asking it to do.

The current version of the harmonise function uses BibTexParser at line 36 with customization=homogenize_latex_encoding. So the behaviour with respect to character encoding is what that setting asks for, although what happens to the title and the URL is still odd. Apparently the only built-in customizations BibTexParser offers for encoding are homogenize_latex_encoding and convert_to_unicode. If we use the latter, the strange behaviours disappear, and there are no apparent changes in the .bib file because the text is already Unicode.

So we either need to develop a 'custom' customization (possible?), or see whether migrating from BibTexParser 1.4 to 2.0 is a viable way to get UTF-8 output.
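On the 'custom' customization question: as far as I understand the BibTexParser 1.x API, a customization is just a function that receives each entry as a dict and returns it, so a custom one looks possible. The sketch below is one hypothetical shape for it. The name `plain_text_fields` and the choice of fields are assumptions for illustration, and it only reverses the two symptoms reported in this issue.

```python
import re

# Fields that should never carry LaTeX escaping (an assumption for this sketch).
VERBATIM_FIELDS = ("url", "doi")


def plain_text_fields(record):
    """Undo LaTeX-oriented escaping in one entry dict and return it."""
    for field in VERBATIM_FIELDS:
        if field in record:
            # Turn "\_" back into "_" (and likewise for %, & and #).
            record[field] = re.sub(r"\\([_%&#])", r"\1", record[field])
    if "title" in record:
        # Unwrap single capitals protected as "{A}"; longer groups are left alone.
        record["title"] = re.sub(r"\{([A-Z])\}", r"\1", record["title"])
    return record
```

Assuming the 1.x API, this would be attached with `parser.customization = plain_text_fields`, or called at the end of a wrapper function that first applies `convert_to_unicode`. If `convert_to_unicode` on its own already leaves the title and URL untouched, this extra step may not be needed at all.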