acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0
431 stars 288 forks

Unmatched curly braces in certain entries #2505

Closed dowobeha closed 1 year ago

dowobeha commented 1 year ago

Issue description

When loading the entire anthology+abstracts.bib into JabRef, JabRef fails to load the complete file. Further investigation shows that the failure appears to be triggered by entries containing mismatched curly braces.

Steps to reproduce the issue

  1. Download anthology+abstracts.bib
  2. Open anthology+abstracts.bib in JabRef

What's the expected result?

The entire BibTeX file would be loaded.

What's the actual result?

Only the entries before the first entry with mismatched curly braces are loaded.

Additional details / screenshot

The first example in the download (as of 1 May 2023) is in the abstract of afanasev-2023-use, which contains the following:

... accuracy of multilingual models by 3 to 15{{\%}. 

Ideally, the code that generates this file should check for mismatched curly braces when producing the BibTeX output.
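Such a check could be as simple as comparing brace counts per entry. A minimal sketch (the function name and the naive entry-splitting are my own, not part of the Anthology build code):

```python
import re

def find_unbalanced_entries(bibtex: str) -> list[str]:
    """Return the citation keys of entries whose curly braces do not balance.

    Naively splits on lines that start with '@', which is good enough
    for the machine-generated Anthology export but not for arbitrary
    hand-written BibTeX (e.g. '@' inside string values).
    """
    bad_keys = []
    for chunk in re.split(r"^(?=@)", bibtex, flags=re.MULTILINE):
        if not chunk.startswith("@"):
            continue  # preamble text before the first entry
        match = re.match(r"@\w+\{([^,\s]+),", chunk)
        key = match.group(1) if match else "<unknown>"
        if chunk.count("{") != chunk.count("}"):
            bad_keys.append(key)
    return bad_keys
```

Running this over the export before publishing it would catch entries like the one above, since an escaped abstract should always have balanced braces.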

mbollmann commented 1 year ago

So in this particular case, the opening curly brace is clearly a typo in the original abstract and could just be removed.

More generally though, I found that the latexcodec module we're using explicitly acknowledges not converting curly braces as a bug:

https://github.com/acl-org/acl-anthology/blob/cc1e6b15c9b31b0de0fd9d2f1ffed70858b08cc3/bin/anthology/latexcodec.py#L34-L37

What I don't quite understand is the why, though. I tried adding the codepoints for curly braces to the module and it seemed to work fine. @danielgildea @davidweichiang , does one of you remember the reason for this module ignoring curly braces?

mbollmann commented 1 year ago

There appears to be more to this than a typo; we seem to have a lot of LaTeX commands in abstracts where the backslash has been replaced by an opening curly brace.

For example, the abstract of 2022.gebnlp-1.5 (emphasis mine):

People frequently interact with information retrieval (IR) systems, however, IR models exhibit biases and discrimination towards various demographics. The in-processing fair ranking methods provides a trade-offs between accuracy and fairness through adding a fairness-related regularization term in the loss function. However, there haven’t been intuitive objective functions that depend on the click probability and user engagement to directly optimize towards this.In this work, we propose the {textbf{I}n-{textbf{B}atch {textbf{B}alancing {textbf{R}egularization (IBBR) to mitigate the ranking disparity among subgroups. In particular, we develop a differentiable {textbf{normed Pairwise Ranking Fairness} (nPRF) and leverage the T-statistics on top of nPRF over subgroups as a regularization to improve fairness. Empirical results with the BERT-based neural rankers on the MS MARCO Passage Retrieval dataset with the human-annotated non-gendered queries benchmark {cite{rekabsaz2020neural} show that our {ibbr{} method with nPRF achieves significantly less bias with minimal degradation in ranking performance compared with the baseline.

Running ack -c -l "{text" to find affected abstracts suggests that this started to show up sometime in 2022:

xml/2022.case.xml:1
xml/2022.gebnlp.xml:1
xml/2022.nllp.xml:1
xml/2022.tsar.xml:1
xml/2022.wmt.xml:4
xml/2023.bsnlp.xml:1
xml/2023.eacl.xml:3
xml/2023.findings.xml:1

I wonder if there's a bug in our code or in ACLPUB2 or whatever package was used to produce these. @mjpost @anthology-assist Is there a way to check how these affected proceedings were produced?
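The pattern in these mangled abstracts is quite regular: a known LaTeX command name preceded by { where a backslash should be. A rough detection heuristic (the command list and function name are illustrative, not from the Anthology codebase):

```python
import re

# Commands observed in the affected abstracts; extend as needed.
KNOWN_COMMANDS = ("textbf", "textit", "emph", "cite", "mathcal")

# A '{' immediately followed by a command name suggests a backslash
# that was mangled into an opening brace ('{textbf' vs. '\textbf').
PATTERN = re.compile(r"\{(" + "|".join(KNOWN_COMMANDS) + r")\b")

def suspicious_spans(text: str) -> list[str]:
    """Return every '{command' fragment found in text."""
    return ["{" + m.group(1) for m in PATTERN.finditer(text)]
```

Note that legitimate markup like {\textbf{...}} is not flagged, because the backslash sits between the brace and the command name.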

nschneid commented 1 year ago

Just encountered this for paper https://aclanthology.org/2023.eacl-main.76/ - Zotero import fails due to {\$}{mathcal{V}{\$}-information in the abstract. Another case where a TeX backslash has been replaced with {.

danielgildea commented 1 year ago

I see that I tried to add textbf to texmath.py in 2022. I think that must be the source of the problem, sorry. Looking at the code, however, I'm not exactly sure why it is producing unmatched braces.

mbollmann commented 1 year ago

But texmath.py takes the input from the XML and converts it for the website. The unmatched braces are already in the input XML, so it can't be that.

danielgildea commented 1 year ago

> But texmath.py takes the input from the XML and converts it for the website. The unmatched braces are already in the input XML, so it can't be that.

oh, good point. Could someone post the source files for the EACL 2023 ingestion?

danielgildea commented 1 year ago

I think this is fixed now, can you please test?

danielgildea commented 1 year ago

Tested that the bibtex works with JabRef. Fixed by #2501