acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Add import checks #130

Open mjpost opened 5 years ago

mjpost commented 5 years ago

Ingestion is currently a largely manual and therefore error-prone process. It would be very helpful to have scripts that check for various common problems. Some of these could include:

davidweichiang commented 5 years ago

I'm not sure if this belongs under #136 or here, but I managed to do a readable diff of the Anthology XML against the BibTeX files. (TIL about Python's excellent difflib module.)
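For anyone wanting to try the same kind of comparison, here is a minimal sketch of a field-level diff with `difflib.unified_diff`. It is not the script used for the diff mentioned above: the `readable_diff` helper, the dict-per-record layout, and the sample records are all assumptions made for illustration.

```python
import difflib


def readable_diff(xml_fields, bib_fields, label="record"):
    """Return unified-diff lines between two field -> value dicts.

    Hypothetical helper: it assumes each record was already parsed into a
    plain dict, which is not necessarily how the Anthology tooling stores it.
    """
    def as_lines(fields):
        # One "key: value" line per field, sorted so field order is irrelevant.
        return [f"{key}: {fields[key]}" for key in sorted(fields)]

    return list(difflib.unified_diff(
        as_lines(xml_fields),
        as_lines(bib_fields),
        fromfile=f"{label} (xml)",
        tofile=f"{label} (bib)",
        lineterm="",
    ))


# Made-up records that differ only in one accent.
xml_record = {"title": "An Example Title", "author": "Jane Doe"}
bib_record = {"title": "An Example Title", "author": "Jané Doe"}

for line in readable_diff(xml_record, bib_record):
    print(line)
```

Sorting the fields before diffing keeps the output limited to real content differences rather than differences in field order.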

Main takeaways:

davidweichiang commented 5 years ago

Copying some frequent issues from the now-closed PR #140 so they don't get lost. These are nasty bugs in the current bib->xml conversion that need to be fixed.

Here's an especially ironic instance of the last issue: From pecher to pêcher... or pècher: Simplifying French Input by Accent Prediction

danielgildea commented 5 years ago

* `\te` incorrectly maps to `ẹ` (e with underdot). This happens even when `\te` is not followed by space, so `\textit` gets mapped to `ẹxtit`.

This was fixed by commit [4744520].
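To illustrate the failure mode, here is a small sketch of how plain substring replacement produces `ẹxtit`, and one possible way to avoid it. It is not necessarily what the commit does: `COMMAND_MAP`, `naive_convert`, and `bounded_convert` are hypothetical names, and the one-entry table is taken only from the mapping described in the bullet above. The guard relies on the LaTeX rule that a control word ends at the first non-letter.

```python
import re

# Hypothetical one-entry table, based only on the mapping described above.
COMMAND_MAP = {r"\te": "ẹ"}


def naive_convert(text):
    # Buggy: plain substring replacement also fires inside longer commands.
    for command, char in COMMAND_MAP.items():
        text = text.replace(command, char)
    return text


def bounded_convert(text):
    # A LaTeX control word ends at the first non-letter, so only match the
    # command when it is not followed by another ASCII letter.
    for command, char in COMMAND_MAP.items():
        pattern = re.escape(command) + r"(?![A-Za-z])"
        # A function replacement avoids backslash-escape handling in `char`.
        text = re.sub(pattern, lambda _match, c=char: c, text)
    return text


print(naive_convert(r"\textit{word}"))    # ẹxtit{word}
print(bounded_convert(r"\textit{word}"))  # \textit{word}
```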

* Strings like `Author{1}{Affiliation}` creep into the abstracts, starting around 2017.

Not sure how this happens: these strings are already in the BibTeX files before conversion to XML.

* Many acute accents were incorrectly changed to graves, over `e`, `i`, and `A`.

The problem with A is fixed by the commit above. The problem with e and i I have not been able to reproduce.
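In the spirit of the import checks requested at the top of this issue, here is a rough sketch of a scan for the first two problems. The pattern table and the `check_text` helper are hypothetical and not part of the Anthology scripts; the patterns are guesses based only on the examples quoted in this thread.

```python
import re
import unicodedata

# Hypothetical patterns, derived only from the examples quoted in this thread.
CHECKS = {
    # Template placeholders such as Author{1}{Affiliation} left in abstracts.
    "template residue": re.compile(r"Author\{\d+\}\{[^}]*\}"),
    # What \textit, \textbf, etc. look like after the bad \te mapping.
    "broken LaTeX command": re.compile(r"ẹxt(?:it|bf|sc|tt)"),
    # Any control word that survived conversion.
    "unconverted LaTeX command": re.compile(r"\\[A-Za-z]+"),
}


def check_text(text):
    """Return a sorted list of the check names that fire on `text`."""
    # Normalize so a decomposed e + combining underdot still matches "ẹ".
    text = unicodedata.normalize("NFC", text)
    return sorted(name for name, pattern in CHECKS.items() if pattern.search(text))


print(check_text("We thank Author{1}{Affiliation} for comments."))
print(check_text("An ẹxtit{emphasised} word."))
print(check_text("A clean abstract."))
```

The acute-versus-grave problem cannot be caught by a pattern like these, since both accents are valid; it would need a comparison against the source BibTeX.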