Add import checks - Githubissues

acl-org / acl-anthology

Data and software for building the ACL Anthology.

https://aclanthology.org

Apache License 2.0

424 stars, 283 forks

Add import checks #130

Open mjpost opened 5 years ago

mjpost commented 5 years ago

Ingestion is currently a largely manual and therefore error-prone process. It would be very helpful to have scripts that check for various common problems. Some of these could include:

* Extract the title from a PDF and ensure it lines up with the metadata in order to avoid problems like this one.
* Look for instances of missing LaTeX-protected capitalization
* Ensure the abstract is present
* Look for LaTeX artifacts
* Look for PDF cut-and-paste artifacts in the abstract (e.g., hyphenation)
* Look for names that are all lowercase (usually a mistake)
* (Many others)
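
A few of the checks listed above could be sketched roughly as follows. This is only an illustration, not the Anthology's actual ingestion code: the record layout (a dict with `abstract` and `authors` keys), the function name `check_paper`, and both regular expressions are assumptions made for the example, and the heuristics will produce false positives.

```python
import re

# Hypothetical heuristics; neither pattern comes from the Anthology codebase.
LATEX_ARTIFACT = re.compile(r"\\[a-zA-Z]+|[{}$]")   # leftover macros, braces, math mode
HYPHEN_BREAK = re.compile(r"[a-z]+- [a-z]+")        # e.g. "transla- tion" from a PDF paste


def check_paper(paper):
    """Return a list of warnings for one paper record (assumed to be a dict)."""
    warnings = []
    abstract = (paper.get("abstract") or "").strip()
    if not abstract:
        warnings.append("missing abstract")
    else:
        if LATEX_ARTIFACT.search(abstract):
            warnings.append("possible LaTeX artifact in abstract")
        if HYPHEN_BREAK.search(abstract):
            warnings.append("possible hyphenation artifact in abstract")
    for name in paper.get("authors", []):
        # str.islower() is False for names with no cased characters,
        # so names written in scripts without case are not flagged.
        if name.islower():
            warnings.append(f"author name is all lowercase: {name}")
    return warnings


if __name__ == "__main__":
    example = {
        "abstract": "We study transla- tion with \\textit{attention}.",
        "authors": ["john smith", "Jane Doe"],
    }
    for warning in check_paper(example):
        print(warning)
```

Returning warnings instead of raising keeps the check usable as a report over a whole volume, with a human deciding which hits are real problems.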

davidweichiang commented 5 years ago

I'm not sure if this belongs under #136 or here, but I managed to do a readable diff of the Anthology XML against the BibTeX files. (TIL about Python's excellent difflib module.)

Main takeaways:

* Most changes look right
* There are some issues with author names
* There are some places, but not that many, where some information has been lost
* There are plenty of places where fixed-case marking has been lost
* There are a number of files where the BibTeX reader died, so the diff couldn't be done
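
For readers unfamiliar with `difflib`, a diff of this kind might look something like the sketch below. It is not the script used for the comparison above: the function name `readable_diff` and the idea of flattening each entry into a field-to-value dict are assumptions for illustration, and parsing the XML and BibTeX files is left out entirely.

```python
import difflib


def readable_diff(xml_fields, bib_fields, label):
    """Return unified-diff lines between two field dicts for the same entry.

    Both arguments are assumed to be plain dicts mapping a field name
    (e.g. "title") to its string value.
    """
    old = [f"{key}: {value}" for key, value in sorted(xml_fields.items())]
    new = [f"{key}: {value}" for key, value in sorted(bib_fields.items())]
    return list(
        difflib.unified_diff(
            old, new,
            fromfile=f"{label}.xml", tofile=f"{label}.bib",
            lineterm="",
        )
    )


if __name__ == "__main__":
    xml_entry = {"author": "Jane Doe", "title": "Neural MT for BERT"}
    bib_entry = {"author": "Jane Doe", "title": "Neural MT for Bert"}
    print("\n".join(readable_diff(xml_entry, bib_entry, "example")))
```

Sorting the fields first keeps the output stable, so a change in field order alone does not show up as a difference.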

davidweichiang commented 5 years ago

Copying some frequent issues from the now-closed PR #140 so they don't get lost. These are nasty bugs in the current bib->xml conversion that need to be fixed.

* `\te` incorrectly maps to `ẹ` (e with underdot). This happens even when `\te` is not followed by space, so `\textit` gets mapped to `ẹxtit`.
* Strings like `Author{1}{Affiliation}` creep into the abstracts, starting around 2017.
* Many acute accents were incorrectly changed to graves, over `e`, `i`, and `A`.

Here's an especially ironic instance of the last issue: From pecher to pêcher... or pècher: Simplifying French Input by Accent Prediction
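
The `\te` problem is the kind of bug that plain prefix replacement produces. The sketch below reproduces it with a hypothetical one-entry macro table and shows one possible guard: only match a macro when it is not followed by another letter. The table and the function names `naive_convert` and `safe_convert` are invented for illustration; the actual conversion code is not shown in this thread.

```python
import re

# Hypothetical mapping, taken only from the behaviour described above.
MACROS = {r"\te": "ẹ"}


def naive_convert(text):
    """Replace macros by substring match, which also hits longer macro names."""
    for macro, char in MACROS.items():
        text = text.replace(macro, char)
    return text


def safe_convert(text):
    """Replace a macro only when the name is not followed by another letter."""
    for macro, char in MACROS.items():
        pattern = re.escape(macro) + r"(?![A-Za-z])"
        # A function replacement avoids backslash handling in the replacement string.
        text = re.sub(pattern, lambda match, c=char: c, text)
    return text


if __name__ == "__main__":
    sample = r"\textit{word}"
    print(naive_convert(sample))  # the macro name is corrupted
    print(safe_convert(sample))   # left untouched
```

A regression test built on inputs like `\textit{word}` would catch this class of bug before ingestion.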

danielgildea commented 5 years ago

* `\te` incorrectly maps to `ẹ` (e with underdot). This happens even when `\te` is not followed by space, so `\textit` gets mapped to `ẹxtit`.

This was fixed by commit 4744520.

* Strings like `Author{1}{Affiliation}` creep into the abstracts, starting around 2017.

Not sure how this happens - these are in the BibTeX files before conversion to XML.

* Many acute accents were incorrectly changed to graves, over `e`, `i`, and `A`.

The problem with A is fixed by the commit above. The problem with e and i I have not been able to reproduce.
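
One possible way to hunt for the remaining accent cases is to compare a string before and after conversion and report positions where the base letter is the same but the diacritic differs. The sketch below assumes the two strings line up character by character after NFC normalization. The function name `accent_changes` is hypothetical, and this is not an existing Anthology script.

```python
import unicodedata


def accent_changes(before, after):
    """Return (index, before_char, after_char) where only the diacritic differs.

    Assumes both strings have the same length once NFC-normalized; extra
    trailing characters in the longer string are ignored.
    """
    before = unicodedata.normalize("NFC", before)
    after = unicodedata.normalize("NFC", after)
    changes = []
    for index, (a, b) in enumerate(zip(before, after)):
        if a == b:
            continue
        # The first code point of the NFD form is the base letter.
        base_a = unicodedata.normalize("NFD", a)[0]
        base_b = unicodedata.normalize("NFD", b)[0]
        if base_a == base_b:
            changes.append((index, a, b))
    return changes


if __name__ == "__main__":
    for index, old, new in accent_changes("p\u00e9cher", "p\u00e8cher"):
        print(f"position {index}: {old!r} became {new!r}")
```

Run over title pairs from the BibTeX and the XML, a check like this would list every acute-to-grave swap without anyone having to reproduce the conversion bug first.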