Open mjpost opened 5 years ago
I'm not sure if this belongs under #136 or here, but I managed to do a readable diff of the Anthology XML against the BibTeX files. (TIL about Python's excellent `difflib` module.)
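For anyone wanting to try something similar, here is a minimal sketch of a `difflib`-based comparison. It is not the script used for the diff above: the function name `readable_diff` and the idea of comparing two already-extracted lists of strings (say, titles from the XML and from the BibTeX) are assumptions made for illustration.

```python
import difflib


def readable_diff(xml_lines, bib_lines):
    """Return unified-diff lines between two lists of strings.

    Hypothetical helper: both inputs are assumed to be lists of
    comparable strings, extracted beforehand, in the same order.
    """
    return list(
        difflib.unified_diff(
            xml_lines,
            bib_lines,
            fromfile="xml",
            tofile="bib",
            lineterm="",  # inputs carry no trailing newlines
        )
    )


if __name__ == "__main__":
    xml_titles = ["A Study of Parsing", "From pecher to pècher"]
    bib_titles = ["A Study of Parsing", "From pecher to pêcher"]
    for line in readable_diff(xml_titles, bib_titles):
        print(line)
```

Identical inputs produce an empty list, so only entries that actually differ show up in the output.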
Main takeaways:
Copying some frequent issues from the now-closed PR #140 so they don't get lost. These are nasty bugs in the current bib->xml conversion that need to be fixed.
* `\te` incorrectly maps to `ẹ` (e with underdot). This happens even when `\te` is not followed by space, so `\textit` gets mapped to `ẹxtit`.
* Strings like `Author{1}{Affiliation}` creep into the abstracts, starting around 2017.
* Many acute accents were incorrectly changed to graves, over `e`, `i`, and `A`.

Here's an especially ironic instance of the last issue: From pecher to pêcher... or pècher: Simplifying French Input by Accent Prediction
* `\te` incorrectly maps to `ẹ` (e with underdot). This happens even when `\te` is not followed by space, so `\textit` gets mapped to `ẹxtit`.
This was fixed by commit [4744520]
* Strings like `Author{1}{Affiliation}` creep into the abstracts, starting around 2017.
Not sure how this happens; these strings are already in the BibTeX files before conversion to XML.
* Many acute accents were incorrectly changed to graves, over `e`, `i`, and `A`.
The problem with `A` is fixed by the commit above. The problem with `e` and `i` I have not been able to reproduce.
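The three issues listed above could be flagged automatically with something like the following. This is only a sketch under stated assumptions: the function name `check_text`, the issue labels, and the exact patterns are hypothetical and not part of the existing conversion code. Since `ẹ` and grave accents also occur legitimately (in names and in French text, for example), a hit means "needs a human look", not "definitely wrong".

```python
import re
import unicodedata

# Hypothetical patterns, one per issue reported in this thread.
UNDERDOT_E = re.compile("\u1eb9")  # e with underdot, left behind when \te is mis-converted
TEMPLATE_RESIDUE = re.compile(r"Author\{\d+\}\{Affiliation\}")  # template strings in abstracts
GRAVE_ACCENT = re.compile("[\u00e8\u00ec\u00c0]")  # grave over e, i, A: possibly a mangled acute


def check_text(text):
    """Return a list of labels for suspicious patterns found in `text`."""
    # Normalize first so that "e + combining mark" matches the precomposed form.
    text = unicodedata.normalize("NFC", text)
    problems = []
    if UNDERDOT_E.search(text):
        problems.append("underdot-e")
    if TEMPLATE_RESIDUE.search(text):
        problems.append("template-residue")
    if GRAVE_ACCENT.search(text):
        problems.append("grave-accent")
    return problems
```

Running this over every title and abstract after conversion would give a short list of entries to review by hand.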
Ingestion is currently a largely manual and therefore error-prone process. It would be very helpful to have scripts that check for various common problems. Some of these could include: