Synchronize XML import pipelines

davidweichiang commented 5 years ago

Currently there are multiple pipelines that generate Anthology XML:

- ACLPUB converts from START metadata to BibTeX, and anthologize converts from BibTeX to XML.
- easy2acl
- tacl_cl_parser.py

These tools don't generate XML, but operate on existing XML:
- normalize_xml.py was used to update all the old XML files to new standards; it shares code with anthologize but is a little bit behind.
- protect.py guesses where <fixed-case> tags should go.
- auto_name_variants.py guesses name variants.

How should we avoid code duplication between all of these, and ensure that they all produce the same result?

One possibility I can think of is for the {START, EasyChair, TACL/CL}-to-Anthology tools to generate XML that validates, but not be concerned about details of the element contents, and make normalize_xml.py do the rest:

- Convert TeX to Unicode and <i>, <b>, <url>, <fixed-case>, <tex-math>
- Convert straight to curly quotes
- Guess missing <fixed-case> tags
- Guess name variants or author ids?
- Split names into first and last if missing?

It would be nice if this normalization were idempotent. Unfortunately it's impossible to make the TeX-to-Unicode conversion idempotent, but the other steps could be.
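To make the idempotency point concrete, here is a minimal sketch in Python. Neither function is taken from normalize_xml.py or anthologize: `curl_quotes` and `toy_tex_to_unicode` are hypothetical stand-ins. The quote step can safely be run twice, while even a toy TeX step cannot, because its output (a literal `%`) is read as TeX again on the second pass.

```python
import re


def curl_quotes(text: str) -> str:
    """Turn straight double quotes into curly ones (hypothetical helper).

    Already-curly quotes are left alone, so a second pass changes nothing.
    """
    # A straight quote at the start, or after whitespace or an opening bracket, opens.
    text = re.sub(r'(^|[\s(\[{])"', r"\1“", text)
    # Any straight quote that is left must be a closing quote.
    return text.replace('"', "”")


def toy_tex_to_unicode(text: str) -> str:
    """Toy TeX step (an assumption, not the real converter).

    Drops unescaped % comments, then unescapes \\% to a literal %.
    """
    text = re.sub(r"(?<!\\)%.*", "", text)
    return text.replace(r"\%", "%")


def is_idempotent(step, samples) -> bool:
    """Check that applying step twice gives the same result as applying it once."""
    return all(step(step(s)) == step(s) for s in samples)
```

With these definitions, `is_idempotent(curl_quotes, ['He said "hello"'])` holds, but `is_idempotent(toy_tex_to_unicode, [r"50\% of cases"])` does not: the first pass yields `50% of cases`, and the second pass treats the `%` as a comment and drops the rest of the line.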

davidweichiang commented 5 years ago

So perhaps I should start by unifying normalize_anth.py and protect.py and adding a command-line switch to enable the non-idempotent parts.
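A rough sketch of what such a switch could look like with `argparse`. The flag name `--convert-tex`, the step names, and both functions are made up for illustration; none of them are existing options of normalize_anth.py or protect.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface for a hypothetical unified normalizer."""
    parser = argparse.ArgumentParser(description="Normalize Anthology XML (sketch).")
    parser.add_argument("infile", help="XML file to read")
    parser.add_argument("outfile", help="XML file to write")
    # Off by default, so re-running the script on normalized XML stays safe.
    parser.add_argument(
        "--convert-tex",
        action="store_true",
        help="also run the non-idempotent TeX-to-Unicode step (hypothetical flag)",
    )
    return parser


def select_steps(args: argparse.Namespace) -> list:
    """Return the names of the steps to run, idempotent ones always included."""
    steps = ["curly-quotes", "fixed-case"]
    if args.convert_tex:
        # TeX conversion goes first so the later steps see Unicode text.
        steps.insert(0, "tex-to-unicode")
    return steps
```

Freshly imported files would be run once with `--convert-tex`; existing XML would be run without it, as often as needed.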

mjpost commented 5 years ago

> One possibility I can think of is for the {START, EasyChair, TACL/CL}-to-Anthology tools to generate XML that validates, but not be concerned about details of the element contents, and make normalize_xml.py do the rest:

I think this is exactly right. The *-to-Anthology tools should effectively do the minimal conversion from BibTeX into our XML, and then we can normalize.
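As a sketch of how small that minimal conversion could be, the function below copies already-parsed BibTeX fields into XML elements and does nothing else. The input dict shape and the function name `entry_to_xml` are assumptions for illustration, and the element names are simplified, so the output is not guaranteed to match the real Anthology schema.

```python
import xml.etree.ElementTree as ET


def entry_to_xml(entry: dict) -> ET.Element:
    """Copy parsed BibTeX fields into a <paper> element without normalizing them."""
    paper = ET.Element("paper")
    title = ET.SubElement(paper, "title")
    # The title keeps its raw TeX; converting it is left to the normalizer.
    title.text = entry["title"]
    for name in entry.get("authors", []):
        author = ET.SubElement(paper, "author")
        # Names are not split into first and last here; that is also left for later.
        author.text = name
    return paper
```

For example, `ET.tostring(entry_to_xml({"title": "A Study of {BERT}", "authors": ["Doe, Jane"]}), encoding="unicode")` keeps the braces in the title as they are, so all TeX handling, quote conversion, and name splitting happens in one place afterwards.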

Consolidating normalize_anth.py and protect.py would be a great start, and I also like the idea of using command-line switches to distinguish the non-idempotent parts.

mjpost commented 5 years ago

Note that I am also in the process of merging easy2acl into ACLPUB.

acl-org / acl-anthology

Synchronize XML import pipelines #293