acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Synchronize XML import pipelines #293

Open davidweichiang opened 5 years ago

davidweichiang commented 5 years ago

Currently there are multiple pipelines that generate Anthology XML:

How should we avoid code duplication between all of these, and ensure that they all produce the same result?

One possibility I can think of is for the {START, EasyChair, TACL/CL}-to-Anthology tools to generate XML that validates, but not be concerned about details of the element contents, and make normalize_xml.py do the rest:

It would be nice if this normalization were idempotent. Unfortunately it's impossible to make the TeX-to-Unicode conversion idempotent, but the other steps could be.
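
To illustrate the idempotence point, here is a minimal sketch. It assumes a toy TeX-to-Unicode step that only strips TeX comments and unescapes `\%`; the function names are hypothetical and this is not the Anthology's actual converter. Whitespace collapsing is stable under a second pass, whereas the toy TeX step is not, because its own output (`%`) is meaningful TeX input.

```python
import re


def collapse_whitespace(text: str) -> str:
    # Idempotent step: once runs of whitespace are collapsed,
    # a second pass finds nothing left to change.
    return re.sub(r"\s+", " ", text).strip()


def toy_tex_to_unicode(text: str) -> str:
    # Hypothetical stand-in for a TeX-to-Unicode converter.
    # 1. Drop TeX comments: an unescaped % up to the end of the line.
    text = re.sub(r"(?<!\\)%.*", "", text)
    # 2. Unescape \% into a literal percent sign.
    return text.replace(r"\%", "%")


def is_idempotent(step, sample: str) -> bool:
    # A step is idempotent on a sample if applying it twice
    # gives the same result as applying it once.
    once = step(sample)
    return step(once) == once


if __name__ == "__main__":
    print(is_idempotent(collapse_whitespace, "  a   b \n c "))
    print(is_idempotent(toy_tex_to_unicode, r"50\% of papers"))
```

On the second sample, the first pass yields `50% of papers`, and a second pass would then treat `% of papers` as a comment and delete it, which is the kind of failure that makes rerunning this step on already-converted data unsafe.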

davidweichiang commented 5 years ago

So perhaps I should start by unifying normalize_anth.py and protect.py and adding a command-line switch to enable the non-idempotent parts.
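
A rough sketch of what such a switch could look like, using `argparse`. The flag name `--tex-to-unicode`, the `normalize` function, and the one-line `\&` replacement standing in for the real conversion are all assumptions made for illustration; nothing here reflects the existing scripts' interfaces.

```python
import argparse
import re


def normalize(text: str, tex_to_unicode: bool = False) -> str:
    # Non-idempotent part, off by default. The replacement below is
    # only a placeholder for a real TeX-to-Unicode conversion.
    if tex_to_unicode:
        text = text.replace(r"\&", "&")
    # Idempotent part, always applied: collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Normalize element contents (illustrative sketch)."
    )
    # Hypothetical switch: opt in to the step that is unsafe to rerun.
    parser.add_argument(
        "--tex-to-unicode",
        action="store_true",
        help="also run the non-idempotent TeX-to-Unicode conversion",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(normalize(r"Smith  \& Jones", tex_to_unicode=args.tex_to_unicode))
```

Keeping the non-idempotent step opt-in means the default invocation can be rerun on already-normalized XML without changing it, while the ingestion tools would pass the switch once, at import time.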

mjpost commented 5 years ago

One possibility I can think of is for the {START, EasyChair, TACL/CL}-to-Anthology tools to generate XML that validates, but not be concerned about details of the element contents, and make normalize_xml.py do the rest:

I think this is exactly right. The *-to-Anthology tools should effectively do the minimal conversion from BibTeX into our XML, and then we can normalize.

Consolidating normalize_anth.py and protect.py would be a great start, and I also like the idea of using command-line switches to distinguish the non-idempotent parts.

mjpost commented 5 years ago

Note that I am also in the process of merging easy2acl into ACLPUB.