Open davidweichiang opened 5 years ago
So perhaps I should start by unifying normalize_anth.py and protect.py and adding a command-line switch to enable the non-idempotent parts.
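A rough sketch of what such a switch could look like, using argparse. The flag name `--tex-to-unicode`, the helper names, and the toy TeX handling are all made up for illustration; none of this is taken from normalize_anth.py or protect.py.

```python
import argparse
import re


def normalize_whitespace(text):
    """Idempotent step: collapse whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def toy_tex_to_unicode(text):
    """Toy stand-in for a non-idempotent step (hypothetical).

    Unescapes backslash-escaped braces, % and &, and drops bare grouping braces.
    """
    return re.sub(r"\\([{}%&])|[{}]", lambda m: m.group(1) or "", text)


def build_parser():
    parser = argparse.ArgumentParser(description="Normalize a string (sketch).")
    parser.add_argument("text", help="string to normalize")
    # Hypothetical switch: off by default, so repeated runs are safe.
    parser.add_argument("--tex-to-unicode", action="store_true",
                        help="also run the non-idempotent TeX-to-Unicode step")
    return parser


def run(argv):
    args = build_parser().parse_args(argv)
    result = normalize_whitespace(args.text)
    if args.tex_to_unicode:
        result = toy_tex_to_unicode(result)
    return result


print(run(["  a   \\{b\\} "]))
print(run(["--tex-to-unicode", "  a   \\{b\\} "]))
```

With the switch off, only the steps that are safe to repeat are applied, so the tool could be rerun over the whole XML tree without changing files that are already normalized.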
One possibility I can think of is for the {START, EasyChair, TACL/CL}-to-Anthology tools to generate XML that validates, but not be concerned about details of the element contents, and make normalize_xml.py do the rest:
I think this is exactly right. The *-to-Anthology tools should effectively do the minimal conversion from BibTeX into our XML, and then we can normalize.
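To make the split concrete, here is a minimal sketch assuming a converter that copies BibTeX fields into XML verbatim and a separate pass that cleans up the element contents. The function names and the flat `<paper>` layout are illustrative assumptions, not the actual Anthology schema or the existing tools.

```python
import re
import xml.etree.ElementTree as ET


def bibtex_to_paper(fields):
    """Minimal conversion: one child element per field, contents copied verbatim."""
    paper = ET.Element("paper")
    for name, value in fields.items():
        ET.SubElement(paper, name).text = value
    return paper


def normalize_paper(paper):
    """Shared clean-up pass, run after any of the converters (whitespace only here)."""
    for child in paper:
        if child.text is not None:
            child.text = re.sub(r"\s+", " ", child.text).strip()
    return paper


fields = {"title": "  A  Study of\n Parsing ", "booktitle": "Proceedings  of ACL"}
paper = normalize_paper(bibtex_to_paper(fields))
print(ET.tostring(paper, encoding="unicode"))
```

Because every converter would hand its output to the same `normalize_paper`, the clean-up logic lives in one place.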
Consolidating normalize_anth.py and protect.py would be a great start, and I also like the idea of using command-line switches to distinguish the non-idempotent parts.
Note that I am also in the process of merging easy2acl into ACLPUB.
Currently there are multiple pipelines that generate Anthology XML:
- ACLPUB converts from START metadata to BibTeX, and anthologize converts from BibTeX to XML.
- easy2acl
- tacl_cl_parser.py
These tools don't generate XML, but operate on existing XML:
- normalize_xml.py was used to update all the old XML files to new standards; it shares code with anthologize but is a little bit behind.
- protect.py guesses where <fixed-case> tags should go.
- auto_name_variants.py guesses name variants.
How should we avoid code duplication between all of these, and ensure that they all produce the same result?
One possibility I can think of is for the {START, EasyChair, TACL/CL}-to-Anthology tools to generate XML that validates, but not be concerned about details of the element contents, and make normalize_xml.py do the rest:
- <i>, <b>, <url>, <fixed-case>, <tex-math>
- <fixed-case> tags
It would be nice if this normalization were idempotent. Unfortunately it's impossible to make the TeX-to-Unicode conversion idempotent, but the other steps could be.
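As an illustration of the idempotence point, here is a small sketch that checks whether applying a step twice gives the same result as applying it once. The toy TeX converter is an assumption made up for this example (it only unescapes a few characters and drops grouping braces); it is not the conversion used by the Anthology scripts.

```python
import re


def normalize_whitespace(text):
    """Collapse whitespace runs and trim; a second run changes nothing."""
    return re.sub(r"\s+", " ", text).strip()


def toy_tex_to_unicode(text):
    """Toy converter: unescape backslash-escaped {, }, %, & and drop bare braces."""
    return re.sub(r"\\([{}%&])|[{}]", lambda m: m.group(1) or "", text)


def is_idempotent_on(step, sample):
    """True if running `step` a second time leaves the first result unchanged."""
    once = step(sample)
    return step(once) == once


print(is_idempotent_on(normalize_whitespace, "  a   b "))  # True
print(is_idempotent_on(toy_tex_to_unicode, r"\{a\}"))      # False
```

The second check fails because the first pass turns the escaped braces into literal braces, which a second pass then reads as TeX grouping and removes. Once the output is plain Unicode, the converter cannot tell a literal brace from markup.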