Write XML → mods XML script

acl-org / acl-anthology

Data and software for building the ACL Anthology.

https://aclanthology.org

Apache License 2.0

Write XML → mods XML script #121

Closed mjpost closed 5 years ago

mjpost commented 5 years ago

The Anthology currently uses bibutils to generate the bibliography files (BibTeX, MS Word, and so on). This is done from a MODS XML file, which is generated from a database, which is in turn generated from our authoritative XML files.

For the static rewrite, we would like to keep using bibutils to generate non-BibTeX citation formats, so we need MODS XML. But we should generate that directly from the authoritative XML, bypassing the database.

A script should be written to do the conversion. A good test will be to then run bibutils and see if we get the same result (or a better result, if we fix bugs #94 and #51 caused by the current pipeline).

The inputs are the authoritative XML files, which can be found in the import/ directory. Each file there contains all the papers for a proceedings. An example of a MODS-format XML can be found on any page for a specific paper (for example)—just click the MODS XML button.

jtrmal commented 5 years ago

I was working on this yesterday, and while reading the MODS XSD (http://www.loc.gov/standards/mods/mods-3-7-announcement.html) I started to suspect that the ACL's MODS XML does not conform to the schema. Do we want the new XML files to be conformant? I don't think it would be much work, but it could affect the downstream tools.

To give you more details -- the root element does not contain xml namespace, the attribute "version" (from Matt's example link above) should be the version of mods, not version of the document/record -- that should be done with and some other things I'm still trying to understand (I'm not fully fluent in xsd)
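A minimal sketch of how the two root-element problems above could be checked with the Python standard library. This is not full schema validation, only a quick check on a single `<mods>` record; the function name `check_mods_root` is hypothetical, and treating any value starting with "3." as a MODS version is a simplifying assumption.

```python
import xml.etree.ElementTree as ET

# Namespace URI used by MODS version 3 documents.
MODS_NS = "http://www.loc.gov/mods/v3"


def check_mods_root(xml_text):
    """Return a list of problems found on the root element (empty if none)."""
    root = ET.fromstring(xml_text)
    problems = []
    # ElementTree renders a namespaced tag as "{namespace}localname".
    if not root.tag.startswith("{" + MODS_NS + "}"):
        problems.append("root element is not in the MODS namespace")
    # On <mods>, "version" should name the MODS version (for example "3.7"),
    # not the version of the record. The "3." prefix test is an assumption.
    version = root.get("version")
    if version is None or not version.startswith("3."):
        problems.append("version attribute is not a MODS 3.x version")
    return problems
```

A record without the namespace and with a record-level version number would be reported with both problems; a real conformance test would still need to validate against the MODS XSD.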

mjpost commented 5 years ago

Yes, please correct these. There should be a schema that we can validate against.

mjpost commented 5 years ago

Just to elaborate here: there is existing code for parsing the authoritative XML. The code is in anthology.py, and it is used, for example, by xml_to_yaml.py. It would be great if you could also use this library to parse the XML, so that we would have a common framework for reading. Your script (xml_to_mods_xml.py?) will then generate the MODS format that we will use in export.
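To illustrate the kind of conversion meant here, a rough sketch using only the standard library. It does not use anthology.py, whose API is not shown in this thread, and the input layout (a `<paper>` element with `<title>` and `<author><first>/<last>` children) is an assumption for the example; `paper_to_mods` is a hypothetical name. The MODS element names (`titleInfo`, `name`, `namePart`, `role`, `roleTerm`) come from the MODS 3 schema.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
# Serialize MODS elements with a default namespace instead of an "ns0:" prefix.
ET.register_namespace("", MODS_NS)


def q(local_name):
    """Qualify a local element name with the MODS namespace."""
    return "{" + MODS_NS + "}" + local_name


def paper_to_mods(paper):
    """Convert one <paper> element (assumed layout) into a <mods> element."""
    mods = ET.Element(q("mods"), {"version": "3.7"})

    # itertext() also collects text nested inside inline markup in the title.
    title = paper.find("title")
    title_text = "".join(title.itertext()) if title is not None else ""
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = title_text

    for author in paper.findall("author"):
        name = ET.SubElement(mods, q("name"), {"type": "personal"})
        given = ET.SubElement(name, q("namePart"), {"type": "given"})
        given.text = author.findtext("first", default="")
        family = ET.SubElement(name, q("namePart"), {"type": "family"})
        family.text = author.findtext("last", default="")
        role = ET.SubElement(name, q("role"))
        role_term = ET.SubElement(
            role, q("roleTerm"), {"authority": "marcrelator", "type": "text"}
        )
        role_term.text = "author"
    return mods
```

Calling `ET.tostring(paper_to_mods(paper), encoding="unicode")` on a parsed paper would give a string that can be compared against what bibutils expects; fields such as the proceedings title, pages, and year are left out of this sketch.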

mbollmann commented 5 years ago

I was looking at bibutils today, and noticed that it can also convert BibTeX to MODS XML. Since we already have an XML to BibTeX conversion (#122), have we considered doing everything else via bibutils?

mjpost commented 5 years ago

It's worth trying. I wonder how it handles protected caps. Can you run it once and see?

jtrmal commented 5 years ago

I can look into it during the weekend. Sorry for being inactive on this for so long.

jtrmal commented 5 years ago

I've been playing with it for a while and it might be usable... For example, the conversion of this bib: https://aclanthology.info/papers/P15-1165/p15-1165.bib (I'm using it for no other reason than that it has a lot of Unicode characters in names) passes validation against the MODS 3.7 schema. One notable thing is that bib2xml should be given the -nt parameter; otherwise it splits booktitle into <title> and <subtitle> using some heuristics (probably because of the colon in the title of the proceedings).

As for the protected caps, I modified the title to {NLP} and it was converted to NLP. Other than that, I think it keeps the casing exactly as it is in the bib file. It correctly decodes LaTeX entities (such as {\v{Z}}eljko) into the correct Unicode code points.

jtrmal commented 5 years ago

It seems the whole-proceedings bib files can also be processed without any problem (with the exception of having to supply -nt). For completeness, the version of bibutils: bib2xml, bibutils suite version 6.7 date 2018-08-31
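A small sketch of how this bib2xml call might be wrapped for a generation pipeline. The `-nt` flag is the one reported above; the helper names `bib2xml_command` and `bib_to_mods` are hypothetical, and the sketch assumes bib2xml is on the PATH and writes the MODS XML to standard output.

```python
import subprocess


def bib2xml_command(bib_path):
    """Build the bib2xml invocation, always passing -nt (see above)."""
    return ["bib2xml", "-nt", bib_path]


def bib_to_mods(bib_path, mods_path):
    """Run bib2xml on bib_path and save its standard output to mods_path."""
    with open(mods_path, "wb") as out:
        # check=True raises CalledProcessError if bib2xml exits with an error.
        subprocess.run(bib2xml_command(bib_path), stdout=out, check=True)
```

For example, `bib_to_mods("p15-1165.bib", "p15-1165.xml")` would write the MODS file next to the BibTeX file, provided bibutils is installed.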

mbollmann commented 5 years ago

Thanks for testing this @jtrmal, I'm integrating this into the generation pipeline for now.