Closed by mjpost 5 years ago
I was working on this yesterday, and while reading the MODS XSD (http://www.loc.gov/standards/mods/mods-3-7-announcement.html) I started to suspect that the ACL's MODS XML does not conform to the schema. Do we want the new XML files to be conformant? I don't think it would be much work, but it could affect the downstream tools.
To give you more details --
the root element does not declare an XML namespace,
the attribute "version" (from Matt's example link above) should be the MODS version, not the version of the document/record -- that should be done with
Yes, please correct these. There should be a schema that we can validate against.
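As a rough illustration of the two points above, here is a minimal sketch of a root-element check. It is not a substitute for validating against the XSD. The function name check_mods_root and the "starts with 3." test for the version are assumptions made for this example; the namespace URI is the one the MODS schema defines.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"  # namespace defined by the MODS schema


def check_mods_root(xml_text):
    """Return a list of problems found on the root element (empty if none).

    Hypothetical helper: it only inspects the root, it does not validate
    the document against the schema.
    """
    root = ET.fromstring(xml_text)
    problems = []
    # ElementTree exposes namespaced tags as "{namespace}localname".
    if not root.tag.startswith("{" + MODS_NS + "}"):
        problems.append("root element is not in the MODS namespace")
    # On <mods>, "version" should name the MODS version (e.g. "3.7").
    # Assumption for this sketch: anything not starting with "3." is suspect.
    version = root.get("version")
    if version is not None and not version.startswith("3."):
        problems.append("version attribute does not look like a MODS version")
    return problems


bad = '<mods version="2019-03-07"><titleInfo><title>Example</title></titleInfo></mods>'
good = (
    '<mods xmlns="%s" version="3.7">'
    "<titleInfo><title>Example</title></titleInfo></mods>" % MODS_NS
)
print(check_mods_root(bad))
print(check_mods_root(good))
```

A full conformance check would still need a schema validator run against the published MODS 3.7 XSD.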
Just to elaborate here: there is existing code for parsing the authoritative XML. The code is in anthology.py, and it is used, for example, by xml_to_yaml.py. It would be great if you could also use this library to parse the XML, so that we would have a common framework for reading. Your script (xml_to_mods_xml.py?) will then generate the MODS format that we will use in export.
I was looking at bibutils today, and noticed that it can also convert BibTeX to MODS XML. Since we already have an XML to BibTeX conversion (#122), have we considered doing everything else via bibutils?
It's worth trying. I wonder how it handles protected caps. Can you run it once and see?
I can look into it during the weekend. Sorry for being inactive on this for so long. y.
I've been playing with it for a while and it might be usable...
For example, the conversion of this bib: https://aclanthology.info/papers/P15-1165/p15-1165.bib (I'm using it for no other reason than that it has a lot of Unicode characters in names) passes validation against the MODS 3.7 schema.
One notable thing is that bib2xml should be given the parameter -nt, otherwise it splits booktitle into <title> and <subtitle> using some heuristics (probably because of the colon in the title of the proceedings).
As for the protected caps, I modified the title to contain {NLP} and it was converted to NLP.
Other than that, I think it keeps the casing exactly as it is in the bib file.
It correctly decodes the LaTeX entities (such as {\v{Z}}eljko) into the correct Unicode code points.
It seems the whole-proceedings bib files can also be processed without any problem (with the exception of having to supply -nt).
For completeness, version of bibutils:
bib2xml, bibutils suite version 6.7 date 2018-08-31
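For anyone wanting to script this, a minimal sketch of calling bib2xml from Python follows. The helper names are made up for this example, and it assumes that bib2xml is on the PATH, writes the MODS XML to standard output, and emits UTF-8; only the -nt flag comes from the test above.

```python
import subprocess


def bib2xml_command(bib_path, split_titles=False):
    """Build the bib2xml argument list (hypothetical helper).

    -nt stops bib2xml from splitting titles into <title>/<subtitle>.
    """
    cmd = ["bib2xml"]
    if not split_titles:
        cmd.append("-nt")
    cmd.append(bib_path)
    return cmd


def bib_to_mods(bib_path):
    """Run bib2xml and return its output as text.

    Assumes bibutils is installed and that the MODS XML arrives on stdout
    encoded as UTF-8; raises CalledProcessError on a non-zero exit status.
    """
    result = subprocess.run(
        bib2xml_command(bib_path), stdout=subprocess.PIPE, check=True
    )
    return result.stdout.decode("utf-8")


print(" ".join(bib2xml_command("p15-1165.bib")))
```

Keeping the command construction separate from the subprocess call makes the flag handling easy to check without bibutils installed.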
Thanks for testing this @jtrmal, I'm integrating this into the generation pipeline for now.
The Anthology currently uses bibutils to generate the bibliography files (BibTeX, MS Word, and so on). This is done from a MODS XML file, which is generated from a database, which is in turn generated from our authoritative XML files.
For the static rewrite, we would like to keep using bibutils to generate non-BibTeX citation formats, so we need MODS XML. But we should generate that directly from the authoritative XML, bypassing the database.
This script should be written to do the conversion. A good test will be to then run bibutils and see if we get the same result (or a better result, if we fix bugs #94 and #51 caused by the current pipeline).
The inputs are the authoritative XML files, which can be found in the import/ directory. Each file there contains all the papers for a proceedings. An example of MODS-format XML can be found on the page for any specific paper (for example); just click the MODS XML button.
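To make the direct conversion concrete, here is a minimal sketch that maps one paper record to a MODS element. The input layout (a paper element with title and author/first/last children) is an assumption for illustration and may not match the authoritative XML exactly. It also parses with ElementTree directly only to stay self-contained; a real script would presumably read the input through anthology.py. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
# Serialize MODS elements with a default namespace instead of an ns0: prefix.
ET.register_namespace("", MODS_NS)


def q(name):
    """Qualify a local element name with the MODS namespace."""
    return "{%s}%s" % (MODS_NS, name)


def paper_to_mods(paper):
    """Convert one (assumed-layout) paper element into a <mods> element."""
    mods = ET.Element(q("mods"), {"version": "3.7"})
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = paper.findtext("title")
    for author in paper.findall("author"):
        name = ET.SubElement(mods, q("name"), {"type": "personal"})
        given = ET.SubElement(name, q("namePart"), {"type": "given"})
        given.text = author.findtext("first")
        family = ET.SubElement(name, q("namePart"), {"type": "family"})
        family.text = author.findtext("last")
        role = ET.SubElement(name, q("role"))
        term = ET.SubElement(
            role, q("roleTerm"), {"authority": "marcrelator", "type": "text"}
        )
        term.text = "author"
    return mods


# Made-up sample record, not taken from the import/ directory.
sample = """<paper id="1001">
  <title>An Example Title</title>
  <author><first>Ada</first><last>Lovelace</last></author>
</paper>"""
mods = paper_to_mods(ET.fromstring(sample))
mods_xml = ET.tostring(mods, encoding="unicode")
print(mods_xml)
```

The output of such a script could then be fed to bibutils and compared with what the current pipeline produces, as suggested above.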