acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Write XML → BibTeX conversion script #122

Closed mjpost closed 5 years ago

mjpost commented 5 years ago

We can generate the BibTeX entries directly from the authoritative XML files under import/. This will take a bit of work but will allow us to preserve components like title casing, properly generate the month field, and include the abstracts.

mjpost commented 5 years ago

Two things left to do on this issue:

mjpost commented 5 years ago

This needs to be updated to reflect the XML structure in titles. See @davidweichiang's notes that it stops reading when it encounters a subtag (here) and also the presence of non-XML entities (here)

mbollmann commented 5 years ago

I'm integrating the anth2bib.py script into the current Hugo generation pipeline, and have a question:

The authoritative XML often contains <bibtype> and <bibkey> fields that are also rendered on the website, but the current anth2bib.py doesn't consider them. It either should, or we should get rid of them in the XML.

I'm not sure if <bibkey> is just there for historical reasons or what the rationale behind it is; personally I find it much more logical to use the Anthology ID as the key, so would be in favour of dropping it and keeping the current behaviour of the conversion script -- unless there are other reasons for not doing it this way.

I'm equally unsure where <bibtype> comes from, but would probably suggest that the conversion script should respect it when it's there.

mjpost commented 5 years ago

I think the <bib*> keys are preserved from the ingestion process, and I think you're right that they're redundant and we don't need them.

As to the key to use: I don't like the idea of using the identifier because it carries no semantic information and is consequently a pain to use when writing papers. I personally prefer something like what Google Scholar does, which is

{last name of first author}{year}{optional suffix}:{first content word of title}

Actually, Google doesn't insert the suffix because they don't care about key conflicts, but we will need to care about those. Perhaps that makes this approach unworkable, but we could augment this with the venue identifier to make it easier to compute.
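
A minimal sketch of the key scheme described above, assuming that a collision is resolved by appending a letter after the year. The stopword list and the names `ascii_fold`, `first_content_word`, and `assign_keys` are hypothetical and not part of the Anthology code.

```python
import re
import string
import unicodedata

# Hypothetical stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "the", "on", "of", "for", "in", "to", "and"}


def ascii_fold(text):
    """Strip diacritics and drop anything that is not an ASCII letter or digit."""
    decomposed = unicodedata.normalize("NFKD", text)
    allowed = string.ascii_letters + string.digits
    return "".join(c for c in decomposed if c in allowed)


def first_content_word(title):
    """Return the first lowercased title word that is not a stopword."""
    words = (ascii_fold(w).lower() for w in re.split(r"\W+", title))
    return next((w for w in words if w and w not in STOPWORDS), "untitled")


def assign_keys(papers):
    """Return one key per (last_name, year, title) tuple, in input order.

    The first paper with a given base key gets no suffix; later ones get
    a, b, c, ... (this sketch assumes fewer than 27 collisions per base key).
    """
    seen = {}
    keys = []
    for last_name, year, title in papers:
        name = ascii_fold(last_name).lower()
        word = first_content_word(title)
        base = f"{name}{year}:{word}"
        count = seen.get(base, 0)
        seen[base] = count + 1
        suffix = "" if count == 0 else string.ascii_lowercase[count - 1]
        keys.append(f"{name}{year}{suffix}:{word}")
    return keys
```

Because the suffix depends on processing order, keys stay stable only if papers are always visited in the same order, which is one reason the conflict handling is harder than it first looks.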

davidweichiang commented 5 years ago

I would add that some of the existing keys are not valid because they contain things like curly braces. And some of them are unusably long.

mbollmann commented 5 years ago

@davidweichiang, would you mind if I refactored your code into my anthology module? That way the YAML generation could have access to the generated BibTeX keys, which would be necessary if they don't come directly from the XML anymore, I think.

mjpost commented 5 years ago

FTR, the current Anthology is using the Anthology ID, e.g., P18-1251 (the original ingested version isn't quite useful I think).

davidweichiang commented 5 years ago

@mbollmann Which code are you referring to?

danielgildea commented 5 years ago

It looks like the bib keys are created by:

https://github.com/acl-org/ACLPUB/blob/localv3/bin/bib.pl

and passed through by anthoBibs2xml.pl. I think these keys are too long, although somewhat better than the anthology ids.

(The bibtype for a proceedings should really be @Proceedings not @Book.)

mbollmann commented 5 years ago

@davidweichiang the printbib code in anth2bib.py. If that logic was in anthology/papers.py instead, we could also assign unique bibkeys in whatever format we prefer there, and the YAML export could access it to display the correct bibkey on the website.

(On the other hand, if we don't care about the "bibkey" entry on the website and say the bib entry download is good enough instead, there's no pressing need to change this right now.)

danielgildea commented 5 years ago

anth2bib.py was written by me. Sure, go ahead and refactor however makes sense.

davidweichiang commented 5 years ago

The LaTeX to UTF8 conversion in latex.py is still not complete; for example, it doesn't get ą -> {\k a}. A couple of ideas:

mbollmann commented 5 years ago

I'll work on refactoring this into the anthology/ module later today; maybe hold off on further bug fixes on the script until then. I'll post here where the new logic lives.

mbollmann commented 5 years ago

@danielgildea, what was the reason to include latex.py instead of going with latexcodec? The latter seems to be based on the former, but is actively maintained.

davidweichiang commented 5 years ago

I think the main reason was that he wanted é to be generated as {\'e}, which latexcodec does not do (but the above workaround gets).

Meanwhile, latexcodec is good but not perfect...I've got about 20 lines of workarounds for various issues, some of which the author has quickly fixed, but some which appear to have no fix.

mbollmann commented 5 years ago

> I think the main reason was that he wanted é to be generated as {\'e}, which latexcodec does not do (but the above workaround gets).

It does for me? Or was this added in the meantime?

>>> import latexcodec
>>> import codecs
>>> print(codecs.encode("Tést", "ulatex"))
T\'est

Do you have your workarounds collected somewhere? Or are they integrated in our latex.py?

davidweichiang commented 5 years ago

For BibTeX, it should be {\'e}, not \'e or \'{e}. (See https://github.com/acl-org/acl-anthology/issues/135 and the BibTeX FAQ).
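
A minimal sketch of producing the braced form, assuming that a character which decomposes into an ASCII letter plus one combining accent can be handled generically. The `ACCENTS` table is a small illustrative subset, and `bibtex_escape` is a hypothetical name, not the function in latex.py.

```python
import unicodedata

# Combining marks mapped to LaTeX accent commands (illustrative subset).
ACCENTS = {
    "\u0300": "`",   # grave
    "\u0301": "'",   # acute
    "\u0302": "^",   # circumflex
    "\u0303": "~",   # tilde
    "\u0308": '"',   # diaeresis
}


def bibtex_escape_char(ch):
    """Return ch in the braced {\\'e} form if it is ASCII letter + known accent."""
    decomposed = unicodedata.normalize("NFD", ch)
    if len(decomposed) != 2 or decomposed[1] not in ACCENTS:
        return ch  # not a single-accent character: pass through unchanged
    base, mark = decomposed
    if not (base.isascii() and base.isalpha()):
        return ch
    if base == "i":
        base = "\\i"  # accents go on the dotless i
    return "{\\" + ACCENTS[mark] + base + "}"


def bibtex_escape(text):
    """Escape every character of text independently."""
    return "".join(bibtex_escape_char(c) for c in text)


print(bibtex_escape("Tést"))  # T{\'e}st
```

Characters with no canonical decomposition are left untouched here, so they would still need explicit table entries.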

My workarounds are in bin/tex_unicode.py, which is only used in this repo for XML cleanup, but I have further modified it and it's going to move over to ACLPUB as part of anthologize.pl (or .py).

davidweichiang commented 5 years ago

I should have also said, my workarounds are only needed in the decoding direction. The encoding direction, I imagine, has fewer pitfalls. The develop branch should have 100% coverage for our data, I think.

mbollmann commented 5 years ago

Okay, I've kept our latex.py for now. I think there are advantages to building on an actively maintained package, but latex.py seems to have a lot of customizations for our use case (the curly braces around special characters are just one of them; latexcodec also chokes on Chinese characters, for example).

I've refactored @danielgildea's script into the anthology module:

I'm not sure this refactoring is ideal yet, but it gets rid of a lot of code duplication -- e.g., interpreting volume/issue numbers, deriving the publication year, cleaning up dashes in "pages", etc., all of which is already done in the anthology module.
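
As one illustration of the kind of cleanup mentioned above, here is a minimal sketch of normalizing dashes in a "pages" value to the double hyphen that BibTeX conventionally uses for ranges. The function name `normalize_pages` is hypothetical, and the anthology module may do this differently.

```python
import re


def normalize_pages(pages):
    """Collapse any run of hyphens or en/em dashes (with spaces) into '--'."""
    return re.sub(r"\s*[-\u2013\u2014]+\s*", "--", pages.strip())


print(normalize_pages("12-15"))  # 12--15
```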

davidweichiang commented 5 years ago

Here's an example of a Unicode char still getting passed through to BibTeX: http://anthology.aclweb.org/W/W16/W16-1815.bib

I think Chinese characters should work if you use latexcodec with codecs.decode('ulatex+utf8') (which tells it that it can pass UTF-8 characters through), but we're still stuck with the problem that BibTeX doesn't support Unicode. What bad thing happens -- does BibTeX just not know how to sort?

danielgildea commented 5 years ago

> What bad thing happens -- does BibTeX just not know how to sort?

Yes, it's just a problem with sorting. Well, and the fact that you will get garbage in the pdf if you do not have a super recent latex where utf8 is the default for input files and you did not say \usepackage[utf8]{inputenc} in your latex file.

davidweichiang commented 5 years ago

Presumably it's also a problem for bibliography styles that have citations like [Gil19].

Since biber/biblatex also use .bib files and do handle UTF-8, I think it's fair for us to use UTF-8 in .bib files, but we should try to use as little as possible. So I suggest:

mbollmann commented 5 years ago

Another issue with our current latex.py (?): some characters seem to get escaped twice, as I just stumbled across ñ becoming {\\textasciitilde n} in P03-2003.

EDIT: Should we maybe take the time to write unit tests for the BibTeX generation? It seems there's a lot that can go wrong here.

davidweichiang commented 5 years ago

That looks doubly wrong, since this should become {\~n}. But I don't see where in the code this could be coming from.

danielgildea commented 5 years ago

> Another issue with our current latex.py (?): some characters seem to get escaped twice, as I just stumbled across ñ becoming {\\textasciitilde n} in P03-2003.

I think it is getting re-escaped by the pybtex library. See also ampersands in titles: http://aclweb.org/anthology/papers/P/P14/P14-1131.bib
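
A minimal sketch of how a second escaping pass could produce exactly this output. The escaper below is a hypothetical stand-in that treats its input as plain text; it is not pybtex's actual code.

```python
# Hypothetical plain-text escaper: maps LaTeX-special characters to commands
# and assumes the input contains no LaTeX markup yet.
PLAIN_TEXT_ESCAPES = {
    "~": "\\textasciitilde ",
    "&": "\\&",
}


def escape_plain_text(text):
    """Escape special characters one by one, with no awareness of context."""
    return "".join(PLAIN_TEXT_ESCAPES.get(c, c) for c in text)


already_encoded = "{\\~n}"                 # first pass: ñ encoded as {\~n}
print(escape_plain_text(already_encoded))  # {\\textasciitilde n}
print(escape_plain_text("{\\&}"))          # {\\&}
```

The tilde inside the already-encoded `{\~n}` is escaped again while the original backslash is kept, which matches the string seen in P03-2003. A regression test that asserts on the final output would catch this kind of interaction.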

mbollmann commented 5 years ago

> I think it is getting re-escaped by the pybtex library.

Welp. Seems like you really have to do everything yourself... I'll look into changing this tomorrow.

mjpost commented 5 years ago

We used pybtex in bibsearch, and I seem to remember having this exact same issue. We ended up just manually correcting it in postprocessing.

davidweichiang commented 5 years ago

Oh, I didn't realize you were using pybtex. It uses latexcodec to encode all fields (even url and doi, unfortunately).

https://bitbucket.org/pybtex-devs/pybtex/pull-requests/11/when-writing-bibtex-write-url-field-as-raw/diff

mbollmann commented 5 years ago

This is strange, as latexcodec (for me) encodes ø, é, å etc. while pybtex leaves them alone, so I assumed it left all content alone. I think for our use case I will just go back to printing the bib entry manually now, as we already take care of escaping/encoding separately.

(EDIT: Incidentally, not going through pybtex for printing the entries speeds up the process by a factor of 10 ...)
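
A minimal sketch of printing an entry manually, assuming field values have already been escaped and contain no unbalanced braces or double quotes. The function name and the unquoted treatment of `month` (so that BibTeX macros such as `jul` resolve) are assumptions for illustration.

```python
def format_bibtex_entry(entry_type, key, fields):
    """Format one BibTeX entry from a dict of already-escaped field values."""
    lines = [f"@{entry_type}{{{key},"]
    for name, value in fields.items():
        if value is None or value == "":
            continue  # skip absent fields instead of printing empty ones
        if name == "month":
            # Left unquoted so that three-letter month macros are resolved.
            lines.append(f"    {name} = {value},")
        else:
            lines.append(f'    {name} = "{value}",')
    lines.append("}")
    return "\n".join(lines)


print(format_bibtex_entry(
    "inproceedings",
    "P18-1251",
    {"title": "A {T}itle", "month": "jul", "doi": ""},
))
```

Since nothing here re-encodes the values, whatever the escaping step produced is what ends up in the .bib file.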

davidweichiang commented 5 years ago

It probably depends on whether you use the “ulatex” or “ulatex+utf8” codec — the latter passes a char through if it can be represented in UTF8.

davidweichiang commented 5 years ago

All the chars in the XML files: https://gist.github.com/davidweichiang/9714c713cd95966e006f39286155f39b

I thought I got rid of a bunch of these, but they are apparently still in there somewhere. I'll see if I can clean up more; the ones up to U+02FF and the ones with "DOT BELOW" are the ones that should probably be included in latex.py if they are not there already.
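
A minimal sketch of selecting those characters from a string, assuming "up to U+02FF" means non-ASCII code points in that range and that "DOT BELOW" refers to the Unicode character name. The function names are hypothetical.

```python
import unicodedata


def is_escape_candidate(ch):
    """True for non-ASCII chars up to U+02FF, or chars named '... DOT BELOW'."""
    code = ord(ch)
    if code < 0x80:
        return False  # plain ASCII needs no table entry
    if code <= 0x02FF:
        return True
    return "DOT BELOW" in unicodedata.name(ch, "")


def escape_candidates(text):
    """Return the distinct candidate characters in text, sorted by code point."""
    return sorted({c for c in text if is_escape_candidate(c)})


print(escape_candidates("aąạ中"))  # ['ą', 'ạ']
```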

danielgildea commented 5 years ago

> * expand coverage of `latex.py` to handle as many characters as possible (maybe using `latexcodec`).

I think this is done now. (bib entries on the site right now use \charXXX, but if you regenerate, they will use utf8, eg http://aclweb.org/anthology/papers/L/L08/L08-1355.bib)

> * the server needs to be configured to serve `.bib` files with an encoding of UTF-8.

Matt, I think only you can do this!

mjpost commented 5 years ago

Okay, the website is now reporting the encoding as UTF-8. I'm regenerating and will push sometime when I have a better web connection.

danielgildea commented 5 years ago

Thanks, Matt. I think this issue is actually totally done now.

mjpost commented 5 years ago

I don't think this is working. I have a test document in Overleaf that uses the L08-1355 example we've been working with, and it renders as

(inh et al., 2008)

Edit: updated Overleaf link, you should be able to see it.

mbollmann commented 5 years ago

@mjpost, your test document is not public, I can't view it.

mbollmann commented 5 years ago

Added \usepackage[T1]{fontenc}, looks good now. ;)

EDIT: Better, though not perfect. Don't know what font encoding the other chars require...

mbollmann commented 5 years ago

Update: \usepackage[vietnamese]{babel} does it.

Ultimately, the problem is LaTeX's font encoding. I don't think it matters how we represent these characters in our BibTeX; these problems will keep happening until people generally switch to LuaLaTeX/XeLaTeX ...

mjpost commented 5 years ago

Okay. I think we need to find a way to publicize that you need to add \usepackage{babel}, possibly with script-specific extensions, so that we don't leave people at a loss. Would it be possible to include this in a commented-out header in each BibTeX file? Or to alter the LaTeX scripts we release for each conference, and the Overleaf template?

Second, there are still a number of ASCII-based LaTeX escapes in the generated BibTeX files. It is a bit strange to see a mix of these escapes next to unicode (for example in our working example, L08-1355). Here are all the existing escape codes still present in our generated BibTeX files. Can we add them?

\theta
\ell
\geq
\in
\cap
\epsilon
{$\approx$}
{$\pi$}
{$\times$}
{$^\circ$}
{UDepLambda$\lnot$}
{\"A}
{\"O}
{\"U}
{\"\i}
{\"a}
{\"e}
{\"o}
{\"u}
{\#}
{\&}
{\'A}
{\'C}
{\'E}
{\'O}
{\'S}
{\'U}
{\'\i}
{\'a}
{\'c}
{\'e}
{\'n}
{\'o}
{\'s}
{\'u}
{\'y}
{\.I}
{\.Z}
{\.e}
{\.z}
{\=E}
{\=O}
{\=\i}
{\=a}
{\=e}
{\=u}
{\AA}
{\DJ}
{\L}
{\O}
{\P}
{\TH}
{\^A}
{\^\i}
{\^a}
{\^e}
{\^o}
{\^u}
{\`A}
{\`O}
{\`\i}
{\`a}
{\`e}
{\`o}
{\`u}
{\aa}
{\ae}
{\copyright}
{\dh}
{\dj}
{\i}
{\ldots}
{\l}
{\o}
{\pounds}
{\ss}
{\~O}
{\~a}
{\~n}
{\~o}
{\~u}

I realize there is a long tail and we can never get them all, but if we can get full coverage in our first pass at least, I think we'll be in pretty good shape.

danielgildea commented 5 years ago

We wanted to use latex escapes as much as possible, falling back to unicode, so that the vast majority of people can cite the vast majority of entries without using special packages, and also so that bibtex will alphabetize entries correctly.

Of the latex commands in your message above, all work out of the box in super generic latex with the exception of: \DJ \TH \dh \dj, which require

\usepackage[T1]{fontenc}.

I think it might be reasonable to use unicode for those four characters, because they are not understood by bibtex's alphabetization procedure anyway. However, for all the other accents and special characters in the list ({\~o} {\aa}, etc), bibtex will not alphabetize correctly if we use unicode.
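
A minimal sketch of that policy: use a LaTeX escape when the table has one, keep the four T1-only characters as Unicode, and pass anything unknown through unchanged. The table is a tiny illustrative subset and `encode_field` is a hypothetical name, not the function in latex.py.

```python
# Illustrative subset of a character-to-LaTeX table.
LATEX_TABLE = {
    "õ": "{\\~o}",
    "å": "{\\aa}",
    "ø": "{\\o}",
    "ß": "{\\ss}",
    "Đ": "{\\DJ}",
    "đ": "{\\dj}",
    "Þ": "{\\TH}",
    "ð": "{\\dh}",
}

# Characters whose commands need \usepackage[T1]{fontenc}; kept as Unicode.
KEEP_AS_UNICODE = frozenset("ĐđÞð")


def encode_field(text, keep_unicode=KEEP_AS_UNICODE):
    """Escape via the table, except for characters listed in keep_unicode."""
    out = []
    for ch in text:
        if ch in keep_unicode:
            out.append(ch)
        else:
            out.append(LATEX_TABLE.get(ch, ch))  # unknown chars pass through
    return "".join(out)


print(encode_field("Søren Đinh"))  # S{\o}ren Đinh
```

Passing `keep_unicode=frozenset()` switches back to escaping all four, which makes the two options easy to compare on real entries.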

For more on the goriest details of bibtex: http://tug.ctan.org/info/bibtex/tamethebeast/ttb_en.pdf

davidweichiang commented 5 years ago

There's a special BibTeX entry called @preamble that lets you inject code into the preamble of the document. So the entry could look something like

@preamble {"\usepackage[vietnamese]{babel}"}
@inproceedings {...}

However, someone should experiment a bit to make sure this is what we want.

As for determining which package(s) must be loaded for a given entry, there are tables that you can extract from the fontenc package that will tell you what encoding(s) you need, and the preamble would be a command to load that encoding(s). For example, for a paper with Vietnamese authors, we could auto-detect that T5 encoding is needed, and add @preamble {"\usepackage[T5]{fontenc}"}, which I believe would be enough to get it working.

I believe that the usual T1 encoding is enough for most Latin-script languages, and Vietnamese is our main challenge. If we really want to get this test case right, then for BibTeX to sort etc. correctly, a number of those special characters should still be converted into LaTeX commands. That might address @mjpost's concern about mixing Unicode and LaTeX commands.

Another possibility would be just to recommend for everyone to use XeLaTeX.

mjpost commented 5 years ago

> We wanted to use latex escapes as much as possible, falling back to unicode, so that the vast majority of people can cite the vast majority of entries without using special packages, and also so that bibtex will alphabetize entries correctly.

Ah, this is what I was missing, and makes perfect sense. Great.

davidweichiang commented 5 years ago

You could kind-of-sort-of get \DJ and \dj to alphabetize correctly if you double brace them as {{\DJ}} and {{\dj}}. Then they would alphabetize like DJ and dj.

As for \TH and \th, they apparently belong at the end of the Icelandic alphabet, which is what I think will happen by default (EDIT:) if we just left them as Unicode...

mjpost commented 5 years ago

I wasn't able to get @preamble working in my test document.

Using \usepackage[T5]{fontenc} (instead of babel) did work, though.

mjpost commented 5 years ago

Okay, I created a new issue around advertising this. Since the script is written and working, I'll re-close this issue. Thanks, everyone!