Two things left to do on this issue:

* expand coverage of `latex.py` to handle as many characters as possible (maybe using `latexcodec`).
* the server needs to be configured to serve `.bib` files with an encoding of UTF-8.
I'm integrating the `anth2bib.py` script into the current Hugo generation pipeline, and have a question:
The authoritative XML often contains `<bibtype>` and `<bibkey>` fields that are also rendered on the website, but the current `anth2bib.py` doesn't consider them. It either should, or we should get rid of them in the XML.

I'm not sure if `<bibkey>` is just there for historical reasons or what the rationale behind it is; personally I find it much more logical to use the Anthology ID as the key, so would be in favour of dropping it and keeping the current behaviour of the conversion script -- unless there are other reasons for not doing it this way.

I'm equally unsure where `<bibtype>` comes from, but would probably suggest that the conversion script should respect it when it's there.
I think the `<bib*>` keys are preserved from the ingestion process, and I think you're right that they're redundant and we don't need them.
As to the key to use: I don't like the idea of using the identifier because it carries no semantic information and is consequently a pain to use when writing papers. I personally prefer something like Google Scholar does, which is

`{last name of first author}{year}{optional suffix}:{first content word of title}`

Actually, Google doesn't insert the suffix because they don't care about key conflicts, but we will need to care about those. Perhaps that makes this approach unworkable, but we could augment this with the venue identifier to make it easier to compute.
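For concreteness, a minimal sketch of that key scheme (the stopword list and function name are illustrative assumptions, not anything in the codebase):

```python
import re

# Small illustrative stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "the", "on", "of", "for", "and", "to", "in", "with"}

def make_bibkey(first_author_last, year, title, suffix=""):
    """Build a key like {lastname}{year}{suffix}:{first content word of title}."""
    words = re.findall(r"[a-z]+", title.lower())
    content = next((w for w in words if w not in STOPWORDS), "untitled")
    return f"{first_author_last.lower()}{year}{suffix}:{content}"

print(make_bibkey("Post", 2018, "A Call for Clarity in Reporting BLEU Scores"))
# -> post2018:call
```

Disambiguating suffixes (`a`, `b`, ...) would still have to be assigned globally, which is where the venue identifier could help.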
I would add that some of the existing keys are not valid because they contain things like curly braces. And some of them are unusably long.
@davidweichiang, would you mind if I refactored your code into my `anthology` module? That way the YAML generation could have access to the generated BibTeX keys, which would be necessary if they don't come directly from the XML anymore, I think.
FTR, the current Anthology is using the Anthology ID, e.g., P18-1251 (the original ingested version isn't quite useful, I think).
@mbollmann Which code are you referring to?
It looks like the bib keys are created by https://github.com/acl-org/ACLPUB/blob/localv3/bin/bib.pl and passed through by `anthoBibs2xml.pl`. I think these keys are too long, although somewhat better than the Anthology IDs.
(The bibtype for a proceedings should really be `@Proceedings`, not `@Book`.)
@davidweichiang the `printbib` code in `anth2bib.py`. If that logic was in `anthology/papers.py` instead, we could also assign unique bibkeys in whatever format we prefer there, and the YAML export could access it to display the correct bibkey on the website.
(On the other hand, if we don't care about the "bibkey" entry on the website and say the bib entry download is good enough instead, there's no pressing need to change this right now.)
`anth2bib.py` was written by me. Sure, go ahead and refactor however makes sense.
The UTF-8 to LaTeX conversion in `latex.py` is still not complete; for example, it doesn't get `ą` -> `{\k a}`. A couple of ideas:
1. Switch to latexcodec, encoding a character at a time to make special characters safe for BibTeX, like this:

```python
import codecs
import latexcodec  # noqa: F401 -- imported to register the 'ulatex' codec

def bibtex_safe(s):
    for u in s:
        t = codecs.encode(u, 'ulatex')
        if t != u:  # or maybe t.startswith('\\')?
            t = '{' + t + '}'
        yield t
```
2. Keep `latex.py` but make the accent handling more complete by doing `unicodedata.normalize('NFD')` to make accents into separate chars, then convert each accent to its TeX equivalent, e.g., `s = re.sub(r'(.)\u0301', r"{\\'\1}", s)`. There are only about ten accents, so you can cover more cases with fewer lines of code this way (see the sketch below).
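A minimal sketch of that second approach (the combining-character table here is deliberately tiny and illustrative; `latex.py` would need the full set):

```python
import re
import unicodedata

# Combining characters (as produced by NFD) mapped to TeX accent commands.
ACCENTS = {
    '\u0300': '`',   # grave
    '\u0301': "'",   # acute
    '\u0302': '^',   # circumflex
    '\u0303': '~',   # tilde
    '\u0308': '"',   # diaeresis
    '\u0328': 'k',   # ogonek, e.g. ą -> {\k a}
}

def to_tex_accents(s):
    s = unicodedata.normalize('NFD', s)
    for combining, cmd in ACCENTS.items():
        # Letter-named commands like \k need a space before the argument.
        sep = ' ' if cmd.isalpha() else ''
        s = re.sub('(.)' + combining, r'{\\' + cmd + sep + r'\1}', s)
    return s

print(to_tex_accents('ą'))  # {\k a}
print(to_tex_accents('é'))  # {\'e}
```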
I'll work on refactoring this into the `anthology/` module later today; maybe hold off on further bug fixes on the script until then. I'll post here where the new logic lives.
@danielgildea, what was the reason to include `latex.py` instead of going with `latexcodec`? The latter seems to be based on the former, but is actively maintained.
I think the main reason was that he wanted `é` to be generated as `{\'e}`, which latexcodec does not do (but the above workaround gets).

Meanwhile, latexcodec is good but not perfect... I've got about 20 lines of workarounds for various issues, some of which the author has quickly fixed, but some of which appear to have no fix.
> I think the main reason was that he wanted `é` to be generated as `{\'e}`, which latexcodec does not do (but the above workaround gets).
It does for me? Or was this added in the meantime?
```python
>>> import latexcodec
>>> import codecs
>>> print(codecs.encode("Tést", "ulatex"))
T\'est
```
Do you have your workarounds collected somewhere? Or are they integrated in our `latex.py`?
For BibTeX, it should be `{\'e}`, not `\'e` or `\'{e}`. (See https://github.com/acl-org/acl-anthology/issues/135 and the BibTeX FAQ.)
My workarounds are in `bin/tex_unicode.py`, which is only used in this repo for XML cleanup, but I have further modified it and it's going to move over to ACLPUB as part of `anthologize.pl` (or `.py`).
I should have also said, my workarounds are only needed in the decoding direction. The encoding direction, I imagine, has fewer pitfalls. The develop branch should have 100% coverage for our data, I think.
Okay, I've kept our `latex.py` for now. I think there would be advantages to building around an actively maintained package, but `latex.py` seems to have a lot of customizations for our use case (the curly braces around special characters are just one of them -- `latexcodec` chokes on Chinese characters, for example).
I've refactored @danielgildea's script into the anthology module:

* `anthology/formatter.py`
* `anthology/papers.py`
* `latex.py` has moved to `anthology/latexcodec.py`

I'm not sure this refactoring is ideal yet, but it gets rid of a lot of code duplication -- e.g., interpreting volume/issue numbers, deriving the publication year, cleaning up dashes in "pages", etc., all of which is already done in the anthology module.
Here's an example of a Unicode char still getting passed through to BibTeX: http://anthology.aclweb.org/W/W16/W16-1815.bib
I think Chinese characters should work if you use latexcodec with `codecs.encode(s, 'ulatex+utf8')` (which tells it that it can pass UTF-8 characters through), but we're still stuck with the problem that BibTeX doesn't support Unicode. What bad thing happens -- does BibTeX just not know how to sort?
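For illustration, a quick sketch of the difference between the two codec names (the expected output is my reading of the latexcodec docs, so treat it as an assumption):

```python
import codecs
import latexcodec  # noqa: F401 -- imported to register the 'ulatex*' codecs

s = "Tést 漢字"
# codecs.encode(s, "ulatex") raises UnicodeEncodeError on 漢字,
# since those characters have no entry in the LaTeX table.
# With the +utf8 variant, untranslatable characters pass through:
print(codecs.encode(s, "ulatex+utf8"))  # expected: T\'est 漢字
```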
> What bad thing happens -- does BibTeX just not know how to sort?

Yes, it's just a problem with sorting. Well, and the fact that you will get garbage in the PDF if you do not have a super recent LaTeX where UTF-8 is the default for input files and you did not say `\usepackage[utf8]{inputenc}` in your LaTeX file.
Presumably it's also a problem for bibliography styles that have citations like [Gil19].
Since `biber`/`biblatex` also use `.bib` files and do handle UTF-8, I think it's fair for us to use UTF-8 in `.bib` files, but we should try to use as little as possible. So I suggest:

* expand coverage of `latex.py` to handle as many characters as possible (maybe using `latexcodec`).
* the server needs to be configured to serve `.bib` files with an encoding of UTF-8.

Another issue with our current `latex.py` (?): some characters seem to get escaped twice, as I just stumbled across `ñ` becoming `{\\textasciitilde n}` in P03-2003.
EDIT: Should we maybe take the time to write unit tests for the BibTeX generation? It seems there's a lot that can go wrong here.
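Something like this minimal pytest sketch, perhaps (the function name `to_latex` and its location are placeholders for whatever the refactored module ends up exposing):

```python
# test_bibtex.py -- hypothetical unit tests for the UTF-8 -> LaTeX conversion.
import pytest

from anthology.latexcodec import to_latex  # assumed name and location

@pytest.mark.parametrize("utf8, expected", [
    ("é", r"{\'e}"),
    ("ñ", r"{\~n}"),
    ("ą", r"{\k a}"),
])
def test_utf8_to_latex(utf8, expected):
    assert to_latex(utf8) == expected
```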
That looks doubly wrong, since this should become `{\~n}`. But I don't see where in the code this could be coming from.
> Another issue with our current `latex.py` (?): some characters seem to get escaped twice, as I just stumbled across `ñ` becoming `{\\textasciitilde n}` in P03-2003.
I think it is getting re-escaped by the pybtex library. See also ampersands in titles: http://aclweb.org/anthology/papers/P/P14/P14-1131.bib
> I think it is getting re-escaped by the pybtex library.
Welp. Seems like you really have to do everything yourself... I'll look into changing this tomorrow.
We used pybtex in bibsearch, and I seem to remember having this exact same issue. We ended up just manually correcting it in postprocessing.
Oh, I didn't realize you were using pybtex. It uses latexcodec to encode all fields (even url and doi, unfortunately).
This is strange, as latexcodec (for me) encodes ø, é, å etc. while pybtex leaves them alone, so I assumed it left all content alone. I think for our use case I will just go back to printing the bib entry manually now, as we already take care of escaping/encoding separately.
(EDIT: Incidentally, not going through pybtex for printing the entries sped up the process by a factor of 10...)
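For reference, printing an entry manually is only a few lines; a rough sketch (field names and the helper are illustrative, and it assumes values are already LaTeX-escaped, so nothing gets re-escaped):

```python
def format_bibtex(entry_type, key, fields):
    # Serialize one BibTeX entry; values must already be LaTeX-escaped.
    lines = ["@%s{%s," % (entry_type, key)]
    lines += ['    %s = "%s",' % (name, value) for name, value in fields.items()]
    lines.append("}")
    return "\n".join(lines)

print(format_bibtex("inproceedings", "L08-1355",
                    {"title": "Some {T}itle", "year": "2008"}))
```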
It probably depends on whether you use the “ulatex” or “ulatex+utf8” codec — the latter passes a char through if it can be represented in UTF8.
All the chars in the XML files: https://gist.github.com/davidweichiang/9714c713cd95966e006f39286155f39b
I thought I got rid of a bunch of these, but they are apparently still in there somewhere. I'll see if I can clean up more; the ones up to U+02FF and the ones with "DOT BELOW" are the ones that should probably be included in `latex.py` if they are not there already.
> * expand coverage of `latex.py` to handle as many characters as possible (maybe using `latexcodec`).

I think this is done now. (Bib entries on the site right now use `\charXXX`, but if you regenerate, they will use UTF-8, e.g., http://aclweb.org/anthology/papers/L/L08/L08-1355.bib)

> * the server needs to be configured to serve `.bib` files with an encoding of UTF-8.

Matt, I think only you can do this!
Okay, the website is now reporting the encoding as UTF-8. I'm regenerating and will push sometime when I have a better web connection.
Thanks, Matt. I think this issue is actually totally done now.
I don't think this is working. I have a test document in Overleaf that uses the L08-1355 example we've been working with, and it renders as
(inh et al., 2008)
Edit: updated Overleaf link, you should be able to see it.
@mjpost, your test document is not public, I can't view it.
Added `\usepackage[T1]{fontenc}`, looks good now. ;)

EDIT: Better, though not perfect. Don't know what font encoding the other chars require...
Update: `\usepackage[vietnamese]{babel}` does it.
Ultimately, the problem is LaTeX's font encoding. I don't think it matters how we represent these characters in our BibTeX; these problems will keep happening until people generally switch to LuaLaTeX/XeLaTeX...
Okay. I think we need to find a way to publicize that you need to add `\usepackage{babel}`, possibly with script-specific extensions, so that we don't leave people at a loss. Would it be possible to include this in a commented-out header in each BibTeX file? Or to alter the LaTeX scripts we release for each conference, and the Overleaf template?
Second, there are still a number of ASCII-based LaTeX escapes in the generated BibTeX files. It is a bit strange to see a mix of these escapes next to unicode (for example in our working example, L08-1355). Here are all the existing escape codes still present in our generated BibTeX files. Can we add them?
```
\theta
\ell
\geq
\in
\cap
\epsilon
{$\approx$}
{$\pi$}
{$\times$}
{$^\circ$}
{UDepLambda$\lnot$}
{\"A}
{\"O}
{\"U}
{\"\i}
{\"a}
{\"e}
{\"o}
{\"u}
{\#}
{\&}
{\'A}
{\'C}
{\'E}
{\'O}
{\'S}
{\'U}
{\'\i}
{\'a}
{\'c}
{\'e}
{\'n}
{\'o}
{\'s}
{\'u}
{\'y}
{\.I}
{\.Z}
{\.e}
{\.z}
{\=E}
{\=O}
{\=\i}
{\=a}
{\=e}
{\=u}
{\AA}
{\DJ}
{\L}
{\O}
{\P}
{\TH}
{\^A}
{\^\i}
{\^a}
{\^e}
{\^o}
{\^u}
{\`A}
{\`O}
{\`\i}
{\`a}
{\`e}
{\`o}
{\`u}
{\aa}
{\ae}
{\copyright}
{\dh}
{\dj}
{\i}
{\ldots}
{\l}
{\o}
{\pounds}
{\ss}
{\~O}
{\~a}
{\~n}
{\~o}
{\~u}
```
I realize there is a long tail and we can never get them all, but if we can get full coverage in our first pass at least, I think we'll be in pretty good shape.
We wanted to use latex escapes as much as possible, falling back to unicode, so that the vast majority of people can cite the vast majority of entries without using special packages, and also so that bibtex will alphabetize entries correctly.
Of the latex commands in your message above, all work out of the box in super generic latex with the exception of `\DJ`, `\TH`, `\dh`, and `\dj`, which require `\usepackage[T1]{fontenc}`.
I think it might be reasonable to use unicode for those four characters, because they are not understood by bibtex's alphabetization procedure anyway. However, for all the other accents and special characters in the list (`{\~o}`, `{\aa}`, etc.), bibtex will not alphabetize correctly if we use unicode.
For more on the goriest details of bibtex: http://tug.ctan.org/info/bibtex/tamethebeast/ttb_en.pdf
There's a special BibTeX entry called `@preamble` that lets you inject code into the preamble of the document. So the entry could look something like

```
@preamble {"\usepackage[vietnamese]{babel}"}
@inproceedings {...}
```

However, someone should experiment a bit to make sure this is what we want.
As for determining which package(s) must be loaded for a given entry, there are tables that you can extract from the `fontenc` package that will tell you what encoding(s) you need, and the preamble would be a command to load those encodings. For example, for a paper with Vietnamese authors, we could auto-detect that T5 encoding is needed, and add `@preamble {"\usepackage[T5]{fontenc}"}`, which I believe would be enough to get it working.
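A rough sketch of what that auto-detection could look like (the character test below is a crude proxy based on the Vietnamese range of Latin Extended Additional, not an actual table extracted from `fontenc`):

```python
# Hypothetical encoding detection for a BibTeX entry's text.
VIETNAMESE_CHARS = set("ăâđêôơưĂÂĐÊÔƠƯ") | {chr(c) for c in range(0x1EA0, 0x1EFA)}

def needed_fontenc(entry_text):
    if any(ch in VIETNAMESE_CHARS for ch in entry_text):
        return "T5"
    return "T1"

def preamble_for(entry_text):
    return '@preamble{"\\usepackage[%s]{fontenc}"}' % needed_fontenc(entry_text)

print(preamble_for("Nguyễn Văn A"))
# -> @preamble{"\usepackage[T5]{fontenc}"}
```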
I believe that the usual T1 encoding is enough for most Latin-script languages, and Vietnamese is our main challenge. If we really want to get this test case right, then for BibTeX to sort etc. correctly, a number of those special characters should still be converted into LaTeX commands. That might address @mjpost's concern about mixing Unicode and LaTeX commands.
Another possibility would be just to recommend for everyone to use XeLaTeX.
> We wanted to use latex escapes as much as possible, falling back to unicode, so that the vast majority of people can cite the vast majority of entries without using special packages, and also so that bibtex will alphabetize entries correctly.

Ah, this is what I was missing, and it makes perfect sense. Great.
You could kind-of-sort-of get `\DJ` and `\dj` to alphabetize correctly if you double-brace them as `{{\DJ}}` and `{{\dj}}`. Then they would alphabetize like `DJ` and `dj`.
As for `\TH` and `\th`, they apparently belong at the end of the Icelandic alphabet, which is what I think will happen by default if we just leave them as Unicode...
I wasn't able to get `@preamble` working in my test document. Using `\usepackage[T5]{fontenc}` (instead of babel) did work, though.
Okay, I created a new issue around advertising this. Since the script is written and working, I'll re-close this issue. Thanks, everyone!
We can generate the BibTeX entries directly from the authoritative XML files under `import/`. This will take a bit of work but will allow us to preserve components like title casing, properly generate the month field, and include the abstracts.