acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

BibTeX key #185

Closed: mjpost closed this issue 5 years ago

mjpost commented 5 years ago

I'm creating a new issue for this so we can discuss it.

I think the BibTeX keys should be something semantically useful as a citation aid. I suggest something like what Google Scholar does, which is

{last name of first author}{year}{optional suffix}:{first content word of title}

Actually, Google doesn't insert the suffix because they don't care about key conflicts, but we will need to care about those. Perhaps that makes this approach unworkable, but we could augment this with the venue identifier to make it easier to compute.
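
As a rough illustration of the format above, here is a minimal Python sketch. The function name `scholar_style_key`, the stopword list, and the sample paper are all assumptions made for the example; none of them come from Google Scholar or the Anthology code.

```python
import re

# Hypothetical stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "the", "on", "of", "for", "in", "to", "and", "with"}

def scholar_style_key(last_name: str, year: int, title: str, suffix: str = "") -> str:
    """Build {last name}{year}{optional suffix}:{first content word of title}."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    # First word that is not a stopword; empty string if there is none.
    content = next((w for w in words if w not in STOPWORDS), "")
    return f"{last_name.lower()}{year}{suffix}:{content}"

# Invented paper, used only to show the shape of the key.
print(scholar_style_key("Smith", 2019, "A Parser for the Masses"))
print(scholar_style_key("Smith", 2019, "A Parser for the Masses", suffix="b"))
```

The optional suffix is the part Google omits; it only matters once two papers would otherwise share a key.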

Originally posted by @mjpost in https://github.com/acl-org/acl-anthology/issues/122#issuecomment-470972745

mbollmann commented 5 years ago

What has stopped me from implementing this so far is that I believe everyone has their own preference here. For example, for multiple authors I always use {lastname1}-{lastname2} or {lastname1}-etal because I like to know if a paper has one or several authors for purposes of grammatical agreement (singular/plural verbs). I'm sure other people have other preferences.

What I'm really wondering is whether anyone even uses copied BibTeX entries without adapting the keys to their own system (e.g., JabRef has a whole templating system to accommodate individual preferences).

mjpost commented 5 years ago

I agree this is mostly a matter of personal preference and we’re not likely to agree exactly—this is partly why I appealed to authority and went with Google’s convention (which I’ve admittedly adopted myself).

I’m fine with extending that with your last name convention. But I do think the default should be something semantic in nature so that it’s at least possible to learn the “Anthology system” and not have to change anything if you don’t want to.

mbollmann commented 5 years ago

I'm playing around with this right now and wonder if we should take the length of the citation keys into account for deciding on a scheme. For example, with the current suggestion(s) we could end up with something like christodoulopoulos-etal2016:incremental (W16-1906) which is a bit unwieldy.

Some ideas:

  1. Truncate the author name to a maximum length (like 8, 10 or 12 characters)
  2. Only add the content word from the title when it's needed for disambiguation
  3. Drop the -etal specifier after all.
  4. ??

Another thing about key conflicts that came to mind: Let's say we have an entry smith2019 in our database, which only becomes ambiguous after some new proceedings are added, so it would have to be disambiguated somehow. The scripts cannot tell which proceedings "came first", so it's possible that the disambiguating element (e.g. suffix) will end up on the old key. In other words, citation keys for the current year might not be stable until the year's over.

Do we care about this? Ignore it? Is this an argument for always adding a content word, as this makes key collisions less likely?
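
To make the stability concern concrete, here is a small sketch of a suffix-based disambiguator. The function `assign_keys` and the paper IDs are hypothetical; the only point is that the result depends on the order in which entries are processed.

```python
from typing import Dict, List, Tuple

def assign_keys(papers: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map paper ID -> key, appending a, b, c, ... on collisions.

    `papers` is a list of (paper_id, base_key) pairs, processed in order.
    """
    taken = set()
    result = {}
    for paper_id, base in papers:
        key = base
        suffix = ord("a")
        while key in taken:
            key = f"{base}{chr(suffix)}"
            suffix += 1
        taken.add(key)
        result[paper_id] = key
    return result

old_first = assign_keys([("old-paper", "smith2019"), ("new-paper", "smith2019")])
new_first = assign_keys([("new-paper", "smith2019"), ("old-paper", "smith2019")])
print(old_first)  # the old paper keeps the bare key
print(new_first)  # the old paper's key changes
```

Unless the scripts have some stable ordering to rely on, nothing prevents the second outcome.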

mjpost commented 5 years ago

I think a longer complete key is better than an artificial truncation. And modern tools like Overleaf, for example, autocomplete citations found in the bib file, so typing it out isn’t a big deal (I’d even suggest having three author fields, with the last one as “etal” if necessary). I like having a word from the title in there always but maybe others will disagree.

Also I think christodoulopoulos-etal:2016:incremental looks a bit nicer.

davidweichiang commented 5 years ago

There are a number of bibkeys with non-ASCII characters, which I'm assuming is not going to work (even if one wanted it to).

mbollmann commented 5 years ago

> There are a number of bibkeys with non-ASCII characters, which I'm assuming is not going to work (even if one wanted it to).

My idea was to run names through python-slugify which is already used to generate the author URLs, and guarantees (sensible) ASCII output.
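
A sketch of the ASCII-folding idea using only the standard library. This is a simplified stand-in, not python-slugify itself: the helper `ascii_slug` and the sample names are hypothetical, and unlike python-slugify it does not transliterate non-Latin scripts; it only strips diacritics.

```python
import re
import unicodedata

def ascii_slug(text: str) -> str:
    """Lowercase, strip diacritics, and join alphanumeric runs with hyphens."""
    # NFKD splits an accented letter into base letter + combining mark;
    # encoding to ASCII with errors="ignore" then drops the combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "-".join(re.findall(r"[a-z0-9]+", ascii_only.lower()))

print(ascii_slug("François Müller"))
print(ascii_slug("Junczys-Dowmunt"))
```

Characters without a Unicode decomposition (for example "ø") are simply dropped by this stand-in, which is one reason to prefer a dedicated library.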

mbollmann commented 5 years ago

I've started a new branch to work on this. Currently it generates bibkeys by taking

  1. Up to two author names (though this number is easily configurable), or first author's name + "-etal" if there are more;
  2. Year of publication;
  3. First content word of title.

If the key generated that way is already taken, more content words are added until it is unique, or—this is currently not very elegant—if it runs out of content words, dummy words -i are added until the key is unique.
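
The procedure described above could look roughly like the following sketch. It is not the code from the branch: the function names, the stopword list, the colon-separated layout, and the sample papers are assumptions based on the description and the example keys in this thread.

```python
import re
from typing import List, Set

STOPWORDS = {"a", "an", "the", "on", "of", "for", "in", "to", "and", "with"}  # assumed list
MAX_AUTHORS = 2  # configurable number of author names to include

def author_part(last_names: List[str]) -> str:
    """Up to MAX_AUTHORS last names, else first author plus '-etal'."""
    names = [n.lower() for n in last_names]
    if len(names) > MAX_AUTHORS:
        return f"{names[0]}-etal"
    return "-".join(names)

def make_key(last_names: List[str], year: int, title: str, taken: Set[str]) -> str:
    """Add content words until the key is unique, then fall back to dummy '-i' words."""
    words = [w for w in re.findall(r"[a-z0-9]+", title.lower()) if w not in STOPWORDS]
    prefix = f"{author_part(last_names)}:{year}"
    used = 1
    key = f"{prefix}:{'-'.join(words[:used])}"
    while key in taken:
        if used < len(words):
            used += 1
            key = f"{prefix}:{'-'.join(words[:used])}"
        else:
            key += "-i"  # ran out of content words
    taken.add(key)
    return key

# Invented papers, used only to exercise the three cases.
taken: Set[str] = set()
print(make_key(["Doe", "Roe"], 2019, "Neural Parsing of Trees", taken))
print(make_key(["Doe", "Roe"], 2019, "Neural Parsing Revisited", taken))
print(make_key(["Doe", "Roe"], 2019, "Neural", taken))
```

The third call shows the inelegant case: the title has no further content words, so a dummy `-i` is appended.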

For example, these are the generated keys for two 2013 papers by Kalchbrenner & Blunsom:

An alternative approach to this might be to try using different content words first, i.e., using just convolutional in the second example; that way, bibkeys would be kept shorter.
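
A sketch of that alternative, assuming the content words have already been extracted. The helper `shortest_free_key` is hypothetical, and the word list passed in below is an assumed extraction for the second paper.

```python
from typing import List, Set

def shortest_free_key(prefix: str, content_words: List[str], taken: Set[str]) -> str:
    """Try each content word alone before falling back to longer combinations."""
    for word in content_words:
        candidate = f"{prefix}:{word}"
        if candidate not in taken:
            return candidate
    # Every single-word key is taken; grow the key word by word instead.
    for n in range(2, len(content_words) + 1):
        candidate = f"{prefix}:{'-'.join(content_words[:n])}"
        if candidate not in taken:
            return candidate
    raise ValueError("no unique key found; a dummy suffix would be needed")

taken = {"kalchbrenner-blunsom:2013:recurrent"}
print(shortest_free_key("kalchbrenner-blunsom:2013", ["recurrent", "convolutional"], taken))
```

The trade-off is that the key no longer starts with the first word of the title, which may make it harder to guess.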

As mentioned before, I'm running the words through slugify; for example, the paper 調變頻譜分解之改良於強健性語音辨識(Several Refinements of Modulation Spectrum Factorization for Robust Speech Recognition) [In Chinese] gets the key chang-etal:2015:diao, as that's (presumably) the transliteration of the first Chinese character here. Not sure how useful this actually is in this case, but this is mainly to demonstrate that the approach is robust across our database.

Oh, but finally, bibkeys for full proceedings volumes are a bit useless; e.g., ACL 1998 gets the keys nn:1998:36th (from "36th Annual Meeting of...") and nn:1998:36th-annual (for volume 2 of the proceedings). I'm not sure how to derive them automatically in a smarter way...

mjpost commented 5 years ago

This is great. Should we start a PR?

I’m not sure that people really cite volumes. But what about using just {short code}:{year}:{volume number}?

davidweichiang commented 5 years ago

That first character can apparently be pronounced as either diao or tiao. According to Google Translate, in this context it should be tiao, but perhaps that's not our problem. :)

EDIT: Also, even if transliterated correctly, it won’t have mnemonic value even to Chinese speakers; maybe grab the first Latin-ish word if there is one?

mbollmann commented 5 years ago

> I’m not sure that people really cite volumes. But what about using just {short code}:{year}:{volume number}?

Depends on what you mean by "short code". The only info available right now would be the high-level acronym, which is not very semantic for e.g. workshops.

> EDIT: Also, even if transliterated correctly, it won’t have mnemonic value even to Chinese speakers; maybe grab the first Latin-ish word if there is one?

Do you feel there's a robust, non-hacky way to determine the "first Latin-ish word"? What I like about the "just slugify it" approach is that we don't have to care about special characters or tokenization issues, as it produces something reasonable in most cases, e.g.:

mjpost commented 5 years ago

We could expand venues_letters.yaml. All workshops have short codes exported from START. But I’d prefer simplicity here (i.e., just use “W” if nothing is found in the map) to getting too complex and delaying all of this. The vast majority of citations are to papers.
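
A minimal sketch of the simple version, under the assumption that the map goes from an Anthology letter to a short code. The dictionary below is a hypothetical excerpt and does not reflect the actual structure of venues_letters.yaml; `volume_key` is likewise an invented name.

```python
from typing import Dict

# Hypothetical excerpt of a letter-to-short-code map.
VENUE_CODES: Dict[str, str] = {"P": "acl", "D": "emnlp"}

def volume_key(anthology_letter: str, year: int, volume: int) -> str:
    """Build {short code}:{year}:{volume number}, using 'W' when no code is known."""
    code = VENUE_CODES.get(anthology_letter, "W")
    return f"{code}:{year}:{volume}"

print(volume_key("P", 1998, 2))
print(volume_key("W", 2016, 19))  # nothing in the map, so fall back to "W"
```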

davidweichiang commented 5 years ago

Everything I can think of is probably a little hacky. I thought maybe you could do something with Unicode character properties, but it could backfire. The only simple idea I have is

import re
# If the title ends in "(...) [In <language>]", keep only the parenthesized part.
m = re.fullmatch(r'.*\((.*)\)\s*\[[Ii]n \S+\]\s*', s)
if m:
    s = m.group(1)

because our papers that are in other languages do often match that pattern, and in particular the ones in Chinese.
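
For illustration, here is the idea applied to the Chinese title mentioned earlier in this thread. The pattern below accepts both "in" and "In" and uses `\S+` for the language name; that is an assumption about the intent, and `strip_to_parenthesized` is a hypothetical helper name.

```python
import re

# Assumed variant of the pattern: "[In <language>]" with either capitalization.
PATTERN = re.compile(r".*\((.*)\)\s*\[[Ii]n \S+\]\s*")

def strip_to_parenthesized(title: str) -> str:
    """Return the parenthesized part if the title matches, else the title unchanged."""
    m = PATTERN.fullmatch(title)
    return m.group(1) if m else title

title = (
    "調變頻譜分解之改良於強健性語音辨識"
    "(Several Refinements of Modulation Spectrum Factorization "
    "for Robust Speech Recognition) [In Chinese]"
)
print(strip_to_parenthesized(title))
print(strip_to_parenthesized("A Plain English Title"))  # no match, returned as-is
```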

akoehn commented 5 years ago

I would like to not have ":" in the bibtex keys as a colon is not an allowed character in file names (at least for some OSes). I (and probably others as well) have my local copies of papers named after the bibtex key, which is very convenient.

Would e.g. kalchbrenner-blunsom_2013_recurrent be much worse than kalchbrenner-blunsom:2013:recurrent?

mbollmann commented 5 years ago

> Would e.g. kalchbrenner-blunsom_2013_recurrent be much worse than kalchbrenner-blunsom:2013:recurrent?

I feel that would be an absolute eyesore, but maybe that just goes to show that it's a very subjective issue, as I thought :)

I would prefer sticking to - everywhere or dropping the separator altogether.

akoehn commented 5 years ago

Yes, I also prefer to use - everywhere; that is what I use myself. The underscore was just because I assumed some people like to have different delimiters (-:.

The "bibtex key can be valid file name" property is at least a bit more than a subjective thing. We currently have it and would lose it when introducing colons.
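
The property can be checked mechanically. The sketch below tests a key against the characters that Windows reserves in file names; the helper name `is_safe_filename` is made up, and other operating systems have their own (mostly shorter) lists.

```python
# Characters reserved in Windows file names; ':' is among them.
FORBIDDEN = set('<>:"/\\|?*')

def is_safe_filename(key: str) -> bool:
    """True if the key is non-empty and contains none of the reserved characters."""
    return bool(key) and not any(ch in FORBIDDEN for ch in key)

for key in (
    "kalchbrenner-blunsom:2013:recurrent",
    "kalchbrenner-blunsom_2013_recurrent",
    "kalchbrenner-blunsom-2013-recurrent",
):
    print(key, is_safe_filename(key))
```

Only the colon variant fails the check; hyphens and underscores are both fine as far as file names go.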

mjpost commented 5 years ago

I prefer the visual aesthetic of colons, but (grudgingly) see the utility of having keys serve as filenames. I very much dislike underscores. Let's use hyphens.

mbollmann commented 5 years ago

I've changed it to hyphens now.

I've also tried to make bibkeys for proceedings volumes a bit more semantic, by using the venue acronym instead of author names and excluding a bunch of keywords (like "proceedings") from the title. That works pretty okay for some volumes, e.g. ws-2014-bionlp for W14-3400, but many others are still pretty awful. But at least it produces something a bit more reasonable for now.
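
Roughly, the volume keys could be built like this. The sketch is a guess at the approach, not the actual implementation: the exclusion list is assumed (only "proceedings" is named above), and both titles passed in are assumed or abbreviated inputs.

```python
import re

# Assumed exclusion list; only "proceedings" is mentioned explicitly.
EXCLUDED = {"proceedings", "of", "the", "on", "workshop", "conference", "annual", "meeting"}

def proceedings_key(venue_acronym: str, year: int, title: str) -> str:
    """Build {venue acronym}-{year}-{first non-excluded, non-numeric title word}."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    content = [w for w in words if w not in EXCLUDED and not w.isdigit()]
    first = content[0] if content else "proceedings"
    return f"{venue_acronym.lower()}-{year}-{first}"

print(proceedings_key("WS", 2014, "Proceedings of BioNLP 2014"))
print(proceedings_key("ACL", 1998, "36th Annual Meeting of the Association for Computational Linguistics"))
```

The second call shows why many volumes still come out badly: an ordinal like "36th" survives the filter and ends up as the "content word".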

> The only simple idea I have is
>
>     m = re.fullmatch(r'.*\((.*)\)\s*\[[Ii]n \S+\]\s*', s)
>     if m:
>         s = m.group(1)
>
> because our papers that are in other languages do often match that pattern, and in particular the ones in Chinese.

I've tried this and it matches a lot of papers, but I noticed that proceedings are not consistent in the ordering of English vs. non-English titles. For example, TALN 2014 has the English title first and the French one in brackets, in which case the regex would do exactly the wrong thing. Not sure how we could address that.
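
A small demonstration of the failure mode, using an invented title with the English part first (it is not an actual TALN 2014 entry). The pattern is an assumed variant that accepts "in"/"In" followed by a language name.

```python
import re

PATTERN = re.compile(r".*\((.*)\)\s*\[[Ii]n \S+\]\s*")  # assumed variant of the pattern

# Invented title: English first, French in parentheses.
title = "Parsing Made Easy (L'analyse syntaxique facile) [in French]"
m = PATTERN.fullmatch(title)
extracted = m.group(1) if m else title
print(extracted)  # the French part, not the English one
```

The regex has no way of knowing which half is English, so any fix would need some extra signal about the ordering used by each proceedings volume.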

mjpost commented 5 years ago

Thanks for all the work and discussion! This is now in place.

andreasvc commented 5 years ago

IMHO the Google Scholar bibkeys strike the best balance of minimalism and recognizability: no hyphens/colons, no et al, just a concatenation of first last name, year, first content word of title; e.g. smith2013parser.

mjpost commented 5 years ago

Thanks, @andreasvc. The Google Scholar format did serve as our starting point (as you saw at the start of the conversation). However, one issue with the Google keys is that they don't have to care about uniqueness of keys, but we do, since we publish a single Anthology BibTeX file that needs to be parse-able. For this reason, we expanded it a bit.

Also, I think having hyphens improves readability, e.g., junczysdowmunetal2018neural is a bit hard on the human parser.

aryamccarthy commented 5 years ago

Ready to close this?