acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Disambiguating author names #321

Open davidweichiang opened 5 years ago

davidweichiang commented 5 years ago

The following are potentially ambiguous names that are in need of someone to disambiguate.

Please feel free to add more names to the list.

Disambiguating a name means:

Eventually we hope to do this automatically, but in the meantime this is a manual process; moreover, some cases (e.g., when there are only two papers) may be difficult to automate.

mjpost commented 5 years ago

This also suggests: the web pages for ambiguous canonical names should display a list of the other author names ("You might also be looking for...") and also maybe a link to an issue creation page where people can note mistakes.

I further wonder—for ambiguous author names, perhaps the top-level canonical slug page (e.g., https://www.aclweb.org/anthology/people/y/yang-liu/) should be the disambiguation page alone, providing links to each separate author.

davidweichiang commented 5 years ago

I think those are good ideas. The first idea would work better if we had required unique canonical names. Otherwise, it's going to look like this:

Yang Liu
Published also as: Y. Liu
You might also be looking for: Yang Liu, Yang Liu, Yang Liu

which is not terribly informative.

The second idea would need some new kind of entry in the name variants file to indicate what slug should be used for the disambiguation page, like

- ambiguous: Yang Liu
  id: yang-liu
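
A minimal sketch of how such an entry might be consumed, assuming the name variants file has already been parsed into a list of dicts. Only the `ambiguous` key comes from the proposal above. The flat `canonical`/`id` layout of the person entries, the made-up IDs, and the function name `build_disambiguation_pages` are hypothetical and used for illustration only.

```python
from collections import defaultdict


def build_disambiguation_pages(entries):
    """Map each disambiguation slug to the IDs of the people sharing that name.

    Assumed (hypothetical) layout of the parsed entries:
      person entry:          {"canonical": <name>, "id": <unique id>}
      proposed new entry:    {"ambiguous": <name>, "id": <slug of the page>}
    """
    # First pass: collect the IDs of all people registered under each name.
    ids_by_name = defaultdict(list)
    for entry in entries:
        if "canonical" in entry:
            ids_by_name[entry["canonical"]].append(entry["id"])

    # Second pass: each "ambiguous" entry claims a slug that links to those IDs.
    pages = {}
    for entry in entries:
        if "ambiguous" in entry:
            pages[entry["id"]] = sorted(ids_by_name[entry["ambiguous"]])
    return pages


entries = [
    {"canonical": "Yang Liu", "id": "yang-liu-1"},  # made-up IDs
    {"canonical": "Yang Liu", "id": "yang-liu-2"},
    {"canonical": "Jane Doe", "id": "jane-doe"},
    {"ambiguous": "Yang Liu", "id": "yang-liu"},
]
pages = build_disambiguation_pages(entries)
```

Here `pages` maps the slug `yang-liu` to the two per-author IDs, so the slug page would only need to render links to each separate author.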

mjpost commented 5 years ago

The ID is the primary distinguishing feature, but if we supplemented the name variant with other information and stored it in tags (e.g., degree institution, name in Chinese characters or other non-English script, birth year), we could have some rules for displaying that information parenthetically.

e.g., (all made up data):

Yang Liu
Published also as: Y. Liu
You might also be looking for: Yang Liu (b. 1981), Yang Liu (Ph.D. UPenn, 2009), Yang Liu (BBN)
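
One possible set of display rules could look like the sketch below. The tag names (`birth_year`, `degree`, `affiliation`), their priority order, the record layout, and the function names are all assumptions for illustration; the data is the same made-up data as above.

```python
def describe(person):
    """Return the name, followed by the first available tag in parentheses."""
    tags = person.get("tags", {})
    if "birth_year" in tags:
        hint = f"b. {tags['birth_year']}"
    elif "degree" in tags:
        hint = tags["degree"]
    elif "affiliation" in tags:
        hint = tags["affiliation"]
    else:
        # No distinguishing information: fall back to the bare name.
        return person["name"]
    return f"{person['name']} ({hint})"


def also_looking_for(current, people):
    """List every other person who shares the current person's name."""
    others = [
        describe(p)
        for p in people
        if p["name"] == current["name"] and p["id"] != current["id"]
    ]
    return "You might also be looking for: " + ", ".join(others)


# All made-up data, with hypothetical IDs.
people = [
    {"id": "yang-liu-1", "name": "Yang Liu"},
    {"id": "yang-liu-2", "name": "Yang Liu", "tags": {"birth_year": 1981}},
    {"id": "yang-liu-3", "name": "Yang Liu", "tags": {"degree": "Ph.D. UPenn, 2009"}},
    {"id": "yang-liu-4", "name": "Yang Liu", "tags": {"affiliation": "BBN"}},
]
line = also_looking_for(people[0], people)
```

With this data, `line` reproduces the "You might also be looking for" line shown above.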

davidweichiang commented 5 years ago

OK, so we could:

  1. Incorporate this information into the canonical name. There's no requirement for the canonical name to appear on any actual paper; it would just appear at the top of the author page and in co-author lists.

  2. Use the existing comment field (in name_variants.yaml), i.e., the name displayed at the top of the author page would be <canonical> (<comment>). We'd just have to make sure that all the comments are presentable. In this case, we'd have more flexibility, e.g., to include this information at the top of the author page, but exclude it from co-author lists.

  3. More detailed fields in name_variants.yaml like birth_year, degree_institution, affiliations, alternative_names, etc.

(1) is simple (in particular, requiring canonical names to be unique allows them to be treated more uniformly with variant names, which are already required to be unique). But (2) is good too. I think (3) would be a bit much, although names in other scripts (as well as pronunciations) were something we were kicking around independently, and I still think they'd be nice to have.
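
To make option (2) concrete, here is a rough sketch, not actual Anthology code. The function name `display_name` and the `context` values are hypothetical; the only thing taken from the option itself is the `<canonical> (<comment>)` format and the idea of leaving the comment out of co-author lists.

```python
def display_name(canonical, comment=None, context="author_page"):
    """Format a name for display under option (2).

    The comment is shown only at the top of the author page; every other
    context (e.g. a co-author list) gets the bare canonical name.
    """
    if context == "author_page" and comment:
        return f"{canonical} ({comment})"
    return canonical


# Made-up comment, as in the examples above.
top_of_page = display_name("Yang Liu", "BBN")
in_coauthor_list = display_name("Yang Liu", "BBN", context="coauthor_list")
```

Because the comment is kept separate from the canonical name, the bare name stays available wherever the extra information is not wanted.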

mjpost commented 5 years ago

At some point we may need just the name on its own, which speaks against (1). I like (2): it's a nice, flexible, human-driven resolution to the problem.