Authors being stored under multiple spellings

villalbamartin commented 6 years ago

We currently have over 200 authors that show up with two or more names. Some examples:

Josef van Genabith shows as
- J. Genabith
- Josef (name) van Genabith (last name)
- Josef Van (name) Genabith (last name)
- Josef (name) Van Genabith (last name)
- Josef van (name) Genabith (last name)
- Joseph vanGenabith
- Josef Genabith
Laura Alonso i Alemany shows as
- Laura Alonso (name) i Alemany (last name)
- Laura (name) Alonso Alemany (last name)
- Laura Alonso (name) Alemany (last name)
Héctor Martínez Alonso shows as
- Hector Martinez Alonso
- Héctor Martínez Alonso

Although we could detect some of these automatically (especially those caused by a missing diacritic), I imagine we might be better off fixing this manually.

Any ideas on what's the best way to proceed?

villalbamartin commented 6 years ago

An update on this bug: @CTNLP and I have looked into the database, and fixing the issue manually is not really an option. The following query (which removes diacritics and capitalization) shows that we have at least 1643 duplicate authors (around 5% of our total author database):

select unaccent(lower(full_name)), count(unaccent(lower(full_name))) from people group by unaccent(lower(full_name)) having count(unaccent(lower(full_name))) > 1 order by unaccent(lower(full_name)) asc;

This query also doesn't capture cases where a person's name was written with or without one of its parts (such as the "i" in Laura Alonso i Alemany, as mentioned earlier).
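For anyone who wants to reproduce the grouping outside the database, here is a minimal Python sketch of the same idea. It only approximates `unaccent(lower(...))`: stripping Unicode combining marks does not cover every character that the PostgreSQL `unaccent` extension handles. The function names and the sample list are assumptions made for illustration, not part of the Anthology code.

```python
import unicodedata
from collections import defaultdict


def match_key(full_name):
    """Lowercase a name and drop combining marks (rough unaccent + lower)."""
    decomposed = unicodedata.normalize("NFKD", full_name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def find_duplicates(names):
    """Group distinct spellings by key and keep keys with several spellings."""
    groups = defaultdict(set)
    for name in names:
        groups[match_key(name)].add(name)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Sample names taken from the opening post.
names = [
    "Héctor Martínez Alonso",
    "Hector Martinez Alonso",
    "Laura Alonso i Alemany",
    "Laura Alonso Alemany",
]
duplicates = find_duplicates(names)
print(duplicates)
```

As with the SQL query, the two spellings of Laura Alonso i Alemany are not grouped, because their keys differ by more than case and diacritics.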

CTNLP commented 6 years ago

Thanks for figuring this out, @villalbamartin. I have added this to my tasks for next week. I will look at some of the examples to see whether there are any patterns that can be fixed easily.

CTNLP commented 6 years ago

Also, once @villalbamartin puts all the changes made in response to issue #70 on the production version of the anthology, I will contact the Google Scholar people to inform them that we have improved the information on the website.

villalbamartin commented 6 years ago

Here's the query I used to generate the list of duplicate names:

select unaccent(lower(full_name)), count(unaccent(lower(full_name))) from people group by unaccent(lower(full_name)) having count(unaccent(lower(full_name))) > 1 order by unaccent(lower(full_name)) asc;

CTNLP commented 6 years ago

I found some types of issues and we should discuss how/if we can address these. I might look for more later.

Non-complete list of problems (sometimes these issues are combined to create further variations):

- Problematic capitalization: e.g., "Aaditya Prakash" is for some reason all lower case in C16; "Janusz Stanislaw Bien" becomes "JANUSZ STANISLAW BIEN", most likely due to OCR issues; "jeih-weih Hung" is also written "jeih-Weih Hung"; "Jennifer Chu-Carroll" is once written as "Jennifer Chu-carroll"
- Occasional initialing: e.g., "Arantza Díaz de Ilarraza" is sometimes shortened to "A. Díaz de Ilarraza" or "Díaz de Ilarraza A."
- Floating initial: e.g., we have "István Nagy T." and "István T. Nagy"; this could be caused by where the initial ends up -- should it ever be in ?
- Initialing without '.': e.g., there is a "Diaz de Ilarraza A"
- Dropped initial: e.g., we have "István Nagy T." and "István Nagy"
- Dropping of diacritics: e.g., "Arantza Díaz de Ilarraza" sometimes becomes "Arantza Diaz de Ilarraza"
- Wrong diacritics: e.g., "Arantza Díaz de Ilarraza" sometimes becomes "Arantza Dìaz de Ilarraza"
- Strange punctuation: e.g., we have "HASIDA Koiti", "HASIDA. Koiti" and "HASIDA, Koiti ", (we also have " Koiti Hasida", "Kôiti Hasida" and "Koiti HASIDA"); this could be because the corresponding files result from OCR
- People with locations in their name: we have "Spain Lluís Màrquez" and "Lluís Màrquez" as well as "USA Dan Klein" and "Dan Klein" and "USA Octavian Popescu, IBM Watson Research Center" and "Octavian Popescu"; these are from recent papers too, so I have no idea what is going wrong there
- Incorrect first name/last name split: e.g., "Axel-Cyrille Ngonga Ngomo" is in our database both as "Axel-Cyrille NgongaNgomo" and "Axel-CyrilleNgonga Ngomo"
- Don't know what is going on: e.g., we have "JenniferChu-Carroll" in P04.xml and "JenniferChu-Carroll" in W00.xml; the system thinks these are different people, but I do not see the difference. Encoding issues?
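One way to investigate the last item is to compare the two strings code point by code point. The sketch below is a generic diagnostic and does not claim to identify the actual cause in P04.xml and W00.xml. The helper names and the example pair (ASCII hyphen-minus versus the typographic hyphen U+2010) are assumptions chosen for illustration.

```python
import unicodedata


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"


def first_difference(a, b):
    """Return (index, char_a, char_b) for the first mismatch, or None if equal."""
    for index in range(max(len(a), len(b))):
        x = describe(a[index]) if index < len(a) else None
        y = describe(b[index]) if index < len(b) else None
        if x != y:
            return index, x, y
    return None


# Hypothetical pair that looks identical in many fonts.
print(first_difference("Chu-Carroll", "Chu\u2010Carroll"))
```

If the two names only differ in composed versus decomposed accents, comparing `unicodedata.normalize("NFC", a)` with `unicodedata.normalize("NFC", b)` would show them as equal.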

CTNLP commented 6 years ago

There might actually be an easy fix for some of these problems: we could piggyback on whatever Google Scholar is already doing to figure out the authors of different papers (most likely asking the authors for help). Consider the case of Agnieszka Faleńska: she appears in the Anthology under two names, "Agnieszka Faleńska" and "Agnieszka Falenska", but in Google Scholar all her papers are on a single author page, so we could normalize authors by trying to resolve them against that service.

mjpost commented 5 years ago

It'd be great to start building up a list of names, say in the db/ folder. I think the format should include a canonical representation for a name, along with all the variants that are observed, in an easily parseable format like JSON or YAML. We could then use two approaches:

1. When the site is generated, the list could be read in to merge author pages. This will be easier when the site is statically generated.
2. We could use this list to automatically make corrections to the authoritative XML. I think we should take care here, however; it might make sense to correct metadata only if it does not match what is included in the PDF. In this case, we'd still need approach (1).

It's also worth noting that while one person doing this manually may be too much work, we could likely crowdsource it. People may be motivated to fix their own names, at least (we've had issues filed for this).

villalbamartin commented 5 years ago

I talked about this issue with some colleagues, and one suggestion was to use global identifiers like ORCID. In the long run, I think it would make sense, for instance, to talk to the Softconf maintainers to ensure that people provide their ORCID IDs when submitting, the same way that they make people fill in their Toronto Paper Matching System details when signing up as a reviewer.

In the short term, we might want to keep an ID-to-name table somewhere, where we match all known spellings of a name to an ID and then return one of those spellings as the canonical one. In this way, we would ensure that we don't need to revise the database after every conference.
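A minimal sketch of such a table, assuming in-memory storage and made-up identifiers (the `author-0001` ID and the class and method names are hypothetical, not an existing Anthology API):

```python
class NameTable:
    """Maps every known spelling to an author ID and each ID to one canonical spelling."""

    def __init__(self):
        self._canonical_by_id = {}
        self._id_by_spelling = {}

    def add(self, author_id, canonical, variants=()):
        """Register an author with a canonical spelling and any known variants."""
        self._canonical_by_id[author_id] = canonical
        for spelling in (canonical, *variants):
            self._id_by_spelling[spelling] = author_id

    def canonical_for(self, spelling):
        """Return the canonical spelling, or the input unchanged if it is unknown."""
        author_id = self._id_by_spelling.get(spelling)
        if author_id is None:
            return spelling
        return self._canonical_by_id[author_id]


table = NameTable()
# Spellings taken from the opening post; the ID is invented for this example.
table.add("author-0001", "Josef van Genabith", ["J. Genabith", "Josef Genabith"])
print(table.canonical_for("J. Genabith"))
print(table.canonical_for("Dan Klein"))
```

Note that this sketch lets each spelling point to only one ID, so it cannot represent two different people who share a spelling.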

Finally, I do support the idea of crowdsourcing the job, and I agree that authors submitting their own corrections might be the best way. I wonder if, parallel to that, we should also throw a couple bucks in Mechanical Turk's direction and get people to answer "yes, 'Martín Villalba' and 'Villalba, Martin' are the same person".

CTNLP commented 5 years ago

I think the ORCID solution could only prevent future problems, but it will not solve our current issues.

I am not sure the table is worth the trouble. I would tend to say no: figuring out which names are aliases and which ones are legitimately different people would be very hard in corner cases, which makes me think a cruder but more cost-effective solution might be better. For example, how can we be certain that something is a misspelling unless it matches a very clear pattern? This is actually co-written by Noah A. Smith, but our database has it as co-written by "Noah Smith". Does that mean we should have "Noah Smith" as an alias for "Noah A. Smith"? The reason he usually includes the "A." is that "Noah Smith" is a super common name. "NOAH A. SMITH", on the other hand, probably has the same referent as "Noah A. Smith", and a simple script can look for errors like that.

So I would suggest that we simply check against some patterns similar to the original SQL query that prompted this issue, and if those flag names as potential duplicates in our database, we check back with the publication chairs who want the proceedings added. This could be part of the automatic checks we do before adding proceedings:

1. Make sure your XML is valid for our schema.
2. Here is an automatically generated list of author names that seem like variants of existing authors; are you sure you want to submit these?
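The second check could be sketched roughly as follows. This is only an illustration under stated assumptions: the key ignores case, diacritics, and extra whitespace, and the function names and sample lists are hypothetical rather than taken from the ingestion scripts.

```python
import unicodedata


def pattern_key(name):
    """Key that ignores case, diacritics, and repeated whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.casefold().split())


def flag_possible_variants(existing, incoming):
    """Map each incoming name to existing names with the same key but another spelling."""
    known = {}
    for name in existing:
        known.setdefault(pattern_key(name), set()).add(name)
    report = {}
    for name in incoming:
        lookalikes = known.get(pattern_key(name), set()) - {name}
        if lookalikes:
            report[name] = sorted(lookalikes)
    return report


existing = ["Noah A. Smith", "Agnieszka Faleńska"]
incoming = ["NOAH A. SMITH", "Noah Smith", "Agnieszka Falenska", "Noah A. Smith"]
report = flag_possible_variants(existing, incoming)
print(report)
```

"Noah Smith" is deliberately not flagged here, since dropping a middle initial is not one of the clear patterns; the report is meant to be confirmed by the publication chairs rather than applied automatically.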

davidweichiang commented 5 years ago

I wonder if START has records of the START ids of authors of past papers?

davidweichiang commented 5 years ago

Related issue: Since START started automatically filling in author information from user profiles, we have had the problem that author names in the metadata don't appear as they do on the paper. Many papers have some authors with their surnames in all caps and some not; some authors have their names in all lowercase.

Going through the XML files, I noticed that when an author's name is in all lowercase, a bug causes it to appear in the XML as <firstname>david</firstname><lastname>chiang david</lastname>. This only happens from 2017 on.

knmnyn commented 5 years ago

Sounds like a bug in the way the cdrom.tgz files that we are currently using for the Anthology are being created. We have used that particular export from STARTV2 to populate the Anthology by converting the BibTeX to XML.

mjpost commented 5 years ago

I bet we could get the START ids from papers. I don't remember when START started using global accounts, but I seem to recall it was an effort spear-headed by CCB maybe 10 years ago.

Regarding pulling the name format from the global profile, I agree this is a problem. It's one I think should be addressed with a better camera-ready submission form in START. People don't realize (or think at all about the fact) that metadata is pulled from profiles, which were often created in haste as part of the reviewing cycle. I suggest the following changes to START's camera-ready submission form:

- Author data is provided in first name / last name text fields that are populated from the global profiles, but are editable.
- START generates the citation string (using LaTeX) and gets the submitter to sign off on it.

The general principle is to have these decisions and checks made as close to the ground as possible, so that we don't have to spend time on it.

davidweichiang commented 5 years ago

I think that sounds great. At the same time, conference submission instructions should make authors aware that user profiles matter. There’s an additional benefit to this, which is that the submitting author may not always know exactly how their co-authors want their names to be presented. I see this all the time in the bib files. If everyone updates their START profile to be the way they want it, it will improve consistency.

mbollmann commented 5 years ago

I added a mechanism to include author name variants now, so we can discuss more how we want to handle this. Right now it's a YAML file that only contains Héctor Martínez Alonso (as I just picked an example from the opening post).

Names are represented as <first> || <last> || <jr>, corresponding to the respective fields in the XML, but later parts can be omitted if they're empty. The YAML file maps the canonical variant of a name to a list of its variants.
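To make the string format concrete, here is a small sketch of how such an entry could be split back into fields. It is not the code in the repository; the function name is hypothetical, and treating a single part as the first name is an assumption.

```python
def parse_name(entry):
    """Split 'first || last || jr' into a dict; omitted trailing parts become ''."""
    parts = [part.strip() for part in entry.split(" || ")]
    if len(parts) > 3:
        raise ValueError(f"too many '||' separators in {entry!r}")
    parts += [""] * (3 - len(parts))
    first, last, jr = parts
    return {"first": first, "last": last, "jr": jr}


print(parse_name("Héctor || Martínez Alonso"))
print(parse_name("Hector || Martinez Alonso"))
```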

For integrating this into the website, I see two different approaches:

Conflate all variants with the canonical form. Only the canonical form will show up on the website in all contexts. This is how I currently implemented it (as it was the easiest approach, admittedly).
Keep and display the variants as they appear in the XML, but link them all to the same author page and list papers for all variants there. I can implement this if it is what people prefer.

Secondly, I started playing around with a script that generates a list of potential variants automatically. Right now it's not very sophisticated; it only conflates names that generate identical URL slugs, and applies some heuristics to try to determine the best canonical variant. Here is the name variant list it currently produces.
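For readers who have not seen the script, the slug-based idea can be sketched as below. This is a reconstruction under assumptions, not the actual script: the slug function is a simplified stand-in for whatever the site generator uses, and the canonical-variant heuristic (prefer mixed case, then more non-ASCII characters) is invented for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def slugify(name):
    """Simplified slug: strip diacritics, lowercase, hyphenate everything else."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


def pick_canonical(variants):
    """Hypothetical heuristic: prefer mixed case, then more non-ASCII characters."""
    def score(name):
        mixed_case = name != name.upper() and name != name.lower()
        diacritics = sum(1 for ch in name if ord(ch) > 127)
        return (mixed_case, diacritics)
    return max(sorted(variants), key=score)


def propose_variants(names):
    """Group names by slug and map each proposed canonical form to its variants."""
    by_slug = defaultdict(set)
    for name in names:
        by_slug[slugify(name)].add(name)
    proposals = {}
    for variants in by_slug.values():
        if len(variants) > 1:
            canonical = pick_canonical(variants)
            proposals[canonical] = sorted(variants - {canonical})
    return proposals


names = ["Hector Martinez Alonso", "Héctor Martínez Alonso", "HECTOR MARTINEZ ALONSO", "Dan Klein"]
proposals = propose_variants(names)
print(proposals)
```

The output of something like this would still need the manual check described in the next paragraph, since identical slugs can also belong to different people.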

If we wanted to continue with this approach, we could use this list as a starting point that we manually check and correct before adding it to the repo. I could also implement some more heuristics based on the list of problems above by @CTNLP. On the other hand, @CTNLP also raised some concerns about this approach in general, so I'd like some more opinions on this first.

Do we want to continue with this approach, and how exactly should we go about it?

mjpost commented 5 years ago

Regarding your second item, I have an email from Marti Hearst, who has been in touch with Sebastian Kohlmeier at Allen AI / Semantic Scholar. They have done something similar themselves and also have an API that we might use to build our own list. They report a precision / recall of 91% and 97%. Here's an example:

https://api.semanticscholar.org/v1/author/2865389
"name":"Alexey G. Murzin"
"aliases": ["Alexey Murzin", "Alexey G Murzin", "A. G. Murzin", "A G Murzin", "A Murzin"]

On the first point, I hate to say that I think we should keep the author names as presented on a page, but group them according to the canonical representation. The author page would be titled by this canonical form, and would also list the variants at the top (e.g., "Published as X, Y, and Z"). This would then be followed by their papers.

Small nit on the format: it seems preferable to me to have the author names as separate fields so that we rely on YAML parsing instead of having to do further string processing to split the names. On the other hand, this format is probably easier on human eyes. I don't have strong feelings here.

davidweichiang commented 5 years ago

Does the XML really have a <jr> field? I haven't included support for that in anything I've written that touches the XML.

I thought that the standard practice was to cite an author using the exact spelling used on the paper itself. So I agree with @mjpost that names should not be canonicalized, but of course search should capture all variants.

I also agree with @mjpost about having separate fields instead of separating with ||.

To what extent should names be corrected in the XML itself?

Incorrect first/last name splits: if the BibTeX is unambiguous, should the XML use the same split? (It is sometimes different.)
Ambiguous first/last name splits: in cases where the BibTeX is ambiguous, how should the XML be split? (An automatic splitter seems to have made a lot of mistakes on names with two surnames.)
Character variations like ș (s with cedilla) vs ş (s with comma below) -- the latter is correct but the former is commonly used in its place. Should they be normalized to the same character or should it follow the paper? (Practically, this actually means following the BibTeX -- but unfortunately, sometimes the BibTeX and the paper disagree!)
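On the character-variation question, it may help to see that the two letters are distinct code points that Unicode normalization alone does not merge. The sketch below uses explicit escapes so there is no ambiguity about which character is meant; the constant and function names are assumptions for illustration.

```python
import unicodedata

S_CEDILLA = "\u015f"      # LATIN SMALL LETTER S WITH CEDILLA
S_COMMA_BELOW = "\u0219"  # LATIN SMALL LETTER S WITH COMMA BELOW


def strip_marks(text):
    """Decompose the text and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for ch in (S_CEDILLA, S_COMMA_BELOW):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# NFC keeps the two letters apart; only removing the marks makes them match.
same_after_nfc = unicodedata.normalize("NFC", S_CEDILLA) == unicodedata.normalize("NFC", S_COMMA_BELOW)
same_after_strip = strip_marks(S_CEDILLA) == strip_marks(S_COMMA_BELOW)
print(same_after_nfc, same_after_strip)
```

So if the XML follows the paper or the BibTeX, search and author matching would still need a mark-insensitive key to treat the two spellings as one name.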

Since we're ACL, I think it would be neat to include @mjpost's pronunciation database in this .yaml file.

mbollmann commented 5 years ago

Does the XML really have a <jr> field? I haven't included support for that in anything I've written that touches the XML.

Hmm, it certainly had when I started working on the rewrite, but it seems to have been removed in the meantime.

I thought that the standard practice was to cite an author using the exact spelling used on the paper itself. So I agree with @mjpost that names should not be canonicalized, but of course search should capture all variants.

I agree with this notion; however, a large part of the name variants currently capture cases where the name in the XML isn't the exact spelling on the paper. For example, Anna Kupsc is actually spelled "Anna Kupść" on all of the linked papers. Another case is names in ALL CAPS, which should certainly be okay to always map to their properly cased variants.

You could argue that this should ultimately be corrected in the XML, of course, and not through this name variants feature on the website.

I also agree with @mjpost about having separate fields instead of separating with ||.

I'm pretty sure it's a safe assumption that no name entry ever will contain <space>||<space>, but okay. :)

I mainly thought it was more readable than using dicts everywhere, but I'm happy to change it to the latter.

mjpost commented 5 years ago

I looked at this revision and the YAML-parseable code is much uglier and harder for humans to work with. I guess I got what I deserved here.

I like the idea of adding pronunciations (and I fixed up the repo so that the pronunciations are a YAML file). A longer-term, more extensive idea that might fit in with @desilinguist's plans for the portal would be for users to enter this information into ACL portal profiles and for the Anthology to pull that in via an API.

mbollmann commented 5 years ago

FWIW, the corresponding revision (680e783) is pretty simple and easy to reverse...

Does anyone want to take up producing/curating a list of name variations? I like both the pronunciation database and the Semantic Scholar idea, but would like to leave this to someone not me, preferably. :) (Though I'm happy to recreate my simple name variant list based on diacritics and white-space conflation, if you think that's helpful.)

You can also see what the name variation handling currently produces on the live website now: http://www.aclweb.org/anthology/people/h/hector-martinez-alonso/

davidweichiang commented 5 years ago

I think you can also write the names as

- canonical: {first: Héctor, last: Martínez Alonso}
  variants:
    - {first: Hector, last: Martinez}
    - ...

That's not that bad, right?

And the new format leaves room for pronunciation and other information (e.g., link to Scholar/DBLP/personal page)

It's also a benefit that the following are equivalent:

- canonical: {last: Chiang, first: Wei}
- canonical: {first: Wei, last: Chiang}
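To illustrate the equivalence, the sketch below starts from the plain Python structures that a YAML parser would be expected to produce for entries in this format (the parsing step itself is left out). The helper names are hypothetical, and the variant shown is the example from the opening post rather than real file content.

```python
def name_key(name):
    """Turn a {first, last} mapping into a hashable (first, last) tuple."""
    return (name.get("first", ""), name.get("last", ""))


def build_variant_index(entries):
    """Map every canonical name and variant to its canonical (first, last) tuple."""
    index = {}
    for entry in entries:
        canonical = name_key(entry["canonical"])
        index[canonical] = canonical
        for variant in entry.get("variants", []):
            index[name_key(variant)] = canonical
    return index


# Assumed parser output for two entries in the proposed format.
entries = [
    {
        "canonical": {"first": "Héctor", "last": "Martínez Alonso"},
        "variants": [{"first": "Hector", "last": "Martinez Alonso"}],
    },
    {"canonical": {"last": "Chiang", "first": "Wei"}},
]
index = build_variant_index(entries)
print(index[("Hector", "Martinez Alonso")])
```

Because mappings compare by content, `{"last": "Chiang", "first": "Wei"}` and `{"first": "Wei", "last": "Chiang"}` are equal, which is the property pointed out above.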

davidweichiang commented 5 years ago

The end result on the website is awesome. Does "Published also as" really mean that the author actually published under these names, or is this a full list of variants listed in the YAML file? I can imagine that some people might add variants of their name for fun even though they have never actually published under that name.

mbollmann commented 5 years ago

The end result on the website is awesome. Does "Published also as" really mean that the author actually published under these names, or is this a full list of variants listed in the YAML file? I can imagine that some people might add variants of their name for fun even though they have never actually published under that name.

Good point. I think it's all listed variants right now, but that should be easy to change in create_hugo_yaml.py.

Also, {first: Noah, last: "A. Smith"} and {first: "Noah A.", last: Smith} will produce the same name string, so it might be confusing to see that "Noah A. Smith" also published as "Noah A. Smith". Maybe I should try to filter those cases out for display purposes as well?
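The filtering idea could look roughly like this. It is a sketch, not the code in create_hugo_yaml.py; the function names are hypothetical, and the extra "Noah Smith" variant is included only to show that genuinely different strings survive the filter.

```python
def render(name):
    """Join first and last name the way they would be displayed."""
    parts = (name.get("first", ""), name.get("last", ""))
    return " ".join(part for part in parts if part)


def display_variants(canonical, variants):
    """Return variant strings that differ visibly from the canonical name."""
    seen = {render(canonical)}
    shown = []
    for variant in variants:
        text = render(variant)
        if text not in seen:
            seen.add(text)
            shown.append(text)
    return shown


canonical = {"first": "Noah A.", "last": "Smith"}
variants = [{"first": "Noah", "last": "A. Smith"}, {"first": "Noah", "last": "Smith"}]
print(display_variants(canonical, variants))
```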

davidweichiang commented 5 years ago

What do you think about displaying last names in bold, like:

Noah **A. Smith**, Noah A. **Smith**

Because (in this particular case) this is an error in the data, and if authors see this on their page, they'll be able to submit a correction.

mjpost commented 5 years ago

Agreed, this looks really awesome.

David's formatting is much more readable and preserves YAML parsing. I like the idea of bolding the last name.

I have some other folks that volunteered to help with this. I will email them and see if someone wants to take this up.

mjpost commented 5 years ago

One thing that comes to mind that we haven't discussed is name collisions. There have to be people with the same names, even the same canonical representations, in the Anthology (if I recall at EMNLP, there was almost an entire page of Zhangs in the index). I see a couple of ways of dealing with this:

- We ignore it
- We convince ACL to adopt a Screen Actors Guild policy that you have to choose a unique name
- We extend the name format file to allow us to explicitly define the papers associated with authors, which would allow us to manually separate a sea of papers joined under a common name into two piles

davidweichiang commented 5 years ago

@mbollmann can you add your name variant script to the repo or somewhere else?

@mjpost Yes, we have at least four Yang Lius, at http://nlp.csai.tsinghua.edu.cn/~ly/, https://research.fb.com/people/liu-yang/, Edinburgh, and Fudan/liulishuo.

The START metadata includes plain-text information about authors' emails and affiliations (unfortunately no START ids). That would go a long way towards automating this process.

In the envisioned interface for users to edit their own profiles, would it be very difficult to also have them flag their papers?

mbollmann commented 5 years ago

The "correct" solution to name collisions, IMHO, would be to use unique IDs, not names, to identify authors in the XML. The website sort of already does this, as the slugs you see in the author URLs (e.g. hector-martinez-alonso) effectively act as IDs, and are also used to connect authors with their papers in the generated YAML files. They just correspond to name surface forms as that's the only info available in the XML.

Disambiguating authors by (a set of) e-mail addresses seems to be a really practical solution to me, especially if they could easily be included from the START metadata.
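A rough sketch of e-mail-based disambiguation, under the assumption that each author profile carries a set of known addresses. All IDs, addresses, and paper identifiers below are invented, and nothing here reflects the actual START metadata format.

```python
def assign_papers(profiles, papers):
    """Assign (paper_id, email) pairs to author IDs via known e-mail addresses."""
    email_to_author = {}
    for author_id, emails in profiles.items():
        for email in emails:
            email_to_author[email.lower()] = author_id

    assigned = {author_id: [] for author_id in profiles}
    unresolved = []
    for paper_id, email in papers:
        author_id = email_to_author.get(email.lower())
        if author_id is None:
            unresolved.append(paper_id)  # needs manual review
        else:
            assigned[author_id].append(paper_id)
    return assigned, unresolved


# Two hypothetical authors who share the same displayed name.
profiles = {
    "yang-liu-1": {"yliu@example.edu"},
    "yang-liu-2": {"yang.liu@example.org", "yl@example.com"},
}
papers = [
    ("paper-001", "yliu@example.edu"),
    ("paper-002", "YL@example.com"),
    ("paper-003", "unknown@example.net"),
]
assigned, unresolved = assign_papers(profiles, papers)
print(assigned, unresolved)
```

Papers whose address matches no profile stay unresolved instead of being guessed, which keeps the manual workload limited to the ambiguous cases.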

davidweichiang commented 5 years ago

Unless the email addresses are confidential? They aren't guaranteed to be the same address (if any) published on the paper itself.

mbollmann commented 5 years ago

What do you think about displaying last names in bold, like:

Noah **A. Smith**, Noah A. **Smith**

I'm trying this right now, but it only works if we do it in the title (= the canonical variant) as well. For consistency, that should also affect author pages that don't have name variants. It could look like this:

[Screenshot_20190321_150132: screenshot of an author page with last names in bold]

Opinions?

davidweichiang commented 5 years ago

I tried writing an automatic variant finder and here is the result:

https://gist.github.com/davidweichiang/344919c345f58a23f27bf4cf0b53f292

There are clearly some false positives but overall it seems like this is finding some good variants as well as turning up quite a lot of errors in the XML.

davidweichiang commented 5 years ago

The script is here: https://github.com/acl-org/acl-anthology/blob/auto_name_variants/bin/auto_name_variants.py

Improvements welcome, but I think it might be working well enough that a few people could hand-correct its output.

mjpost commented 5 years ago

I’m going to mark this as closed. There may still be some errors but I think we’ve largely addressed this.

acl-org / acl-anthology

Authors being stored under multiple spellings #86