Closed villalbamartin closed 5 years ago
An update on this bug: @CTNLP and I have looked into the database, and fixing the issue manually is not really an option. The following query (which removes diacritics and capitalization) shows that we have at least 1643 duplicate authors (around 5% of our total author database):
select unaccent(lower(full_name)), count(unaccent(lower(full_name))) from people group by unaccent(lower(full_name)) having count(unaccent(lower(full_name))) > 1 order by unaccent(lower(full_name)) asc;
This query also doesn't capture cases where a person was written with or without a name (such as Laura Alonso i Alemany, as mentioned earlier).
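For anyone who wants to try the same check outside the database, here is a rough Python sketch of the normalization the query applies. It is only an approximation: stripping Unicode combining marks is an assumption about what `unaccent` does, and it will not cover letters such as "ł" that have no decomposition. The function names and the sample list are made up for illustration.

```python
import unicodedata
from collections import Counter


def normalize(full_name: str) -> str:
    """Lowercase a name and strip combining marks (approximates unaccent(lower(...)))."""
    decomposed = unicodedata.normalize("NFKD", full_name.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def duplicate_groups(names):
    """Return {normalized_name: count} for names that collide after normalization."""
    counts = Counter(normalize(name) for name in names)
    return {key: n for key, n in sorted(counts.items()) if n > 1}


# Illustrative sample, not real database contents.
names = [
    "Agnieszka Faleńska",
    "Agnieszka Falenska",
    "Héctor Martínez Alonso",
    "NOAH A. SMITH",
    "Noah A. Smith",
]
print(duplicate_groups(names))
```

Like the SQL, this only catches differences in case and diacritics; it says nothing about names that differ by a missing part.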
Thanks for figuring this out, @villalbamartin. I have added this to my tasks for next week. I will look at some of the examples and try to see whether there are any patterns that can be fixed easily.
Also, once @villalbamartin puts all the changes made in response to issue #70 on the production version of the anthology, I will contact the Google Scholar people to inform them that we have improved the information on the website.
Here's the query I used to generate the list of duplicate names:
select unaccent(lower(full_name)), count(unaccent(lower(full_name))) from people group by unaccent(lower(full_name)) having count(unaccent(lower(full_name))) > 1 order by unaccent(lower(full_name)) asc;
I found some types of issues and we should discuss how/if we can address these. I might look for more later.
Incomplete List of Problems (sometimes these issues are combined to create further variations):
There might actually be an easy fix for some of these problems: we could piggyback on whatever Google Scholar is already doing to figure out the authors of different papers (most likely asking the authors for help). Consider the case of Agnieszka Faleńska: she appears in the Anthology under two names, "Agnieszka Faleńska" and "Agnieszka Falenska", but in Google Scholar all her papers are on her author page, so we could normalize authors by trying to resolve them against that service.
It'd be great to start building up a list of names, say in the db/ folder. I think the format should include a canonical representation for a name, along with all the variants that are observed, in an easily parseable format like JSON or YAML. We could then use two approaches:
1. When the site is generated, the list could be read in to merge author pages. This will be easier when the site is statically generated.
2. We could use this list to automatically make corrections to the authoritative XML. I think we should take care here, however; it might make sense to correct metadata only if it does not match what is included in the PDF. In this case, we'd still need approach (1).
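To make the first approach a bit more concrete, here is a minimal sketch of how a canonical-to-variants list could be used to merge author pages at site generation time. Everything in it is hypothetical: the `NAME_VARIANTS` structure, the paper IDs, and the function names are placeholders, not anything that exists in the repo.

```python
from collections import defaultdict

# Hypothetical variant list: canonical name -> variants observed in the metadata.
NAME_VARIANTS = {
    "Agnieszka Faleńska": ["Agnieszka Falenska"],
}


def build_lookup(variants):
    """Map every known spelling (canonical or variant) to its canonical form."""
    lookup = {}
    for canonical, alternatives in variants.items():
        lookup[canonical] = canonical
        for alternative in alternatives:
            lookup[alternative] = canonical
    return lookup


def group_papers(papers, variants):
    """Group (paper_id, author_name) pairs under one page per canonical name.

    The name as it appears in the metadata is kept next to each paper,
    so nothing is rewritten; only the grouping changes.
    """
    lookup = build_lookup(variants)
    pages = defaultdict(list)
    for paper_id, author in papers:
        pages[lookup.get(author, author)].append((paper_id, author))
    return dict(pages)


# Made-up paper IDs for illustration.
papers = [("paper-1", "Agnieszka Faleńska"), ("paper-2", "Agnieszka Falenska")]
print(group_papers(papers, NAME_VARIANTS))
```

Names that are not in the list fall through unchanged, so the list can start small and grow as corrections come in.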
It's also worth noting that while one person doing this manually may be too much work, we could likely crowdsource it. People may be motivated to fix their own names, at least (we've had issues filed for this).
I talked about this issue with some colleagues, and one suggestion was to use global identifiers like ORCID. In the long run, I think it would make sense, for instance, to talk to the Softconf maintainers to ensure that people provide their ORCID IDs when submitting, the same way that they make people fill in their Toronto Paper Matching System details when signing up as a reviewer.
In the short term, we might want to keep an ID-to-name table somewhere, where we match all known spellings of a name to an ID and then return one of those spellings as the canonical one. In this way, we would ensure that we don't need to revise the database after every conference.
Finally, I do support the idea of crowdsourcing the job, and I agree that authors submitting their own corrections might be the best way. I wonder if, parallel to that, we should also throw a couple bucks in Mechanical Turk's direction and get people to answer "yes, 'Martín Villalba' and 'Villalba, Martin' are the same person".
I think the ORCID solution could only prevent future problems, but it will not solve our current issues.
I am not sure the table is worth the trouble. I would tend to say no: figuring out which names are aliases and which ones are legitimately different people would be very hard in corner cases, which makes me think a cruder but more cost-effective solution might be better. For example, how can we be certain that something is a misspelling unless it matches a very clear pattern? This is actually co-written by Noah A. Smith, but our database has it as co-written by "Noah Smith". Does that mean we should have "Noah Smith" as an alias for "Noah A. Smith"? The reason he usually includes the "A." is that "Noah Smith" is a super common name. By contrast, "NOAH A. SMITH" probably has the same referent as "Noah A. Smith", and a simple script can look for errors like that.
So I would suggest that we simply check against some patterns similar to the original SQL requests that prompted this issue, and if those flag names as having potential duplicates in our database, we check back with the publication chairs who want the proceedings added. This could be part of the automatic checks we do before adding proceedings:
1. Make sure your XML is valid for our schema.
2. Here is an automatically generated list of author names that seem like variants of existing authors; are you sure you want to submit these?
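A pre-ingestion check along these lines could be sketched as follows. This is only an illustration under assumptions: the pattern used (same name after ignoring case, diacritics, and extra whitespace) is one possible "very clear pattern", and the function names and sample data are invented for the example.

```python
import unicodedata


def pattern_key(name: str) -> str:
    """Collapse case, diacritics, and repeated whitespace into a comparison key."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def flag_possible_variants(incoming, existing):
    """Return {incoming_name: [existing names it might be a variant of]}.

    Exact matches are not flagged; only names that differ on the surface
    but share a comparison key are reported for a human to confirm.
    """
    by_key = {}
    for name in existing:
        by_key.setdefault(pattern_key(name), set()).add(name)
    flagged = {}
    for name in incoming:
        matches = by_key.get(pattern_key(name), set()) - {name}
        if matches:
            flagged[name] = sorted(matches)
    return flagged


existing = ["Noah A. Smith", "Anna Kupść"]
incoming = ["NOAH A. SMITH", "Anna Kupsc", "Noah Smith", "Noah A. Smith"]
print(flag_possible_variants(incoming, existing))
```

Note that "Noah Smith" is deliberately left unflagged here, since a missing middle initial is exactly the kind of corner case that a simple pattern cannot decide.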
I wonder if START has records of the START ids of authors of past papers?
Related issue: Since START started automatically filling in author information from user profiles, we have had the problem that author names in the metadata don't appear as they do on the paper. Many papers have some authors with their surnames in all caps and some not; some authors have their names in all lowercase.
Going through the XML files, I noticed that when an author's name is in all lowercase, a bug causes it to appear in the XML as <firstname>david</firstname><lastname>chiang david</lastname>. This only happens from 2017 on.
Sounds like a bug in the way the cdrom.tgz files that we are currently using for the Anthology are being created. We have used that particular export from STARTV2 to convert the BibTeX to XML.
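Until the export is fixed, affected entries could probably be found with a small scan like the one below. It is a sketch that assumes the element names shown in the example above (`firstname`, `lastname`); the enclosing `paper` and `author` elements and the sample document are assumptions made for illustration and may not match the real files.

```python
import xml.etree.ElementTree as ET

# Made-up sample; only the firstname/lastname pattern comes from the thread.
SAMPLE = """<paper>
  <author><firstname>david</firstname><lastname>chiang david</lastname></author>
  <author><firstname>Matt</firstname><lastname>Post</lastname></author>
</paper>"""


def suspicious_authors(xml_text):
    """Find authors whose lowercase first name is repeated at the end of the last name."""
    root = ET.fromstring(xml_text)
    hits = []
    for author in root.iter("author"):
        first = (author.findtext("firstname") or "").strip()
        last = (author.findtext("lastname") or "").strip()
        # The bug pattern: first name has no uppercase letters, and the
        # last name field ends with a space followed by the first name.
        if first and first == first.lower() and last.endswith(" " + first):
            hits.append((first, last))
    return hits


print(suspicious_authors(SAMPLE))
```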
I bet we could get the START ids from papers. I don't remember when START started using global accounts, but I seem to recall it was an effort spearheaded by CCB maybe 10 years ago.
Regarding pulling the name format from the global profile, I agree this is a problem. It's one I think should be addressed with a better camera-ready submission form in START. People don't realize (or think at all about the fact) that metadata is pulled from profiles, which were often created in haste as part of the reviewing cycle. I suggest the following changes to START's camera-ready submission form:
- Author data is provided in first name / last name text fields that are populated from the global profiles, but are editable.
- START generates the citation string (using LaTeX) and gets the submitter to sign off on it.
The general principle is to have these decisions and checks made as close to the ground as possible, so that we don't have to spend time on it.
I think that sounds great. At the same time, conference submission instructions should make authors aware that user profiles matter. There’s an additional benefit to this, which is that the submitting author may not always know exactly how their co-authors want their names to be presented. I see this all the time in the bib files. If everyone updates their START profile to be the way they want it, it will improve consistency.
I added a mechanism to include author name variants now, so we can discuss more how we want to handle this. Right now it's a YAML file that only contains Héctor Martínez Alonso (as I just picked an example from the opening post).
Names are represented as <first> || <last> || <jr>, corresponding to the respective fields in the XML, but later parts can be omitted if they're empty. The YAML file maps the canonical variant of a name to a list of its variants.
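For readers who want to see what consuming this format involves, here is a minimal parsing sketch. It is not the code in the repo; the function name and the returned dict layout are assumptions, and it simply splits on `||` and pads omitted trailing parts with empty strings.

```python
def parse_name(entry: str) -> dict:
    """Split '<first> || <last> || <jr>' into fields; omitted trailing parts become ''."""
    parts = [part.strip() for part in entry.split("||")]
    parts += [""] * (3 - len(parts))  # pad when <last> and/or <jr> are left out
    first, last, jr = parts[:3]
    return {"first": first, "last": last, "jr": jr}


print(parse_name("Héctor || Martínez Alonso"))
print(parse_name("Hector || Martinez"))
```

The extra splitting step is the cost of the compact notation: a YAML parser alone only sees one string per name.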
For integrating this into the website, I see two different approaches:
Secondly, I started playing around with a script that generates a list of potential variants automatically. Right now it's not very sophisticated; it only conflates names that generate identical URL slugs, and applies some heuristics to try to determine the best canonical variant. Here is the name variant list it currently produces.
If we wanted to continue with this approach, we could use this list as a starting point that we manually check and correct before adding it to the repo. I could also implement some more heuristics based on the list of problems above by @CTNLP. On the other hand, @CTNLP also raised some concerns about this approach in general, so I'd like some more opinions on this first.
Do we want to continue with this approach, and how exactly should we go about it?
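To illustrate the slug-based conflation mentioned above, here is a rough sketch. It is not the actual script: the slug function is an approximation of what the site might do, and the heuristic for picking the canonical variant (prefer mixed case, then more non-ASCII characters) is an assumption chosen for the example.

```python
import re
import unicodedata
from collections import defaultdict


def slugify(name: str) -> str:
    """Approximate URL slug: strip diacritics, lowercase, join words with '-'."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


def pick_canonical(variants):
    """Assumed heuristic: prefer mixed-case spellings, then those with more diacritics."""
    def score(name):
        mixed_case = not (name.isupper() or name.islower())
        non_ascii = sum(1 for ch in name if ord(ch) > 127)
        return (mixed_case, non_ascii, name)
    return max(variants, key=score)


def conflate(names):
    """Return {canonical: [other variants]} for names sharing the same slug."""
    groups = defaultdict(set)
    for name in names:
        groups[slugify(name)].add(name)
    result = {}
    for variants in groups.values():
        if len(variants) > 1:
            canonical = pick_canonical(variants)
            result[canonical] = sorted(variants - {canonical})
    return result


names = [
    "Héctor Martínez Alonso",
    "Hector Martinez Alonso",
    "HECTOR MARTINEZ ALONSO",
    "Anna Kupść",
    "Anna Kupsc",
    "Matt Post",
]
print(conflate(names))
```

One known weakness of this approximation is that characters without a Unicode decomposition (such as "ł") are dropped rather than mapped to a base letter, so the output would still need manual checking.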
Regarding your second item, I have an email from Marti Hearst, who has been in touch with Sebastian Kohlmeier at Allen AI / Semantic Scholar. They have done something similar themselves and also have an API that we might use to build our own list. They report a precision / recall of 91% and 97%. Here's an example:
https://api.semanticscholar.org/v1/author/2865389
"name":"Alexey G. Murzin"
"aliases": ["Alexey Murzin", "Alexey G Murzin", "A. G. Murzin", "A G Murzin", "A Murzin"]
On the first point, I hate to say that I think we should keep the author names as presented on a page, but group them according to the canonical representation. The author page would be titled by this canonical form, and would also list the variants at the top (e.g., "Published as X, Y, and Z"). This would then be followed by their papers.
Small nit on the format: it seems preferable to me to have the author names as separate fields so that we rely on YAML parsing instead of having to do further string processing to split the names. On the other hand, this format is probably easier on human eyes. I don't have strong feelings here.
Does the XML really have a <jr> field? I haven't included support for that in anything I've written that touches the XML.
I thought that the standard practice was to cite an author using the exact spelling used on the paper itself. So I agree with @mjpost that names should not be canonicalized, but of course search should capture all variants.
I also agree with @mjpost about having separate fields instead of separating with ||.
To what extent should names be corrected in the XML itself?
Since we're ACL, I think it would be neat to include @mjpost's pronunciation database in this .yaml file.
> Does the XML really have a <jr> field? I haven't included support for that in anything I've written that touches the XML.
Hmm, it certainly had when I started working on the rewrite, but it seems to have been removed in the meantime.
> I thought that the standard practice was to cite an author using the exact spelling used on the paper itself. So I agree with @mjpost that names should not be canonicalized, but of course search should capture all variants.
I agree with this notion; however, a large part of the name variants currently capture cases where the name in the XML isn't the exact spelling on the paper. For example, Anna Kupsc is actually spelled "Anna Kupść" on all of the linked papers. Another case is names in ALL CAPS, which it should always be okay to map to their properly cased variants.
You could argue that this should ultimately be corrected in the XML, of course, and not through this name variants feature on the website.
> I also agree with @mjpost about having separate fields instead of separating with ||.
I'm pretty sure it's a safe assumption that no name entry ever will contain <space>||<space>, but okay. :)
I mainly thought it was more readable than using dicts everywhere, but I'm happy to change it to the latter.
I looked at this revision and the YAML-parseable code is much uglier and harder for humans to work with. I guess I got what I deserved here.
I like the idea of adding pronunciations (and I fixed up the repo so that the pronunciations are a YAML file). A longer-term, more extensive idea that might fit in with @desilinguist's plans for the portal would be for users to enter this information into ACL portal profiles and for the Anthology to pull that in via an API.
FWIW, the corresponding revision (680e783) is pretty simple and easy to reverse...
Does anyone want to take up producing/curating a list of name variations? I like both the pronunciation database and the Semantic Scholar idea, but would like to leave this to someone not me, preferably. :) (Though I'm happy to recreate my simple name variant list based on diacritics and white-space conflation, if you think that's helpful.)
You can also see what the name variation handling currently produces on the live website now: http://www.aclweb.org/anthology/people/h/hector-martinez-alonso/
I think you can also write the names as
- canonical: {first: Héctor, last: Martínez Alonso}
  variants:
  - {first: Hector, last: Martinez}
  - ...
That's not that bad, right?
And the new format leaves room for pronunciation and other information (e.g., link to Scholar/DBLP/personal page)
It's also a benefit that the following are equivalent:
- canonical: {last: Chiang, first: Wei}
- canonical: {first: Wei, last: Chiang}
The end result on the website is awesome. Does "Published also as" really mean that the author actually published under these names, or is this a full list of variants listed in the YAML file? I can imagine that some people might add variants of their name for fun even though they have never actually published under that name.
Good point. I think it's all listed variants right now, but that should be easy to change in create_hugo_yaml.py.
Also, {first: Noah, last: "A. Smith"} and {first: "Noah A.", last: Smith} will produce the same name string, so it might be confusing to see that "Noah A. Smith" also published as "Noah A. Smith". Maybe I should try to filter those cases out for display purposes as well?
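Filtering those cases for display could look roughly like this. It is a sketch, not what create_hugo_yaml.py does: the function names are hypothetical, and joining first and last name with a single space is an assumption about how the name string is built.

```python
def name_string(name: dict) -> str:
    """Join the non-empty first and last name parts with a single space."""
    parts = (name.get("first", ""), name.get("last", ""))
    return " ".join(part for part in parts if part)


def display_variants(canonical: dict, variants: list) -> list:
    """List variant strings, skipping any that render the same as the canonical
    name or as a variant that was already listed."""
    seen = {name_string(canonical)}
    shown = []
    for variant in variants:
        rendered = name_string(variant)
        if rendered not in seen:
            seen.add(rendered)
            shown.append(rendered)
    return shown


canonical = {"first": "Noah A.", "last": "Smith"}
variants = [{"first": "Noah", "last": "A. Smith"}, {"first": "Noah", "last": "Smith"}]
print(display_variants(canonical, variants))
```

The trade-off is that hiding such variants also hides the underlying data error, so authors would be less likely to notice and report it.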
What do you think about displaying last names in bold, like:
Noah A. **Smith**, Noah **A. Smith**
Because (in this particular case) this is an error in the data, and if authors see this on their page, they'll be able to submit a correction.
Agreed, this looks really awesome.
David's formatting is much more readable and preserves YAML parsing. I like the idea of bolding the last name.
I have some other folks that volunteered to help with this. I will email them and see if someone wants to take this up.
One thing that comes to mind that we haven't discussed is name collisions. There have to be people with the same names, even the same canonical representations, in the Anthology (if I recall at EMNLP, there was almost an entire page of Zhangs in the index). I see a couple of ways of dealing with this:
@mbollmann can you add your name variant script to the repo or somewhere else?
@mjpost Yes, we have at least four Yang Lius, at http://nlp.csai.tsinghua.edu.cn/~ly/, https://research.fb.com/people/liu-yang/, Edinburgh, and Fudan/liulishuo.
The START metadata includes plain-text information about authors' emails and affiliations (unfortunately no START ids). That would go a long way towards automating this process.
In the envisioned interface for users to edit their own profiles, would it be very difficult to also have them flag their papers?
The "correct" solution to name collisions, IMHO, would be to use unique IDs, not names, to identify authors in the XML. The website sort of already does this, as the slugs you see in the author URLs (e.g. hector-martinez-alonso) effectively act as IDs, and are also used to connect authors with their papers in the generated YAML files. They just correspond to name surface forms as that's the only info available in the XML.
Disambiguating authors by (a set of) e-mail addresses seems to be a really practical solution to me, especially if they could easily be included from the START metadata.
Unless the email addresses are confidential? They aren't guaranteed to be the same address (if any) published on the paper itself.
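As a thought experiment on how e-mail-based disambiguation might work, here is a small sketch that clusters author mentions whenever they share at least one address. It assumes each mention comes with a set of e-mail addresses, which is not something the current XML provides; the addresses below are invented, and the confidentiality concern above would still need to be settled before using real ones.

```python
def cluster_by_email(mentions):
    """Group mentions that share an e-mail address, directly or transitively.

    mentions: list of (name, set_of_emails).
    Returns a sorted list of clusters, each a sorted list of mention indices.
    """
    parent = list(range(len(mentions)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    owner = {}  # lowercased e-mail -> index of the first mention that used it
    for i, (_name, emails) in enumerate(mentions):
        for email in emails:
            key = email.lower()
            if key in owner:
                parent[find(i)] = find(owner[key])
            else:
                owner[key] = i

    clusters = {}
    for i in range(len(mentions)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Invented addresses for illustration only.
mentions = [
    ("Yang Liu", {"yl@uni-a.example"}),
    ("Yang Liu", {"yang@uni-b.example"}),
    ("Y. Liu", {"yl@uni-a.example", "yliu@corp.example"}),
    ("Yang Liu", {"yliu@corp.example"}),
]
print(cluster_by_email(mentions))
```

In this toy input, three of the mentions end up as one person even though two of them never share an address directly, while the second "Yang Liu" stays separate despite the identical name.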
> What do you think about displaying last names in bold, like:
> Noah A. **Smith**, Noah **A. Smith**
I'm trying this right now, but it only works if we do it in the title (= the canonical variant) as well. For consistency, that should also affect author pages that don't have name variants. It could look like this:
Opinions?
I tried writing an automatic variant finder and here is the result:
https://gist.github.com/davidweichiang/344919c345f58a23f27bf4cf0b53f292
There are clearly some false positives but overall it seems like this is finding some good variants as well as turning up quite a lot of errors in the XML.
The script is here: https://github.com/acl-org/acl-anthology/blob/auto_name_variants/bin/auto_name_variants.py
Improvements welcome, but I think it might be working well enough that a few people could hand-correct its output.
I’m going to mark this as closed. There may still be some errors but I think we’ve largely addressed this.
We currently have over 200 authors that show up with two or more names. Some examples:
Although we could detect some of these automatically (especially those caused by a missing diacritic), I imagine we might be better off fixing this manually.
Any ideas on what's the best way to proceed?