Open nschneid opened 4 years ago
Unlike #590, I think this is more important to fix, because our BibTeX styles do not change case in author names. But getting the heuristics right could be tricky.
I believe that Balamurali AR is not an edge case; there are a lot of South Asian names that use initials without periods.
Also, MCKINLEY would not be correctly lowercased by this heuristic; that could be a somewhat common case.
There might be some authors who insist on having their names in all caps or all lowercase. I think I would be okay with using name_variants.yaml to record these as exceptions.
I went through the all-caps names in EMNLP 2019, and most were Chinese surnames. I suspect it is a convention in China to write romanized surnames in all-caps.
If we're really worried about mckinley/MCKINLEY and similar, we could have an additional heuristic which matches against existing names in the database.
There's already some code to match against existing author names. It could be updated and improved, and that might address this problem partly.
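To make the "match against existing names" idea concrete, here is a minimal sketch. It is not the existing matching code in the repository; the function names and the plain list of known names are hypothetical, and it assumes that a case-insensitive match with exactly one recorded spelling is safe to reuse.

```python
from collections import defaultdict


def build_case_index(known_names):
    """Map each casefolded name to the set of spellings already on record."""
    index = defaultdict(set)
    for name in known_names:
        index[name.casefold()].add(name)
    return index


def recase_from_known(name, index):
    """Return the recorded spelling if exactly one exists, else the input unchanged."""
    candidates = index.get(name.casefold(), set())
    if len(candidates) == 1:
        return next(iter(candidates))
    # Unknown name, or several competing spellings: leave it for a human.
    return name


# Hypothetical sample data, standing in for names already in the database.
index = build_case_index(["McKinley", "Lu", "Balamurali AR"])
print(recase_from_known("MCKINLEY", index))
print(recase_from_known("SMITH", index))
```

A name that has never been seen before, or that is recorded under more than one spelling, falls through unchanged, so this would only address the problem partly.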
I've suggested in the past that we might consider contacting some people and asking them to update their START profiles. ACL 2020 is asking them to do it right now anyway.
I just looked at the EMNLP 2019 list too and saw a couple of French names and an Indian name where the surname was in all caps. I agree that your heuristic is going to be 99% correct for names written in all caps.
But I think names like Balamurali AR are common enough to worry about. It won't do to put in an exception for names that are two or three letters long, because many Chinese names are also two or three letters long.
Would it be too specific to apply your heuristic only to names that are written in Pinyin, which is very easy to check?
I don't know how that is checked but it should cover most of the cases. Maybe the rest should require a manual decision to whitelist or truecase.
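One way such a check might look, as a rough sketch only: treat a name part as Pinyin-like if it splits into one or two toneless syllables of the form initial + final. This is a loose approximation, not a validated Pinyin syllable table, so it accepts some combinations that are not real syllables. The two-syllable limit, the use of "v" for "ü", and all names below are assumptions made for illustration.

```python
import re

# Toneless Pinyin initials and finals (approximate; "ü" is assumed to be typed as "v" or "u").
INITIALS = "zh|ch|sh|[bpmfdtnlgkhjqxrzcsyw]"
FINALS = (
    "iang|iong|uang|ueng|ang|eng|ong|iao|ian|ing|uai|uan|"
    "ai|ei|ao|ou|an|en|er|ia|ie|iu|in|ua|uo|ui|un|ue|ve|a|o|e|i|u|v"
)
# Finals that can stand alone as a syllable without an initial.
ZERO_INITIAL = "ang|eng|ai|ei|ao|ou|an|en|er|a|o|e"

SYLLABLE = f"(?:(?:{INITIALS})(?:{FINALS})|(?:{ZERO_INITIAL}))"
# Assumption: a surname or given name has at most two syllables.
PINYIN_PART = re.compile(f"(?:{SYLLABLE}){{1,2}}", re.IGNORECASE)


def looks_like_pinyin(name_part):
    """Return True if the whole name part parses as one or two Pinyin-like syllables."""
    return PINYIN_PART.fullmatch(name_part) is not None


for part in ["ZHANG", "LU", "AR", "MCKINLEY"]:
    print(part, looks_like_pinyin(part))
```

Under these assumptions ZHANG and LU pass while AR and MCKINLEY do not, which is the distinction discussed above. Short non-Chinese tokens that happen to parse as a syllable would still pass, so the remaining cases would need the manual decision.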
And the manual decision can usually be made by checking the PDF. Even better if we could scrape the author capitalization from the PDF, but that might be too hard.
We do have a script that scrapes from PDF. It is not run regularly, though. And sometimes authors use all caps in the PDF too.
The Pinyin filter would be a good 90% solution; my main worry is that a language-specific rule could be perceived as discriminatory.
Eh...it seems to me the status quo is (unintentionally) discriminatory against people whose surnames are sometimes entered in all-caps, because inconsistencies will make it harder to browse their work. And these are disproportionately people from China. So it makes sense to correct that, and ideally not in a way that hypercorrects South Asian abbreviated names.
Ideally this would be something that START would encourage authors to specify consistently in the first place ("You entered 'LU', but ACL style is to use only initial capitals within names. Did you mean 'Lu'?"). But we don't really have control over that.
I agree that ideally this should happen earlier than ingestion into the Anthology, because names from START also appear in the conference website, handbook, etc.
I tried a simpler version of these heuristics on the EMNLP 2018 authors, and it worked perfectly except for one possible false positive (the first name "cmcc"). The heuristic is:
FWIW, START does have a tool in the pub chair console for correcting case problems in both titles and authors. I don't know whether it is regularly used. It also makes some mistakes (e.g., III is converted to Iii, and di is not converted to Di even if part of a Chinese name). And presumably changes to author names are not propagated back up to the global profile.
Running this heuristic on the current Anthology authors yields 872 corrections. There are some false positives, though. Some seem fixable (MAXWELL III -> Maxwell Iii) but some seem tougher, especially corporate authors like ARC A3 or TIPSTER SE/CM.
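The MAXWELL III case looks fixable by special-casing generational suffixes during title-casing. A minimal sketch, assuming only II, III, and IV are treated as suffixes (single letters such as I, V, and X are left out because they could be initials); the function name is hypothetical and this is not the heuristic that produced the 872 corrections.

```python
import re

# Assumption: only these tokens are treated as generational suffixes.
ROMAN_SUFFIX = re.compile(r"II|III|IV", re.IGNORECASE)


def titlecase_name(name):
    """Title-case each whitespace-separated token, keeping Roman-numeral suffixes uppercase."""
    parts = []
    for token in name.split():
        if ROMAN_SUFFIX.fullmatch(token):
            parts.append(token.upper())
        else:
            # capitalize() uppercases the first character and lowercases the rest.
            parts.append(token.capitalize())
    return " ".join(parts)


print(titlecase_name("MAXWELL III"))
```

Corporate authors such as ARC A3 or TIPSTER SE/CM would still come out wrong with this approach, so they would need to be recorded as exceptions.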
Nice! Could we run this periodically and record the exceptions as having been manually checked?
It would be a tedious process each time. I am hoping that START will incorporate something like this so we don't have to deal with it. But otherwise, it would make most sense, I think, to have it run automatically at ingestion time.
With commit ab92b62a8eb2d99c88abaf8330dc11261cb382d6, ingest.py now prompts the ingestor to confirm capitalization when it discovers all-lowercase or all-uppercase names. Truecasing would be a better approach, but I think this quick fix probably captures 99% of instances.
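For readers who have not looked at the commit, the general shape of such a prompt could be something like the sketch below. This is an illustration only, not the code in ingest.py: the function names, the prompt wording, and the rule that single-letter names are skipped are all assumptions.

```python
def needs_case_check(name):
    """Flag names with more than one letter that are entirely upper- or lowercase."""
    letter_count = sum(ch.isalpha() for ch in name)
    # Assumption: a single letter (e.g. an initial) is not worth prompting about.
    return letter_count > 1 and (name.isupper() or name.islower())


def confirm_case(name, ask=input):
    """Ask the ingestor whether to replace a suspicious name with its title-cased form."""
    if not needs_case_check(name):
        return name
    suggestion = name.title()
    answer = ask(f"Name '{name}' is all one case. Use '{suggestion}' instead? [y/N] ")
    return suggestion if answer.strip().lower().startswith("y") else name


# Non-interactive usage: pass a stand-in for input() that always answers "y".
print(confirm_case("LU", ask=lambda prompt: "y"))
print(confirm_case("McKinley", ask=lambda prompt: "y"))
```

Because the ingestor has the final say, names like Balamurali AR can be kept as entered by answering "n".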
I agree START should do this, but it also seems fixable at ingestion time, which is a longer-term solution.
Related to #638, #641: Many author names in EMNLP 2019 are all-caps or all-lowercase, presumably because that is how they appear in START. It seems impractical to fix them manually for every conference. Should there be a heuristic in the ingestion script that corrects these? For example:
The canonical form in name_variants.yaml could serve as a whitelist for known exceptions, e.g. "Balamurali AR". Note that the above heuristics preserve mixed-case names like ChengXiang and McKinley, so these do not need whitelisting.
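To illustrate how the whitelist could interact with automatic correction, here is a hedged sketch. It is not the heuristic proposed in this issue. It assumes the canonical names have already been loaded into a plain set of "First Last" strings; reading them from name_variants.yaml is not shown, and the function name is hypothetical.

```python
def normalize_case(first, last, canonical_names):
    """Title-case all-caps or all-lowercase name parts unless the full name is whitelisted."""
    full = f"{first} {last}".strip()
    if full in canonical_names:
        # Known exception: keep exactly as entered.
        return first, last

    def fix(part):
        # Mixed-case parts such as "ChengXiang" or "McKinley" are left alone.
        if part.isupper() or part.islower():
            return part.title()
        return part

    return fix(first), fix(last)


# Hypothetical whitelist contents, for illustration only.
whitelist = {"Balamurali AR"}
print(normalize_case("Balamurali", "AR", whitelist))
print(normalize_case("ChengXiang", "ZHAI", whitelist))
```

Only parts that are entirely one case are touched, which is why mixed-case names need no whitelist entry under this scheme.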