Author name truecasing - Githubissues

nschneid commented 4 years ago

Related to #638, #641: Many author names in EMNLP 2019 are all-caps or all-lowercase, presumably because that is how they appear in START. It seems impractical to fix them manually for every conference. Should there be a heuristic in the ingestion script that corrects these? For example:

Let "word" be a segment of the name when splitting on spaces and hyphens.
If the first name contains no capital letters, capitalize the first character of every word in the first name. Likewise for last name.
If the first name has more than one uppercase letter and no lowercase letters, lowercase all but the first letter in each word. Likewise for last name, except the last word of the last name if it is "II", "III", or "IV".

The canonical form in name_variants.yaml could serve as a whitelist for known exceptions, e.g. "Balamurali AR". Note that the above heuristics preserve mixed-case names like ChengXiang and McKinley, so these do not need whitelisting.
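A minimal sketch of the rules proposed above, for illustration only. The function names, the `SUFFIXES` set, and the shape of the `whitelist` argument (a dict keyed by lowercased first and last name) are assumptions made for this example. They are not part of the ingestion script or of the actual name_variants.yaml format.

```python
import re

# Generational suffixes kept in upper case at the end of a last name.
SUFFIXES = {"II", "III", "IV"}


def truecase_part(part, is_last=False):
    """Apply the two proposed rules to a first name or a last name."""
    uppers = sum(c.isupper() for c in part)
    lowers = sum(c.islower() for c in part)
    # Split on spaces and hyphens; the capturing group keeps the separators
    # so the pieces can be joined back together unchanged.
    pieces = re.split(r"([ -])", part)
    if uppers == 0:
        # Rule 1: no capitals at all, so capitalize the start of every word.
        return "".join(p[:1].upper() + p[1:] for p in pieces)
    if uppers > 1 and lowers == 0:
        # Rule 2: all caps, so lowercase everything but each word's first letter.
        out = [p[:1] + p[1:].lower() for p in pieces]
        if is_last and pieces[-1] in SUFFIXES:
            out[-1] = pieces[-1]  # leave "II", "III", "IV" untouched
        return "".join(out)
    # Mixed case (ChengXiang, McKinley) or a single capital: leave as is.
    return part


def truecase_name(first, last, whitelist=None):
    """Return (first, last), consulting a hypothetical whitelist of exceptions."""
    canonical = (whitelist or {}).get((first.lower(), last.lower()))
    if canonical is not None:
        return canonical
    return truecase_part(first), truecase_part(last, is_last=True)
```

With a whitelist entry such as `{("balamurali", "ar"): ("Balamurali", "AR")}`, the name "BALAMURALI AR" comes back in its canonical form. Without that entry, the last name would be rewritten as "Ar".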

davidweichiang commented 4 years ago

Unlike #590, I think this is more important to fix, because our BibTeX styles do not change case in author names. But getting the heuristics right could be tricky.

I believe that Balamurali AR is not an edge case; there are a lot of South Asian names that use initials without periods.

Also, MCKINLEY would not be correctly lowercased by this heuristic; that could be a somewhat common case.

There might be some authors who insist on having their names in all caps or all lowercase. I think I would be okay with using name_variants.yaml to record these as exceptions.

nschneid commented 4 years ago

> Also, MCKINLEY would not be correctly lowercased by this heuristic; that could be a somewhat common case.

I went through the all-caps names in EMNLP 2019, and most were Chinese surnames. I suspect it is a convention in China to write romanized surnames in all-caps.

If we're really worried about mckinley/MCKINLEY and similar, we could have an additional heuristic which matches against existing names in the database.
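As a rough illustration of matching against existing names, the sketch below indexes known names by their case-folded form and reuses the most frequent mixed-case spelling. The function names and the plain list of known names are hypothetical; this is not the existing matching code in the Anthology.

```python
from collections import Counter, defaultdict


def build_case_index(known_names):
    """Map each case-folded name to a count of the spellings seen for it."""
    index = defaultdict(Counter)
    for name in known_names:
        index[name.casefold()][name] += 1
    return index


def recase_from_index(name, index):
    """Return the most common mixed-case spelling of name, if one is known."""
    variants = index.get(name.casefold())
    if not variants:
        return name  # never seen before: leave it for other heuristics
    # Ignore spellings that are themselves all upper or all lower case.
    mixed = {v: n for v, n in variants.items()
             if not v.isupper() and not v.islower()}
    if not mixed:
        return name
    return max(mixed, key=mixed.get)
```

For example, if the index was built from a list containing "McKinley", then both "MCKINLEY" and "mckinley" would be mapped to "McKinley", while a name with no mixed-case match is returned unchanged.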

davidweichiang commented 4 years ago

There's already some code to match against existing author names. It could be updated and improved, and that might address this problem partly.

I've suggested in the past that we might consider contacting some people and asking them to update their START profiles. ACL 2020 is asking them to do it right now anyway.

I just looked at the EMNLP 2019 list too and saw a couple of French names and an Indian name where the surname was in all caps. I agree that your heuristic is going to be 99% correct for names written in all caps.

But I think names like Balamurali AR are common enough to worry about. It won't do to put in an exception for names that are two or three letters long, because many Chinese names are also two or three letters long.

davidweichiang commented 4 years ago

Would it be too specific to apply your heuristic only to names that are written in Pinyin, which is very easy to check?

nschneid commented 4 years ago

I don't know how that is checked but it should cover most of the cases. Maybe the rest should require a manual decision to whitelist or truecase.

nschneid commented 4 years ago

And the manual decision can usually be made by checking the PDF. Even better if we could scrape the author capitalization from the PDF, but that might be too hard.

davidweichiang commented 4 years ago

We do have a script that scrapes from PDF. It is not run regularly, though. And sometimes authors use all caps in the PDF too.

The Pinyin filter would be a good 90% solution; my main worry is that a language specific rule could be perceived as discriminatory.

nschneid commented 4 years ago

Eh...it seems to me the status quo is (unintentionally) discriminatory against people whose surnames are sometimes entered in all-caps, because inconsistencies will make it harder to browse their work. And these are disproportionately people from China. So it makes sense to correct that, and ideally not in a way that hypercorrects South Asian abbreviated names.

Ideally this would be something that START would encourage authors to specify consistently in the first place ("You entered 'LU', but ACL style is to use only initial capitals within names. Did you mean 'Lu'?"). But we don't really have control over that.

davidweichiang commented 4 years ago

I agree that ideally this should happen earlier than ingestion into the Anthology, because names from START also appear in the conference website, handbook, etc.

davidweichiang commented 4 years ago

I tried a simpler version of these heuristics on the EMNLP 2018 authors, and it worked perfectly except for one possible false positive (the first name "cmcc"). The heuristic is:

If the first name is all lowercase, change it to title case (Python str.title() method).
If the first name is all uppercase and is either 4 or more characters long or a Pinyin syllable, change it to title case.
Similarly for the last name.
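The simpler heuristic can be sketched as follows. This is not the code that was actually run: `fix_case` is a hypothetical name, and `PINYIN_SYLLABLES` is a tiny hand-picked subset standing in for a complete Pinyin syllable inventory, which a real check would need.

```python
# Illustrative subset only (an assumption for this example); a real check
# would need the full inventory of Pinyin syllables.
PINYIN_SYLLABLES = frozenset({
    "he", "hu", "li", "lu", "ma", "wu", "xu", "yu",
    "guo", "lin", "liu", "sun", "wei", "yan",
})


def fix_case(part, pinyin=PINYIN_SYLLABLES):
    """Title-case an all-lowercase or all-uppercase first or last name."""
    if part.islower():
        return part.title()
    # Short all-caps names are changed only if they are Pinyin syllables,
    # so that initials such as "AR" are left alone.
    if part.isupper() and (len(part) >= 4 or part.lower() in pinyin):
        return part.title()
    return part
```

Because `str.title()` capitalizes the first letter of each run of letters and lowercases the rest, hyphenated names such as "jean-luc" become "Jean-Luc", but an all-caps suffix inside a longer name is lowercased too (for example "MAXWELL III" becomes "Maxwell Iii").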

davidweichiang commented 4 years ago

FWIW, START does have a tool in the pub chair console for correcting case problems in both titles and authors. I don't know whether it is regularly used. It also makes some mistakes (e.g., III is converted to Iii, and di is not converted to Di even if part of a Chinese name). And presumably changes to author names are not propagated back up to the global profile.

davidweichiang commented 4 years ago

Running this heuristic on the current Anthology authors yields 872 corrections. There are some false positives, though. Some seem fixable (MAXWELL III -> Maxwell Iii) but some seem tougher, especially corporate authors like ARC A3 or TIPSTER SE/CM.

nschneid commented 4 years ago

Nice! Could we run this periodically and record the exceptions as having been manually checked?

davidweichiang commented 4 years ago

It would be a tedious process each time. I am hoping that START will incorporate something like this so we don't have to deal with it. But otherwise, it would make most sense, I think, to have it run automatically at ingestion time.

mjpost commented 3 years ago

With commit ab92b62a8eb2d99c88abaf8330dc11261cb382d6, ingest.py now prompts the ingestor to confirm capitalization when it discovers all-lowercase or all-uppercase names. Truecasing would be a better approach, but I think this quick fix probably captures 99% of instances.

I agree START should do this, but it also seems fixable at ingestion time, which is a longer-term solution.
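For readers who want a picture of what a confirm-at-ingestion step might look like, here is a hedged sketch. It is not the code from the commit above: `looks_miscased`, `confirm_case`, the prompt wording, and the rule of skipping single-character names are all assumptions made for illustration.

```python
def looks_miscased(name):
    """Flag names that are entirely lowercase or entirely uppercase."""
    # Skipping single characters (bare initials) is an assumption of this sketch.
    return len(name) > 1 and (name.islower() or name.isupper())


def confirm_case(name, ask=input):
    """Ask the person doing the ingestion whether to change a suspicious name.

    `ask` takes a prompt string and returns the reply; it defaults to the
    built-in input() and can be replaced for testing.
    """
    if not looks_miscased(name):
        return name
    suggestion = name.title()
    reply = ask(
        f"Name '{name}' is all one case. "
        f"Use '{suggestion}'? [y/N, or type a replacement] "
    ).strip()
    if reply.lower() in ("y", "yes"):
        return suggestion
    if reply == "" or reply.lower() in ("n", "no"):
        return name  # keep the name exactly as entered
    return reply  # anything else is taken as the corrected name
```

Allowing a typed replacement covers cases where title case is wrong, such as "MCKINLEY", where the person can enter "McKinley" directly.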

acl-org / acl-anthology

Author name truecasing #643