acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Correction: Diacritics missing from author name though present in PDF #333

Open nschneid opened 5 years ago

nschneid commented 5 years ago

Ironically, Diacritics Restoration Using Neural Networks lists "Jan Hajic" on the page and in the BibTeX whereas it's spelled "Jan Hajič" in the PDF.

I see he is listed with the diacritic in some venues but not others, though going by the PDFs, Jan Hajič seems to be the preferred spelling.

Should the policy be that if an author is listed with multiple spellings differing only in diacritics, the one with the most diacritics should be applied?
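A minimal sketch of the policy proposed above, for illustration only: given spellings that differ only in diacritics, keep the one with the most combining marks. The function names are hypothetical and not part of the Anthology code. The sketch assumes that Unicode NFD decomposition exposes every diacritic as a combining mark, which does not hold for letters such as "ø" or "ł".

```python
import unicodedata


def strip_diacritics(name: str) -> str:
    """Return the name with all combining marks removed (after NFD decomposition)."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def count_diacritics(name: str) -> int:
    """Count the combining marks in the NFD form of the name."""
    decomposed = unicodedata.normalize("NFD", name)
    return sum(1 for ch in decomposed if unicodedata.combining(ch))


def pick_most_accented(variants):
    """Return the variant with the most diacritics, or None if the
    variants differ in more than diacritics (hypothetical helper)."""
    bases = {strip_diacritics(v) for v in variants}
    if len(bases) != 1:
        return None  # spellings differ in base letters; needs a human decision
    return max(variants, key=count_diacritics)


print(pick_most_accented(["Jan Hajic", "Jan Hajič"]))
```

Returning None when the base letters differ keeps the rule from touching pairs like "Koehn" and "Köhn", where the difference is more than a missing accent.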

mjpost commented 5 years ago

The goal is generally for (a) BibTeX to reflect PDF and (b) author pages to collect all observed name variants. So this is a mistake that should be corrected in the XML.

Want to open a PR (and be added to our list of volunteers)?

nschneid commented 5 years ago

Before making a one-time change I'd like to understand the underlying problem. Could it be that his name is listed without the diacritic in START, so it is showing up that way in the metadata for many of the venues?

davidweichiang commented 5 years ago

I just checked -- indeed, his name in START is just Jan Hajic.

davidweichiang commented 5 years ago

Missing diacritics is a widespread problem that we hoped to sidestep by allowing name variants. I suppose one could try to write a scraper to try to detect them.

If we knew when the switch was made to using START names, then looking for frequent mismatches after that year would help to identify people to contact and ask them to consider updating their profile.

davidweichiang commented 5 years ago

I adapted the auto_first_names.py script and am running it on L18 now. It's catching quite a few errors beyond the one @nschneid pointed out: it removes extra accents, lowercases an all-caps name, and flags (but, alas, does not autocorrect) a couple of misspelled names.

davidweichiang commented 5 years ago

In L18 (528 papers, wow), the script made 150 changes (also wow) and printed another 100+ warnings that usually indicate a typo or missing word.

The automatic changes are easy to check, and they all look good except for a few:

INFO:L18-1066 author Tomasz Pędzimąż: changing: Pędzimąż -> Pȩdzima̧ż
INFO:L18-1495 author Anna Björk Nikulásdóttir: changing: Nikulásdóttir -> Nikulasdóttir
INFO:L18-1632 author Huda Almuzaini: changing: Almuzaini -> almuzaini

The first changes ogoneks to cedillas, I believe incorrectly. The second one looks incorrect to me based on a Google search. The third one lowercases the last name of someone who doesn't appear to do that regularly.
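The ogonek/cedilla confusion in the first case is easy to miss visually but shows up clearly in the Unicode data. The sketch below is an illustration using Python's standard `unicodedata` module, not code from the script. The helper name `mark_names` is hypothetical. It lists the combining marks in a name so that a proposed change that swaps one mark for another can be spotted.

```python
import unicodedata


def mark_names(text: str):
    """List the Unicode names of the combining marks in text, in order."""
    decomposed = unicodedata.normalize("NFD", text)
    return [unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)]


# Escapes are used so the two spellings cannot be confused in an editor.
with_ogoneks = "P\u0119dzim\u0105\u017c"      # e and a with ogonek, z with dot above
with_cedillas = "Pe\u0327dzima\u0327\u017c"   # e and a followed by combining cedilla

print(mark_names(with_ogoneks))
print(mark_names(with_cedillas))
print(mark_names(with_ogoneks) == mark_names(with_cedillas))
```

A check like this could flag an automatic change for manual review whenever the set of marks changes rather than only grows or shrinks.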

@mjpost, do you think the PDF should be followed in such cases?

davidweichiang commented 5 years ago

What system does LREC use to fill metadata? Do they use START also? I'm running the script on L16 now (for #341) and seeing some PDF/XML mismatches that are the same as in L18.

For example (not an exhaustive list):

Phillippe Langlais -> Philippe Langlais
Ina Roesiger -> Ina Rösiger

So it would be nice to fix these at the source instead of on our end.

kilian-gebhardt commented 5 years ago

Seems to be START for both years: http://lrec2016.lrec-conf.org/en/submission/ http://lrec2018.lrec-conf.org/en/submission/

davidweichiang commented 5 years ago

@mjpost what are your thoughts about editing XML to match PDF in these cases where the PDF has less information than the current XML:

  1. XML currently has Matt Post, PDF has Matt POST
  2. XML Matt Post, PDF M. Post
  3. XML Matt Post, PDF Mat Post
  4. XML Matt J. Post, PDF Matt Post
  5. XML Matt Póst, PDF Matt Post (supposing that the accent is correct)
  6. XML Matt Post, PDF matt post [Edit: numbered list] [Edit: 6]

mjpost commented 5 years ago

  1. I approve on the grounds of superseding another conference's convention

  2. I like this only when it is clear that initials were used because of a conference-level editorial decision (in which case we are overriding their convention with our superior one). If this were a one-off, we wouldn't have evidence that it wasn't the author's choice.

  3. I approve as a typo correction

  4. I dislike, because there is no evidence that this is a correction. (And in particular, I strongly dislike my name being written as Matthew, Matthew J, Matt J, etc)

  5. Is murky but I think wrong. For example, the same corrective principle might change Koehn → Köhn, which would be wrong. We could set a general rule that acknowledges typing Latin-1 characters was harder, say, 20 years ago, but I think it's more straightforward to list this as an ASCII variant.

Just to be clear, since my tone may indicate otherwise, we can discuss any of these.

danielgildea commented 5 years ago

As a general rule, I would say the xml should reflect how you would want to cite the paper, and not necessarily have to match the PDF 100% of the time. On that basis, I would say that the xml should have: 1) No full caps. 2) Full first name if we know that the author usually uses it, and this conference/paper just didn't allow it. 3) Typos fixed if we are absolutely sure it's a typo. 4) Middle initial and form "Matt" vs "Matthew" etc as they appear in the pdf. 5) Any diacritics if we are sure they are correct and are generally used by that person.

Unfortunately these rules require some research/judgment, but I think it is better to leave things the way they are in case of doubt than it is to exactly mirror the PDFs.

akoehn commented 5 years ago

@mjpost : Approve means that you would like to keep the XML data and not the PDF one, correct?

For example, the same corrective principle might change Koehn → Köhn, which would be wrong

As an expert in this field [ :-) ]: it depends. You cannot change an oe to ö without any evidence. However, if either the PDF or the XML actually has Köhn in it, it is probably safe to say that the umlaut is the preferred version. Case in point: Philipp Köhn spells his name Koehn in all publications and has this probably also in softconf. No algorithm would try to change it to Köhn. I try to use the umlaut, but in some cases have to enter an ascii-only name, so Koehn will be in some database as well.

Ina Roesiger -> Ina Rösiger

In this case, one should go with Rösiger.

mjpost commented 5 years ago

@akoehn—yes, approve means I was in favor of the XML diverging from the PDF in the cases mentioned above.

I like @danielgildea's concise summary above. We should start throwing conclusions from these discussions into a wiki page that describes our approach.

Just seeing (6) above: I think capitalization falls under the Gildea Principles: we correct it to English conventions unless we have evidence the author prefers it that way (e.g., danah boyd, e e cummings).

davidweichiang commented 5 years ago

I think I hear a consensus about

(3) Don't copy errors from the PDF into the XML. Note that this can occasionally be a tough call: for example, I had difficulty figuring out Elahe Khorasani vs. Elahe Khorashani.

(4-5) Assuming that neither the PDF nor the XML has an error, go with the PDF.

(1-2) Override styles (like first initials or all caps) imposed by a conference (which is rare).

But:

(1-2) There's less clarity about individual papers that use first initials or all caps -- @mjpost says go with the PDF.

So for the examples mentioned in this thread, and a couple more:

Consensus cases:

Not sure about whether these are considered typos or not:

And the cases where there is difference of opinion:

nschneid commented 5 years ago

(1-2) There's less clarity about individual papers that use first initials or all caps -- @mjpost says go with the PDF.

I'm not sure about this one. Why would an author choose to abbreviate their name in some publications but not others? It could be that the names in the PDF all follow one convention which is inconsistent with what some of the authors normally do. I would generally prefer more information over less information, so if the name was spelled out in START but abbreviated in the PDF, I'd go with the non-abbreviated version.

mjpost commented 5 years ago

I agree about having more information whenever possible; I just want us to have some degree of certainty about it, so that we don't "Gilbert Keith" someone's "GK". If we have information from another source (start ID, inference about a conference convention, etc) that suggests the author is fine with an evidenced fuller version, I'm fine with it. But part of the reason to have a strong preference for the PDF is that without that convention, one can spend endless time trying to figure out what's right in all these situations.

davidweichiang commented 5 years ago

@nschneid we previously discussed first initials at length in #245; @mjpost sorry to bring it up again. In the current situation (LREC and other conferences that use START), the full names are known to be correct because they are provided by the authors, so it seems especially sad to delete information (and indeed, I didn't do it in PR #340).

I will try to summarize the above discussion in the wiki, and I will back out the changes in L18 that made some last names all-caps.

davidweichiang commented 5 years ago

Do you want to further discuss how to get people to change their names in START? If not, we can close this issue.

davidweichiang commented 5 years ago

I think we can pretty reliably restore accents now by scraping them from PDFs. What's the best way to use this -- to identify people to ask to update their START accounts, or just run the scraper as part of ingestion?

The scraper also changes casing and inserts/deletes spaces and hyphens. But it can only flag, not autocorrect, changes in spelling or insertion/deletion of names or initials.
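The split described above, between differences that can be fixed automatically and differences that can only be flagged, can be sketched as follows. This is an illustration under stated assumptions and not the logic of the actual scraper. The names `skeleton` and `classify` are hypothetical. Two names are treated as safely reconcilable when they become identical after removing combining marks, whitespace, and hyphens and ignoring case.

```python
import unicodedata


def skeleton(name: str) -> str:
    """Reduce a name to its base letters: no combining marks,
    no whitespace or hyphens, case-insensitive."""
    decomposed = unicodedata.normalize("NFD", name)
    kept = [
        ch for ch in decomposed
        if not unicodedata.combining(ch) and not ch.isspace() and ch != "-"
    ]
    return "".join(kept).casefold()


def classify(xml_name: str, pdf_name: str) -> str:
    """Return 'match', 'autocorrect', or 'flag' for an XML/PDF name pair."""
    if xml_name == pdf_name:
        return "match"
    if skeleton(xml_name) == skeleton(pdf_name):
        return "autocorrect"  # differs only in accents, casing, spaces, or hyphens
    return "flag"             # spelling or name/initial changes need a human


print(classify("Jan Hajic", "Jan Hajič"))
print(classify("Phillippe Langlais", "Philippe Langlais"))
```

Under this rule "Ina Roesiger" versus "Ina Rösiger" is flagged rather than autocorrected, because "oe" and "ö" differ in base letters.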

danielgildea commented 5 years ago

As far as getting people to update their names in START, it seems like there are a few things we might try: 1) Try to get everyone's emails from START, and send emails to people with a mismatch. 2) Ask ACL organizers to include a note in their email to authors about checking that the names in START match what people want, possibly including the authors' names from START in the email so that people can see easily how their names appear now. 3) Provide pub chairs with a script to check names against the PDFs, so that they can edit the metadata and possibly bug the authors themselves. Any thoughts on which of these to pursue?

mjpost commented 5 years ago

I think we should focus our efforts on implementing this ourselves when we generate the XML (say, in anthology_xml.py). Reasoning:

(1) and (2) are still good ideas to reduce the amount that has to be fixed, though.

davidweichiang commented 5 years ago

I agree, it would be annoying for everyone involved to email individual people. So, we have an author-name scraper (https://github.com/acl-org/acl-anthology/blob/auto_accents/bin/auto_authors.py) that could be incorporated into normalize_anth.py and run as part of ingestion.

mjpost commented 5 years ago

@davidweichiang, do you want a local copy of the Anthology PDFs? It's 35 GB. If you have a CLSP account we could set this up, or find another way.

davidweichiang commented 5 years ago

I don’t have a CLSP account (I don’t think). But a local copy might be a good idea if we can figure out a way.

akoehn commented 5 years ago

Short cross link: https://github.com/acl-org/acl-anthology/issues/295#issuecomment-494909877 for a discussion of how to mirror PDFs in bulk. Should be ~5mins to implement.

mjpost commented 5 years ago

I've posted a file with checksums here [14 MB].

davidweichiang commented 5 years ago

Can this file (as well as the mirroring script) become part of the repo?

akoehn commented 5 years ago

Can we discuss that further in #295 (the mirroring issue)? I can write the script & create a pull request later today; I am currently on a train with limited bandwidth. Adding the checksums file to the repository seems like a good idea to me.

danielgildea commented 4 years ago

Hi,

I just ran find_name_variants.py, which finds names that slugify to the same thing. It found over 300 cases of people with essentially the same name who are currently treated as different people in the database because they are not entered in name_variants.yaml. Most are missing accents, and some have a different first/last split for multiword names. It looks like in all cases it is the same person.

I wonder if we should change the anthology code to consider any two names that slugify to the same thing to be the same person. That way these people could have one author page without us having to track down every name variant during ingestion, which we don't have any consistent process for currently.
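The idea of merging names that slugify to the same thing could look roughly like the sketch below. It is an assumption-laden illustration: `slugify_name` is a hypothetical stand-in and may not match the slug function the Anthology code actually uses, and `group_by_slug` is likewise invented for this example.

```python
import re
import unicodedata
from collections import defaultdict


def slugify_name(name: str) -> str:
    """Hypothetical slug: strip accents, lowercase, join letter runs with hyphens."""
    ascii_only = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


def group_by_slug(names):
    """Map each slug shared by more than one distinct spelling to those spellings."""
    groups = defaultdict(set)
    for name in names:
        groups[slugify_name(name)].add(name)
    return {slug: sorted(v) for slug, v in groups.items() if len(v) > 1}


print(group_by_slug(["Jan Hajič", "Jan Hajic", "Matt Post"]))
```

Because the slug is computed from the full name, a different first/last split of a multiword name yields the same slug. The cost is that two different people whose names share a slug would be merged, so an override mechanism would still be needed.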

mjpost commented 4 years ago

I like that idea. We should wrap up discussion in #623 and come up with a solution.