acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

LREC full author names #224

Closed: davidweichiang closed this issue 5 years ago

davidweichiang commented 5 years ago

(Original message replaced) L06 (LREC 2006) only lists authors' first initials, even though the original papers have full names. This makes author indexing difficult. Unfortunately, the online proceedings (http://www.lrec-conf.org/proceedings/lrec2006/) have the same problem. Is there a way to get the full names other than entering them manually? Could we scrape them from the PDFs?

davidweichiang commented 5 years ago

Other meetings have the same issue, but L06 is by far the biggest.

1665 L06.xml
 291 H01.xml
  86 L04.xml
  81 W04.xml
  79 M98.xml
  75 L02.xml
  75 H90.xml
  69 L00.xml
  60 H91.xml
  54 C82.xml
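
For anyone who wants to produce a similar per-file tally, here is a minimal sketch in Python. It assumes authors are stored as `<author><first>…</first><last>…</last></author>` elements and that "initials only" means every part of the first name is a single letter. The function names and sample data are hypothetical, and this is not necessarily how the counts above were obtained.

```python
import re
import xml.etree.ElementTree as ET


def is_initials_only(first: str) -> bool:
    """True if every part of the first name is a single letter, e.g. "A." or "J.-P."."""
    # Split on whitespace, periods, and hyphens; drop empty pieces.
    tokens = [t for t in re.split(r"[\s.\-]+", first.strip()) if t]
    return bool(tokens) and all(len(t) == 1 and t.isalpha() for t in tokens)


def count_initial_only_authors(xml_text: str) -> int:
    """Count <author> elements whose <first> child holds only initials."""
    root = ET.fromstring(xml_text)
    return sum(
        1
        for author in root.iter("author")
        if is_initials_only(author.findtext("first", default=""))
    )


# Hypothetical sample data (assumed layout, made-up names).
SAMPLE = """<volume>
  <paper>
    <author><first>A.</first><last>Doe</last></author>
    <author><first>Jane</first><last>Roe</last></author>
  </paper>
  <paper>
    <author><first>J.-P.</first><last>Poe</last></author>
  </paper>
</volume>"""

print(count_initial_only_authors(SAMPLE))
```

Running this over each XML file and sorting by count would give a listing in the same shape as the one above.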

mjpost commented 5 years ago

We could talk to the LREC folks. I'm not sure what system they use for conference management, but they might have the rest of the information available. I'll send an email.

mjpost commented 5 years ago

For LREC, at least, the problem looks specific to that year (2006). PDF scraping seems possible, but it is probably a research problem in its own right. I wonder if we could get funding for this, e.g., https://www.imls.gov/grants/available/national-leadership-grants-libraries.

Note that LREC has all the abstracts on their paper pages, which we haven't ingested. Adding them would probably help with indexing.

mjpost commented 5 years ago

Just found this: Semantic Scholar somehow has full author names, e.g., https://www.semanticscholar.org/paper/A-Cross-language-Approach-to-Rapid-Creation-of-New-Feldman-Hana/787bfd0ee5b5ccd891cf33bc84166d3a02a7b640

(They offered help, see note here: https://github.com/acl-org/acl-anthology/issues/208#issuecomment-477312810)

davidweichiang commented 5 years ago

It turns out that scraping from PDF is not very difficult! Later I will try running the scraper on some other conferences as well.
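
The scraper itself isn't shown in this thread, so the following is only a sketch of one way such a lookup could work, not the actual implementation. It assumes the first page of a PDF has already been converted to plain text by some external tool, and it searches that text for a capitalized word that starts with the known initial and is followed by the known last name. The function name `expand_first_name` and the sample text are hypothetical.

```python
import re
from typing import Optional


def expand_first_name(initial: str, last: str, page_text: str) -> Optional[str]:
    """Return a full first name for `last` found in `page_text`, or None."""
    letter = initial.strip()[:1]
    if not letter:
        return None
    # A word starting with the initial, an optional middle initial,
    # then the last name. Matching is case-sensitive on purpose.
    pattern = re.compile(
        rf"\b({re.escape(letter)}[^\W\d_]+)\s+(?:[^\W\d_]\.\s+)?{re.escape(last)}\b"
    )
    match = pattern.search(page_text)
    return match.group(1) if match else None


# Hypothetical first-page text (made-up title and names).
PAGE = "A Hypothetical Title\nJane Roe and Alex Q. Doe\nSome University"

print(expand_first_name("J.", "Roe", PAGE))
print(expand_first_name("A.", "Doe", PAGE))
print(expand_first_name("B.", "Poe", PAGE))
```

A real scraper would also need to handle hyphenated or accented names, multiple middle names, and authors who share a last name, none of which this sketch attempts.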