davidweichiang closed 5 years ago
Other meetings have the same issue, but L06 is by far the biggest.
1665 L06.xml
291 H01.xml
86 L04.xml
81 W04.xml
79 M98.xml
75 L02.xml
75 H90.xml
69 L00.xml
60 H91.xml
54 C82.xml
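For anyone who wants to reproduce a tally like this, here is a minimal sketch. I don't know how the counts above were actually produced, so treat it as one possible approach. It assumes each author is stored as `<author><first>…</first><last>…</last></author>`, and the helper names and the sample names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET


def is_initials_only(first):
    """True if a first name consists only of single letters, e.g. 'J.' or 'J.-P.'."""
    # Split on whitespace, periods, and hyphens, then drop empty pieces.
    parts = [p for p in re.split(r"[\s.\-]+", first.strip()) if p]
    return bool(parts) and all(len(p) == 1 and p.isalpha() for p in parts)


def count_initial_only_authors(xml_text):
    """Count <author> elements whose <first> child holds initials only.

    Assumption: the <author>/<first>/<last> layout described above.
    """
    root = ET.fromstring(xml_text)
    count = 0
    for author in root.iter("author"):
        first = author.findtext("first") or ""
        if is_initials_only(first):
            count += 1
    return count


# Hypothetical sample data, not taken from any real volume.
SAMPLE = """<volume>
  <paper>
    <author><first>J.</first><last>Doe</last></author>
    <author><first>J.-P.</first><last>Roe</last></author>
  </paper>
  <paper>
    <author><first>Jane</first><last>Doe</last></author>
  </paper>
</volume>"""

print(count_initial_only_authors(SAMPLE))
```

Running it over each file (for example by reading the file and passing its contents to `count_initial_only_authors`) and sorting by count would give a list in the same shape as the one above.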
We could talk to the LREC folks. I'm not sure what system they use for conference management, but they might have the rest of the information available. I'll send an email.
For LREC, at least, it looks like it was specific to that year. PDF scraping seems possible but also probably a research problem. I wonder if we could get funding for this, e.g., https://www.imls.gov/grants/available/national-leadership-grants-libraries.
Note that LREC has all the abstracts on their paper pages, which we don't have ingested. We could add them and it would probably help with indexing.
Just found this: Semantic Scholar somehow has full author names, e.g., https://www.semanticscholar.org/paper/A-Cross-language-Approach-to-Rapid-Creation-of-New-Feldman-Hana/787bfd0ee5b5ccd891cf33bc84166d3a02a7b640
(They offered help, see note here: https://github.com/acl-org/acl-anthology/issues/208#issuecomment-477312810)
It turns out that scraping from PDF is not very difficult! Later I will try running the scraper on some other conferences as well.
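To illustrate the general idea (this is not the scraper mentioned above, just a rough sketch): once the text of a paper's first page has been extracted by some PDF-to-text tool, the abbreviated name from the metadata can be matched against it. The function name `expand_initial` and the sample text are hypothetical, and the sketch assumes the first name appears directly before the last name, optionally with one middle initial in between.

```python
import re


def expand_initial(first_initial, last, page_text):
    """Return the full first name found before `last` in page_text, or None.

    Assumption: page_text is already-extracted plain text of the first page.
    Returns None when there is no match or more than one distinct candidate.
    """
    initial = first_initial.strip()[:1]
    if not initial:
        return None
    pattern = re.compile(
        r"\b(" + re.escape(initial) + r"[^\W\d_]+)"  # full first name starting with the initial
        r"\s+(?:[^\W\d_]\.\s+)?"                     # optional middle initial such as "Q. "
        + re.escape(last) + r"\b"
    )
    candidates = {m.group(1) for m in pattern.finditer(page_text)}
    return candidates.pop() if len(candidates) == 1 else None


# Hypothetical first-page text, not from a real paper.
PAGE = "A Sample Title\nJane Doe and John Q. Roe\nSome University"

print(expand_initial("J.", "Doe", PAGE))
print(expand_initial("J.", "Roe", PAGE))
print(expand_initial("M.", "Doe", PAGE))
```

Returning `None` on ambiguous matches keeps the sketch conservative: those cases would still need a manual check.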
(Original message replaced) L06 only uses first initials, even though the original papers have full names. This makes author indexing difficult. Unfortunately, the online proceedings (http://www.lrec-conf.org/proceedings/lrec2006/) are the same. Is there a way to get the full names other than manually entering them? Scraping them from the PDFs?