acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Papers out of sync #128

Closed eseyffarth closed 5 years ago

eseyffarth commented 5 years ago

I was looking up a paper and found that the PDF link pointed to the wrong file:

The PDF at http://aclweb.org/anthology/W04-1907 should be "Corpus-based Induction of an LFG Syntax-Semantics Interface for Frame Semantic Processing", but instead it is "The HOLJ Corpus: supporting summarisation of legal texts". On the page for that title, the PDF link in turn incorrectly points to "Automated Induction of Sense in Context". All three papers are from the same venue, with different IDs. I'm not sure how far down this goes.

I'm wondering if it would be possible to detect, correct, and avoid errors like this automatically?

This might be technically related to #44: if the system can look into the PDF, it can also check the PDF's title against the metadata and see if there are any conflicts.
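As a rough illustration of that kind of check, here is a minimal sketch. It assumes the title has already been extracted from the PDF by some other tool (not shown), and it is not how the Anthology actually validates anything. The function names and the 0.8 threshold are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so that minor
    formatting differences (hyphens, double spaces, case) are ignored."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def titles_match(metadata_title: str, pdf_title: str, threshold: float = 0.8) -> bool:
    """Return True if the two titles are similar enough to be the same paper.

    The threshold is an assumed value, not a tuned one.
    """
    ratio = SequenceMatcher(
        None, normalize(metadata_title), normalize(pdf_title)
    ).ratio()
    return ratio >= threshold


# Titles taken from the report above.
expected = ("Corpus-based Induction of an LFG Syntax-Semantics Interface "
            "for Frame Semantic Processing")
found = "The HOLJ Corpus: supporting summarisation of legal texts"

if not titles_match(expected, found):
    print("Conflict: PDF title does not match metadata title")
```

A fuzzy comparison is used rather than exact equality because titles extracted from PDFs often differ from the metadata in case, hyphenation, or spacing.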

mjpost commented 5 years ago

Thanks for reporting this! It seems there is an off-by-one error for the whole volume, and that the "Editorial" is missing as a separate PDF.

This is definitely the kind of thing we should check for at ingestion time. I created an issue within the Data Integrity Project (which is looking for a lead!) to track this more general issue.
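To make the idea concrete, a volume-wide shift like this one could be flagged with something along the following lines. This is only a sketch under assumptions: it takes two parallel lists (metadata titles in ID order, and titles already extracted from the PDF stored under each ID), and the names `best_offset`, `max_shift`, and `threshold` are hypothetical rather than part of any existing Anthology script.

```python
import re
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] between two titles, ignoring case and punctuation."""
    def norm(s: str) -> str:
        return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def best_offset(metadata_titles, pdf_titles, max_shift=2, threshold=0.8):
    """Find the shift k for which pdf_titles[i] best lines up with
    metadata_titles[i + k]. Returns (k, number_of_matching_papers).

    k == 0 means the volume is aligned; k != 0 suggests a systematic
    off-by-k error. Ties are resolved in favour of the smallest |k|.
    """
    best = (0, -1)
    for shift in sorted(range(-max_shift, max_shift + 1), key=abs):
        matches = sum(
            1
            for i, pdf_title in enumerate(pdf_titles)
            if 0 <= i + shift < len(metadata_titles)
            and similarity(metadata_titles[i + shift], pdf_title) >= threshold
        )
        if matches > best[1]:
            best = (shift, matches)
    return best


# Illustrative data: three titles from the report, plus a made-up fourth PDF title.
metadata = [
    "Corpus-based Induction of an LFG Syntax-Semantics Interface for Frame Semantic Processing",
    "The HOLJ Corpus: supporting summarisation of legal texts",
    "Automated Induction of Sense in Context",
]
pdfs = metadata[1:] + ["Some other paper (hypothetical)"]

shift, matches = best_offset(metadata, pdfs)
if shift != 0:
    print(f"Volume looks shifted by {shift} ({matches} papers line up at that offset)")
```

Checking whole volumes for a consistent offset, rather than flagging each paper separately, makes it easier to tell a single missing file (such as a front-matter PDF) from unrelated per-paper mistakes.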

mjpost commented 5 years ago

#141 reported a few more ingestion issues: