eseyffarth closed this issue 5 years ago
Thanks for reporting this! It seems there is an off-by-one error for the whole volume, and that the "Editorial" is missing as a separate PDF.
This is definitely the kind of thing we should check for at ingestion time. I created an issue within the Data Integrity Project (which is looking for a lead!) to track this more general issue.
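As a rough illustration of what such an ingestion-time check might look like, here is a minimal sketch that looks for a uniform offset between the titles in the metadata and the titles found in the PDFs of one volume. The function names, the `max_shift` parameter, and the assumption that PDF titles have already been extracted into a list are all hypothetical; this is not the Anthology's actual ingestion code.

```python
import re


def normalize(title):
    """Lowercase a title and collapse punctuation/whitespace for comparison."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def detect_shift(metadata_titles, pdf_titles, max_shift=2):
    """Find the offset k for which pdf_titles[i + k] best matches metadata_titles[i].

    Both lists are assumed to be in volume order. Returns (k, hits), where
    hits is the number of exact (normalized) title matches at that offset.
    A result other than k == 0 suggests the whole volume is shifted.
    """
    best_shift, best_hits = 0, -1
    # Try offset 0 first so that it wins any tie.
    for k in sorted(range(-max_shift, max_shift + 1), key=abs):
        hits = 0
        for i, title in enumerate(metadata_titles):
            j = i + k
            if 0 <= j < len(pdf_titles) and normalize(title) == normalize(pdf_titles[j]):
                hits += 1
        if hits > best_hits:
            best_shift, best_hits = k, hits
    return best_shift, best_hits
```

A non-zero offset with many hits would point to a volume-wide shift like the one reported here, while an offset of 0 with few hits would suggest scattered, unrelated mismatches.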
I was looking up a paper and found that the PDF link pointed to the wrong file:
The PDF at http://aclweb.org/anthology/W04-1907 should be "Corpus-based Induction of an LFG Syntax-Semantics Interface for Frame Semantic Processing", but instead it's "The HOLJ Corpus: supporting summarisation of legal texts". On the page for that second title, the PDF link in turn points to "Automated Induction of Sense in Context". They're all from the same venue, with different IDs. I'm not sure how far down this goes.
I'm wondering if it would be possible to detect, correct, and avoid errors like this automatically?
Might be technically related to #44. If the system can look into the PDF, it can also check the PDF's title against the metadata and flag any conflicts.
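A minimal sketch of that title check, assuming the PDF's title has already been extracted as a string by some earlier step (the extraction itself is not shown). The `titles_conflict` helper and the 0.8 threshold are hypothetical choices for illustration; fuzzy matching is used because extracted titles often differ from the metadata in casing, hyphenation, or punctuation.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase a title and collapse punctuation/whitespace for comparison."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def titles_conflict(metadata_title, pdf_title, threshold=0.8):
    """Return True if the two titles are too dissimilar to be the same paper.

    The threshold is an arbitrary assumption and would need tuning
    against real data.
    """
    ratio = SequenceMatcher(
        None, normalize(metadata_title), normalize(pdf_title)
    ).ratio()
    return ratio < threshold


# The mismatch reported in this issue: metadata title vs. title found in the PDF.
expected = ("Corpus-based Induction of an LFG Syntax-Semantics Interface "
            "for Frame Semantic Processing")
found = "The HOLJ Corpus: supporting summarisation of legal texts"
if titles_conflict(expected, found):
    print("Title mismatch: metadata and PDF disagree")
```

A check like this would only flag candidates for review; deciding which side is wrong (the metadata or the file) would still need a human or a volume-level comparison.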