acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0
384 stars, 256 forks

Paper metadata correction across the anthology #2805

Open GuyAglionby opened 10 months ago

GuyAglionby commented 10 months ago

I've noticed that the abstract metadata for many papers doesn't match the PDFs. I quickly checked the first 100 papers from ACL 2021 and 19 had discrepancies. Most of these were minor, but a few had been changed more substantially.

I have a script using Grobid that checks these, but it needs some more work as it currently introduces a few errors. I hope to submit a PR in the next few weeks, but wanted to post here first to get feedback and to check that it doesn't collide with any existing projects :)

Paper ids with changes: 3, 4, 5, 6, 9, 14, 16, 20, 21, 29, 31, 32, 52, 55, 64, 76, 77, 78, 89
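
For illustration, here is a minimal sketch of the kind of comparison described above. It is not the actual script: it assumes the abstract has already been extracted from the PDF as plain text (for example by Grobid), and the function names and the 0.9 "minor" threshold are hypothetical choices made only for this example.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Lowercase, rejoin words split by a hyphen at a line break, collapse whitespace.

    Caveat: the hyphen rule also joins genuine compounds that happen to
    break at a line end, so treat it as a heuristic.
    """
    text = re.sub(r"-\s*\n\s*", "", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def abstract_similarity(metadata_abstract: str, pdf_abstract: str) -> float:
    """Return a character-level similarity ratio in [0, 1] after normalization."""
    matcher = difflib.SequenceMatcher(
        None, normalize(metadata_abstract), normalize(pdf_abstract), autojunk=False
    )
    return matcher.ratio()


def classify(metadata_abstract: str, pdf_abstract: str,
             minor_threshold: float = 0.9) -> str:
    """Label a pair as 'match', 'minor', or 'substantial' (hypothetical labels)."""
    ratio = abstract_similarity(metadata_abstract, pdf_abstract)
    if ratio == 1.0:
        return "match"
    return "minor" if ratio >= minor_threshold else "substantial"
```

Normalizing before comparing keeps line wrapping and hyphenation in the PDF from being reported as discrepancies, so only real wording changes are flagged.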

mbollmann commented 10 months ago

Sounds good! There are also several datasets already out there that provide plain text for Anthology papers via Grobid or similar. Have you checked whether working with these might be easier? (I can dig up links if you want.)

GuyAglionby commented 10 months ago

Great! I found https://huggingface.co/datasets/WINGNUS/ACL-OCL, but it's only updated until Sep '22. I'll get in touch with them about updating it, or otherwise figure something out.