Open GuyAglionby opened 10 months ago
Sounds good! There are also several existing datasets that provide plain text for Anthology papers via Grobid or similar tools. Have you checked whether working with these might be easier? (I can dig up links if you want.)
Great! I found https://huggingface.co/datasets/WINGNUS/ACL-OCL, but it has only been updated up to September 2022. I'll get in touch with the maintainers about updating it, or otherwise figure something out.
I've noticed that the abstract metadata for many papers doesn't match the PDFs. I quickly checked the first 100 papers from ACL 2021, and 19 had discrepancies. Most of these were minor, but a few had been changed more substantially.
I have a script using Grobid that checks these, but it needs some more work, as it currently introduces a few errors of its own. I hope to submit a PR in the next few weeks, but wanted to post here first to get feedback and to check that it doesn't collide with any existing projects :)
Paper ids with changes: 3, 4, 5, 6, 9, 14, 16, 20, 21, 29, 31, 32, 52, 55, 64, 76, 77, 78, 89
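For anyone who wants to reproduce this kind of check, here is a minimal sketch of the comparison step only. It is not the script mentioned above: it assumes the PDF abstract has already been extracted to plain text (by Grobid or anything else), and the names `normalize`, `compare_abstracts`, and the `minor_threshold` value of 0.9 are hypothetical choices for illustration.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Reduce formatting noise so that only wording differences remain."""
    # Fold ligatures and other compatibility characters common in PDF text.
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words hyphenated across line breaks ("pro-\npose" -> "propose").
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse every run of whitespace to a single space and lowercase.
    return re.sub(r"\s+", " ", text).strip().lower()


def compare_abstracts(metadata_abstract: str, pdf_abstract: str,
                      minor_threshold: float = 0.9):
    """Return (label, ratio) where label is 'match', 'minor' or 'substantial'.

    The threshold is an assumption and would need tuning on real data.
    """
    a = normalize(metadata_abstract)
    b = normalize(pdf_abstract)
    if a == b:
        return "match", 1.0
    # Word-level similarity, so a single changed word is a small difference.
    ratio = SequenceMatcher(None, a.split(), b.split()).ratio()
    label = "minor" if ratio >= minor_threshold else "substantial"
    return label, ratio
```

Normalizing before comparing matters because extracted PDF text tends to differ from the metadata in whitespace, hyphenation, and ligatures even when the wording is identical, which would otherwise inflate the number of reported discrepancies.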