Closed cfiorelli closed 7 months ago
It looks like a bug in our pipeline is producing these duplicate annotations. In a small sample, I found duplicates in 5% of papers, with some annotations repeated as many as 10 times. The annotations are otherwise correct, so the best course is to remove the duplicates in your own code. We can fix the bug, but it may take a while for the affected papers to be reprocessed, and we won't change any already-released datasets.
cc Jessica Lam slack report
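For anyone applying the client-side workaround suggested above, here is a minimal sketch of removing exact duplicates. It assumes each annotation has already been parsed into a dict with `start` and `end` keys and an optional `attributes` dict; that layout, the function name `dedupe_spans`, and the sample values are illustrative assumptions, not a confirmed schema, so adjust the key to whatever your copy of the data actually contains.

```python
import json


def dedupe_spans(spans):
    """Return spans with exact duplicates removed, keeping first-seen order.

    `spans` is assumed to be a list of JSON-serializable dicts
    (hypothetical layout: "start", "end", optional "attributes").
    """
    seen = set()
    unique = []
    for span in spans:
        # Serialize with sorted keys so dicts with identical content
        # produce the same key regardless of key order.
        key = json.dumps(span, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(span)
    return unique


# Made-up sample data for illustration only.
spans = [
    {"start": 0, "end": 12},
    {"start": 0, "end": 12},
    {"start": 40, "end": 55, "attributes": {"ref_id": "b3"}},
    {"attributes": {"ref_id": "b3"}, "end": 55, "start": 40},
    {"start": 40, "end": 55, "attributes": {"ref_id": "b4"}},
]

unique_spans = dedupe_spans(spans)
print(len(spans), "->", len(unique_spans))
```

Only spans that are identical in every field are dropped here; spans that share offsets but differ in attributes are kept, since they may be legitimately distinct.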
Describe the bug
1) Many annotation spans in the bulk download (obtained on 11 Oct 2023) appear to be repeated. It's easy enough to fix on my end; I'm just wondering whether there's something else going on.
2) For many papers, the bibliography entry spans in the bulk download do not correspond to the references reported by the Papers API. For example, PaperId ea4de8e24447b3debfbe9e9c697ab2b66f6663b6 is referenced by doi:10.24940/theijst/2021/v9/i7/st2107-011 both in the PDF and according to the Papers API, but there is no bibliography entry span for the cited paper in the bulk download. Is this because the bibliography entry could not be detected?
To Reproduce
Expected behavior
1) Annotation spans are distinct.
2) Bibliography entry spans correspond to the references reported by the Papers API.
Additional context Example output from above