earlng / academic-pdf-scrap

Code that scrapes the contents of the PDF papers submitted for NeurIPS 2020
MIT License

XML tagging of PDFs is too faulty #14

Open earlng opened 3 years ago

earlng commented 3 years ago

Describe the bug

The XML tagging of the PDFs is too faulty for the contents to be scraped correctly.

To Reproduce

The BIS title is improperly coded: in the XML it appears after the BIS content and is tagged as <region>.

The entire BIS, title and content, is contained within a single tag together with other, non-BIS content, so the code does not scrape the BIS at all.

The BIS title and content are improperly and arbitrarily coded as <outsider> and <region>.
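
The cases above might at least be flagged before scraping. This is a minimal sketch, not the repository's code. It assumes the BIS title contains the text "Broader Impact" and that each element holds its text directly. The sample XML, `TITLE_MARKER`, and `find_mistagged_titles` are hypothetical names used only for illustration; only the <region> and <outsider> tags come from this issue.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real XML layout is not shown in this issue.
SAMPLE = """<document>
  <section>Conclusion</section>
  <region>We discuss societal effects of this work.</region>
  <region>Broader Impact</region>
  <outsider>References</outsider>
</document>"""

TITLE_MARKER = "broader impact"        # assumed text of the BIS title
SUSPECT_TAGS = {"region", "outsider"}  # tags named in this issue


def find_mistagged_titles(xml_text):
    """Return (position, tag) for each BIS title found under a suspect tag."""
    root = ET.fromstring(xml_text)
    hits = []
    # root.iter() walks the tree in document order, starting with the root.
    for position, element in enumerate(root.iter()):
        text = (element.text or "").strip().lower()
        if element.tag in SUSPECT_TAGS and text.startswith(TITLE_MARKER):
            hits.append((position, element.tag))
    return hits


print(find_mistagged_titles(SAMPLE))
```

A check like this only reports files where the title landed under a suspect tag. It does not recover the BIS content, and it would miss the case where the title and content share one tag with unrelated text.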

Proposed Fix

No possible fix.