XML tagging of PDFs is too faulty - Githubissues

earlng / academic-pdf-scrap

Code that scrapes the contents of the PDF papers submitted for NeurIPS 2020

MIT License


XML tagging of PDFs is too faulty #14

Open earlng opened 3 years ago

earlng commented 3 years ago

Describe the bug

The XML tagging is too faulty for the papers to be scraped correctly.

To Reproduce

BIS title improperly coded: in the XML it appears after the BIS content and is tagged as <region>

103303dd56a731e377d01f6a37badae3
6271faadeedd7626d661856b7a004e27
F5b1b89d98b7286673128a5fb112cb9a
f0bda020d2470f2e74990a07a607ebd9

Entire BIS, title and content, contained within a tag (together with other, non-BIS content); the code does not scrape the BIS at all

201d7288b4c18a679e48b31c72c30ded
8ab70731b1553f17c11a3bbc87e0b605
94d2a3c6dd19337f2511cdf8b4bf907e
2974788b53f73e7950e8aa49f3a306db
B139aeda1c2914e3b579aafd3ceeb1bd
Be23c41621390a448779ee72409e5f49
7a006957be65e608e863301eb98e1808

BIS title and content improperly and arbitrarily coded as <outsider> and <region>

1325cdae3b6f0f91a1b629307bf2d498
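
To see which of the cases above a given file falls into, a tag-agnostic scan can report which element actually carries the BIS heading. This is only a rough sketch, not the repo's scraper: it assumes the heading text contains "Broader Impact" (the issue does not state the exact wording), the function name `find_heading_tags` is hypothetical, and the sample XML is made up to mimic the first case.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the BIS heading contains "Broader Impact"; the issue does not
# give the exact heading text, so adjust this pattern as needed.
HEADING = re.compile(r"broader\s+impact", re.IGNORECASE)


def find_heading_tags(xml_text):
    """Return the tag name of every element whose own text holds the heading."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        # elem.text only covers the text before the element's first child.
        if elem.text and HEADING.search(elem.text):
            hits.append(elem.tag)
    return hits


# Hypothetical sample mimicking the first case: content first, then the
# heading, with the heading tagged as <region>.
sample = """<document>
  <region>This work may affect ...</region>
  <region>Broader Impact</region>
  <outsider>7</outsider>
</document>"""

print(find_heading_tags(sample))
```

A scan like this only flags where the heading ended up. It does not recover the BIS content when title and content are merged into one tag with unrelated text.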

Proposed Fix

No possible fix