allenai / papermage

library supporting NLP and CV research on scientific papers
https://papermage.org
Apache License 2.0

The parser stability check usually fails. #7

Closed by bnewm0609 1 year ago

bnewm0609 commented 1 year ago

When running pytest tests/test_parsers/test_pdf_plumber_parser.py, the test_parser_stability test usually fails. A workaround is to overwrite the test fixture JSON (tests/fixtures/2304.02623v1.json) with a fresh parse of the document before running the test:

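import json
from papermage.parsers import PDFPlumberParser

# Re-parse the fixture PDF and overwrite the stored JSON fixture with the result.
parser = PDFPlumberParser()
doc = parser.parse(input_pdf_path="tests/fixtures/2304.02623v1.pdf")
with open("tests/fixtures/2304.02623v1.json", "w") as f:
    json.dump(doc.to_json(), f)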

However, this defeats the purpose of a stability test: if the fixture is regenerated before every run, the test can no longer tell whether PDF parses are stable between runs. Can we make a better stability test?
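
One possible direction, sketched here under assumptions and not taken from the papermage test suite: compare the fresh parse against the stored fixture structurally, with a small tolerance on numeric values, and report every path where the two diverge. The helper name find_diffs and the default tolerance are hypothetical, and nothing in this thread establishes that the failures are small numeric differences; the sketch mainly makes it easier to see where a parse changed.

```python
import math
from typing import Any, List


def find_diffs(expected: Any, actual: Any, path: str = "$", tol: float = 1e-6) -> List[str]:
    """Return the paths where two JSON-like values differ (hypothetical helper)."""
    # bool is a subclass of int, so compare booleans exactly before the numeric branch.
    if isinstance(expected, bool) or isinstance(actual, bool):
        return [] if expected is actual else [f"{path}: {expected!r} != {actual!r}"]
    # Numbers match when they are within an absolute tolerance of each other.
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        if math.isclose(expected, actual, rel_tol=0.0, abs_tol=tol):
            return []
        return [f"{path}: {expected!r} != {actual!r}"]
    # Dicts are compared key by key; a key present on only one side is a diff.
    if isinstance(expected, dict) and isinstance(actual, dict):
        diffs: List[str] = []
        for key in sorted(set(expected) | set(actual)):
            if key not in expected or key not in actual:
                diffs.append(f"{path}.{key}: missing on one side")
            else:
                diffs.extend(find_diffs(expected[key], actual[key], f"{path}.{key}", tol))
        return diffs
    # Lists must have equal length and are compared element by element.
    if isinstance(expected, list) and isinstance(actual, list):
        if len(expected) != len(actual):
            return [f"{path}: length {len(expected)} != {len(actual)}"]
        diffs = []
        for i, (e, a) in enumerate(zip(expected, actual)):
            diffs.extend(find_diffs(e, a, f"{path}[{i}]", tol))
        return diffs
    # Everything else (strings, None, mismatched types) must be exactly equal.
    return [] if expected == actual else [f"{path}: {expected!r} != {actual!r}"]
```

In a test, the stored fixture would be loaded with json.load and compared against doc.to_json(), asserting that find_diffs returns an empty list. On failure, the returned paths show which parts of the parse moved.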

bnewm0609 commented 1 year ago

This seems to be caused by my local version of pdfplumber being different from the one downloaded during CI. Local version: 0.10.1. CI version: 0.7.8.
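
A quick way to confirm this kind of mismatch locally is to read the installed pdfplumber version and compare it with the version CI uses. The following is a minimal sketch: the helper names and the FIXTURE_PDFPLUMBER_VERSION constant are hypothetical, and it assumes the stored fixture matches the output of the CI version (0.7.8), which this thread suggests but does not state outright.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def same_minor_series(installed: str, expected: str) -> bool:
    """True when both version strings share the same major.minor prefix."""
    return installed.split(".")[:2] == expected.split(".")[:2]


# Assumption: the fixture corresponds to the CI version reported above.
FIXTURE_PDFPLUMBER_VERSION = "0.7.8"

found = installed_version("pdfplumber")
if found is None:
    print("pdfplumber is not installed")
elif not same_minor_series(found, FIXTURE_PDFPLUMBER_VERSION):
    print(f"pdfplumber {found} differs from fixture version {FIXTURE_PDFPLUMBER_VERSION}")
```

A check like this could be used to skip the stability test with a clear message when the versions differ. Pinning pdfplumber to one version in the project dependencies would remove the mismatch at its source.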