Limiting parsed page size to 1.28 million chars

Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations

Apache License 2.0

Limiting parsed page size to 1.28 million chars #592

Closed jamesbraza closed 1 month ago

jamesbraza commented 1 month ago

We encountered a PDF (for DOI 10.1038/s41593-019-0491-3) with PyMuPDF==1.24.11 where one of the pages was parsed to about 400 million characters of junk (mainly whitespace). This didn't actually crash us, but it took hours to index.

Since this was junk, this PR moves us to discard any PDF with a page that fails to parse to a reasonable amount of text (at most 1.28 million chars), which is already 10X greater than a 128k-token context window (and let's ignore that chars != tokens in this 10X).
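As a rough illustration of the check described above, here is a minimal sketch. It is not the PR's actual diff: `check_page_sizes`, `PageTooLargeError`, and `PAGE_SIZE_LIMIT` are hypothetical names, and the pages are assumed to be already-extracted strings rather than PyMuPDF objects.

```python
from collections.abc import Iterable
from typing import Optional

# Assumption: 10 x a 128k-token context window, treating one char as one token.
PAGE_SIZE_LIMIT = 10 * 128_000  # 1.28 million chars


class PageTooLargeError(ValueError):
    """Hypothetical error for a page whose parsed text is implausibly long."""


def check_page_sizes(
    page_texts: Iterable[str],
    page_size_limit: Optional[int] = PAGE_SIZE_LIMIT,
) -> dict[str, str]:
    """Map 1-based page numbers to text, raising if any page exceeds the limit.

    Passing page_size_limit=None disables the check.
    """
    pages: dict[str, str] = {}
    for number, text in enumerate(page_texts, start=1):
        if page_size_limit is not None and len(text) > page_size_limit:
            # Fail on the first oversized page instead of indexing junk for hours.
            raise PageTooLargeError(
                f"Page {number} parsed to {len(text):,} chars,"
                f" over the {page_size_limit:,} char limit."
            )
        pages[str(number)] = text
    return pages
```

Raising on the first oversized page means the whole document is rejected, which matches the "discard" behavior described above. A caller could catch the error and skip the file.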

whitead commented 1 month ago

Good idea! Why did you call it page limit but compare it with number of characters?

jamesbraza commented 1 month ago

> Good idea! Why did you call it page limit but compare it with number of characters?

We talked in person and agreed it was a bad name. I renamed it to page_size_limit.