Closed jamesbraza closed 1 month ago
Good idea! Why did you call it page limit but compare it with number of characters?
We talked in person; it was a bad name. I renamed it to page_size_limit.
We encountered a PDF (for DOI 10.1038/s41593-019-0491-3) with PyMuPDF==1.24.11 where one of the pages was parsed to about 400 million characters of junk (mainly whitespace). This didn't actually crash us, but it takes hours to index. As this was junk, this PR moves us to discard PDFs that fail to parse to a reasonable amount of text (1.28 million chars), which is already 10X greater than a 128k-token context window (and let's ignore chars != tokens in this 10X).
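For illustration, the check described above could look roughly like the sketch below. It is not the PR's actual code: `collect_pages`, `PageSizeLimitExceeded`, and `DEFAULT_PAGE_SIZE_LIMIT` are hypothetical names, and only `page_size_limit` and the 1.28 million character figure come from this thread. The sketch assumes the limit applies to each page's extracted text and that an oversized page causes the whole PDF to be rejected.

```python
from typing import Dict, Iterable, Optional

# 10x a 128k-token context window, treating one char as one token for simplicity.
DEFAULT_PAGE_SIZE_LIMIT = 10 * 128_000  # 1.28 million characters


class PageSizeLimitExceeded(ValueError):
    """Hypothetical error: a page's extracted text is implausibly large."""


def collect_pages(
    page_texts: Iterable[str],
    page_size_limit: Optional[int] = DEFAULT_PAGE_SIZE_LIMIT,
) -> Dict[str, str]:
    """Map 1-based page numbers to text, rejecting the PDF on an oversized page.

    Passing page_size_limit=None disables the check.
    """
    pages: Dict[str, str] = {}
    for number, text in enumerate(page_texts, start=1):
        if page_size_limit is not None and len(text) > page_size_limit:
            # Fail fast so hours are not spent indexing junk (e.g. whitespace).
            raise PageSizeLimitExceeded(
                f"Page {number} has {len(text)} characters,"
                f" exceeding the limit of {page_size_limit}."
            )
        pages[str(number)] = text
    return pages
```

In practice, the page texts would come from the PDF parser (for PyMuPDF, something like `page.get_text()` for each page), and the caller would catch the error and skip the document.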