Open stephen-williamson opened 5 months ago
@stephen-williamson Thanks for sharing the document. The main issue I see with your document is that the page contains about 2 million letters, and NearestNeighbourWordExtractor was not designed to handle that many letters.
Fixing that would involve a deep optimisation of the layout analysis algorithms. The document you provided will be very useful for benchmarking, though.
After further analysis, the letter count can be brought down to 300k by only taking into account the letters that lie within the boundary of the page.
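As a possible workaround until the extractor is optimised, the filtering described above can be sketched as follows. This is only a rough sketch under several assumptions: that "the boundary of the page" means the page's crop box, that a letter counts as inside when its `Location` point falls within that box, and that the affected page is page 1 of the attached file (the thread does not say which page). The helper name `GetWordsInsidePage` is hypothetical and not part of PdfPig. Note that `GetPage()` still has to load every letter, so this only reduces the work handed to the word extractor.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

public static class BoundedWordExtraction
{
    // Hypothetical helper: keeps only letters whose location point lies
    // inside the crop box, then runs the word extractor on that subset.
    public static IReadOnlyList<Word> GetWordsInsidePage(Page page)
    {
        // Assumption: the crop box is the relevant "page boundary".
        var bounds = page.CropBox.Bounds;

        var lettersInside = page.Letters
            .Where(l => l.Location.X >= bounds.Left
                     && l.Location.X <= bounds.Right
                     && l.Location.Y >= bounds.Bottom
                     && l.Location.Y <= bounds.Top)
            .ToList();

        // Call the extractor directly on the filtered letters instead of
        // page.GetWords(...), which would use every letter on the page.
        return NearestNeighbourWordExtractor.Instance
            .GetWords(lettersInside)
            .ToList();
    }

    public static void Main()
    {
        // Assumption: the attached file, with the problem on page 1.
        using (var document = PdfDocument.Open("0020.pdf"))
        {
            var page = document.GetPage(1);
            var words = GetWordsInsidePage(page);

            Console.WriteLine(
                $"{page.Letters.Count} letters on the page, " +
                $"{words.Count} words found inside the crop box");
        }
    }
}
```

Testing on the letter's location point is an approximation: a glyph that straddles the edge of the crop box may be kept or dropped depending on where that point falls.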
Related to #681
0020.pdf
I am having an issue with a particular PDF. The PDF itself is larger than most that I use PdfPig for, at around 13 MB (normally my PDFs are <1 MB). Calling the GetPage() method takes longer than normal (about 5 seconds instead of being instant), but it does succeed. The GetWords() method, however, hangs for a long time (multiple minutes) before eventually crashing completely. In that time, memory usage shoots right up: I end up with a 1.5 GB GC heap size and an allocation rate of around 5 GiB, according to the diagnostics session in Visual Studio.
I cannot even catch the error with a try/catch.
Any help would be great, even if it is just a way to catch the crash gracefully. I've attached a snapshot of the memory.