UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

Memory Issues on GetWords() and crashes with given file #820

Open stephen-williamson opened 5 months ago

stephen-williamson commented 5 months ago

0020.pdf

I am having an issue with a given PDF. The PDF itself is larger than most that I use PdfPig for, at around 13 MB (normally my PDFs are under 1 MB). It takes longer than normal to call the GetPage() method (about 5 seconds instead of being near-instant), but it does succeed. The GetWords() method, however, hangs for a long time (multiple minutes) before eventually crashing.

In that time, memory shoots right up: I end up with a 1.5 GB GC heap size and around 5 GiB allocation rate, looking at the diagnostics session in Visual Studio.

I cannot even catch the error with a try/catch.
Any help would be great, even if it is just a way to catch the crash gracefully. I've attached a snapshot of the memory usage.

using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.ReadingOrderDetector;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

var path = @"C:\Users\stephen.williamson\Downloads\0020.pdf";

using (var document = PdfDocument.Open(path))
{
    for (var i = 0; i < document.NumberOfPages; i++)
    {
        var page = document.GetPage(i + 1); // This line takes about 5 seconds

        var words = page.GetWords(NearestNeighbourWordExtractor.Instance); // Here it crashes, but if I remove the parameter it will crash on the next line instead
        var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
        var orderedBlocks = DefaultReadingOrderDetector.Instance.Get(blocks);

        Console.WriteLine("((TEXT SECTION))");

        foreach (var block in orderedBlocks)
        {
            Console.WriteLine("==BLOCK==");
            Console.WriteLine(block.Text);

            // Do something
        }
    }
}

[attached image: Visual Studio memory diagnostics snapshot]

BobLd commented 4 months ago

@stephen-williamson Thanks for sharing the document. The main issue I see with your document is that the page contains about 2 million letters. NearestNeighbourWordExtractor was not designed to handle that many letters.

Fixing that involves a deep optimisation of the layout analysis algorithms. The document you provided will be very useful for benchmarking, though.
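As a stopgap until the layout analysis is optimised, a caller could guard against pathological pages by checking the page's letter count before running word extraction, since `Page.Letters` exposes the raw letters cheaply. A minimal sketch (the 100,000 threshold is an arbitrary assumption, not a library recommendation):

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

using (var document = PdfDocument.Open(@"C:\Users\stephen.williamson\Downloads\0020.pdf"))
{
    for (var i = 1; i <= document.NumberOfPages; i++)
    {
        var page = document.GetPage(i);

        // Page.Letters is available before word extraction, so a cheap
        // count check can skip pages that would blow up GetWords().
        if (page.Letters.Count > 100_000) // arbitrary threshold (assumption)
        {
            Console.WriteLine($"Skipping page {i}: {page.Letters.Count} letters.");
            continue;
        }

        var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
        // ... layout analysis as before
    }
}
```

This does not fix the underlying cost, but it turns a multi-minute hang and crash into a logged skip.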

BobLd commented 4 months ago

After further analysis, the letter count can be brought down to 300k by only taking into account the letters that fall within the boundary of the page.
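A caller could apply the same boundary filter today before word extraction, because the word extractors accept a letter list directly rather than only a `Page`. A sketch, assuming the page's crop box is the boundary of interest (the containment test is a deliberately simple bounding-box comparison, not the library's internal logic):

```csharp
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

using (var document = PdfDocument.Open(@"C:\Users\stephen.williamson\Downloads\0020.pdf"))
{
    var page = document.GetPage(1);
    var bounds = page.CropBox.Bounds;

    // Keep only letters whose glyph rectangle lies entirely inside the crop box.
    var lettersInBounds = page.Letters
        .Where(l => l.GlyphRectangle.Left >= bounds.Left
                 && l.GlyphRectangle.Right <= bounds.Right
                 && l.GlyphRectangle.Bottom >= bounds.Bottom
                 && l.GlyphRectangle.Top <= bounds.Top)
        .ToList();

    // The filtered letter list is passed in place of page.GetWords(...).
    var words = NearestNeighbourWordExtractor.Instance.GetWords(lettersInBounds);
}
```

On the attached document this kind of filter is what brings the letter count from roughly 2 million down to about 300k before the expensive nearest-neighbour pass runs.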

Related to #681