Sequentially extract text and images

UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)

https://github.com/UglyToad/PdfPig/wiki

Apache License 2.0


Sequentially extract text and images #778

Closed nk-alex closed 7 months ago

nk-alex commented 7 months ago

I have a PDF that contains a bunch of images with text in them, as well as plain text.

I would like to iterate over the elements in my PDF in order, so that I can extract all the text (from both the images and the plain text) sequentially.

Is this possible with the current implementation?

Greetings

nk-alex commented 7 months ago

I tried this approach

 // Requires: using System.Linq; using UglyToad.PdfPig; using UglyToad.PdfPig.Content;
 using (PdfDocument document = PdfDocument.Open(fileContent))
 {
     foreach (UglyToad.PdfPig.Content.Page page in document.GetPages())
     {
         // OrderBy returns a new sequence instead of sorting in place,
         // so its result has to be kept and iterated.
         IEnumerable<MarkedContentElement> markedContentElements = page.GetMarkedContents()
             .OrderBy(mc => mc.MarkedContentIdentifier);

         foreach (MarkedContentElement markedContentElement in markedContentElements)
         {
             // ExtractTextFromImage is my own OCR helper (not shown here).
             foreach (var image in markedContentElement.Images)
             {
                 pdf_text += ExtractTextFromImage(image.RawBytes.ToArray());
             }

             foreach (var letter in markedContentElement.Letters)
             {
                 pdf_text += letter.Value;
             }
         }
     }
 }

But the text is not quite ordered sequentially. By sequentially I mean ordered by y coordinate descending, then by x coordinate ascending.

BobLd commented 7 months ago

@nk-alex First, I'm not sure why you're looking into the marked content of the page, but feel free to add details.

It seems your issue is threefold:

1. Extract plain text from the page
2. Extract text from the images
3. Order text from both points above

Extract plain text from the page

To be able to order your plain text later, a good approach is to get each text block in the page. Have a look at the page segmenters here: https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis#page-segmenters

Extract text from the images

PdfPig can't do that out of the box; you will need to run OCR on the images. One way to do OCR in C# is to use https://github.com/charlesw/tesseract
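A minimal sketch of what that OCR step could look like, assuming the Tesseract NuGet package from the repository above and an English traineddata file in a local ./tessdata folder. The ImageOcr class and its ExtractText method are hypothetical names introduced for illustration, and the fallback to the raw image bytes is an assumption that only works for encodings Tesseract can decode directly (for example JPEG):

```csharp
using System.Linq;
using Tesseract;
using UglyToad.PdfPig.Content;

// Hypothetical helper, not part of PdfPig or Tesseract.
public static class ImageOcr
{
    public static string ExtractText(TesseractEngine engine, IPdfImage image)
    {
        // Prefer PdfPig's PNG conversion; fall back to the raw bytes, which
        // Tesseract can only decode for some encodings (assumption: JPEG).
        byte[] bytes = image.TryGetPng(out byte[] png)
            ? png
            : image.RawBytes.ToArray();

        // Pix.LoadFromMemory throws if the bytes are not a supported image format.
        using (var pix = Pix.LoadFromMemory(bytes))
        using (var ocrPage = engine.Process(pix))
        {
            return ocrPage.GetText();
        }
    }
}
```

The engine would be created once and reused for every image, for example with `new TesseractEngine("./tessdata", "eng", EngineMode.Default)`, since constructing it is comparatively expensive.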

Order text from both points above

Once you have your text from above, and with a bit of work on the OCR output, you could order your blocks using one of PdfPig's reading order detectors. To learn more, have a look here: https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis#reading-order-detectors
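As a rough sketch of the plain-text half of this, the snippet below segments each page into text blocks and passes them through one of the reading order detectors. The path "input.pdf" is a placeholder, the choice of DocstrumBoundingBoxes and UnsupervisedReadingOrderDetector is only one possible combination, and the top-level statements assume C# 9 or later. The detectors take TextBlock instances, so OCR output is not covered here:

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.ReadingOrderDetector;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

// "input.pdf" is a placeholder path.
using (PdfDocument document = PdfDocument.Open("input.pdf"))
{
    foreach (var page in document.GetPages())
    {
        // Group letters into words, then words into text blocks.
        var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
        var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);

        // Ask the detector for the blocks in its estimated reading order.
        var ordered = UnsupervisedReadingOrderDetector.Instance.Get(blocks);

        foreach (var block in ordered)
        {
            Console.WriteLine(block.Text);
        }
    }
}
```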

nk-alex commented 7 months ago

Thank you for the support @BobLd and sorry for the delay. The information you provided was really useful. I ended up doing this for every single PDF page:

Extract TextBlocks from current page:

    IEnumerable<Word> words = page.GetWords(NearestNeighbourWordExtractor.Instance);
    IReadOnlyList<TextBlock> text_blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);

Extract images from current page:

    IEnumerable<IPdfImage> images = page.GetImages();

Extract text from those images using OCR.

Both TextBlock and IPdfImage have information about the bounding box. I just need to order those bounding boxes as I like.
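The final ordering step is not shown in the thread. One way it might look, assuming the ordering asked for earlier (y descending, then x ascending) and PDF coordinates with the origin at the bottom-left of the page, is sketched below. PageTextOrdering, GetOrderedText and the ocr delegate are hypothetical names; the delegate stands in for whichever OCR routine is used:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.DocumentLayoutAnalysis;

// Hypothetical helper, not part of PdfPig.
public static class PageTextOrdering
{
    public static string GetOrderedText(
        IEnumerable<TextBlock> textBlocks,
        IEnumerable<IPdfImage> images,
        Func<IPdfImage, string> ocr) // ocr: caller-supplied OCR routine (assumption)
    {
        // Put text blocks and OCR'd images into one list of (box, text) pairs.
        var items = new List<(PdfRectangle Box, string Text)>();
        items.AddRange(textBlocks.Select(b => (Box: b.BoundingBox, Text: b.Text)));
        items.AddRange(images.Select(i => (Box: i.Bounds, Text: ocr(i))));

        // Larger Top means higher on the page, so descending Top reads
        // top to bottom; ties are broken left to right.
        var ordered = items
            .OrderByDescending(x => x.Box.Top)
            .ThenBy(x => x.Box.Left);

        return string.Join(Environment.NewLine, ordered.Select(x => x.Text));
    }
}
```

A strict sort on Top treats two boxes on the same visual line as different rows if their tops differ even slightly, so multi-column or loosely aligned layouts may need a tolerance or one of the reading order detectors instead.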