UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

Sequentially extract text and images #778

Closed nk-alex closed 7 months ago

nk-alex commented 7 months ago

I have a PDF that contains a number of images with text in them, as well as plain text.

I would like to iterate over the elements in my PDF sequentially, so I can extract all the text (both from the images and from the plain text) in order.

Is this possible with the current implementation?

Greetings

nk-alex commented 7 months ago

I tried this approach

 // Assumes pdf_text is a string declared earlier and ExtractTextFromImage
 // is an OCR helper defined elsewhere (neither is part of PdfPig).
 using (PdfDocument document = PdfDocument.Open(fileContent))
 {
     foreach (UglyToad.PdfPig.Content.Page page in document.GetPages())
     {
         // OrderBy does not sort in place, so keep the sequence it returns.
         IEnumerable<MarkedContentElement> markedContentElements =
             page.GetMarkedContents().OrderBy(mc => mc.MarkedContentIdentifier);

         foreach (MarkedContentElement markedContentElement in markedContentElements)
         {
             // Text found inside images, via OCR.
             foreach (var image in markedContentElement.Images)
             {
                 pdf_text += ExtractTextFromImage(image.RawBytes.ToArray());
             }

             // Plain text, one letter at a time.
             foreach (var letter in markedContentElement.Letters)
             {
                 pdf_text += letter.Value;
             }
         }
     }
 }

But the text is not quite ordered sequentially. By sequentially I mean ordered by y coordinate descending, then by x coordinate ascending.

BobLd commented 7 months ago

@nk-alex First, I'm not sure why you're looking into the marked content of the page, but feel free to add details.

It seems your issue is threefold:

  1. Extract plain text from the page

     To be able to order your plain text later, a good approach is to get each text block in the page. Have a look here: https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis#page-segmenters

  2. Extract text from the images

     PdfPig can't do that out of the box, so you will need to run OCR on the images. One way to do OCR in C# is to use https://github.com/charlesw/tesseract

  3. Order the text from both points above

     Once you have the text from both sources, and with a bit of work on the OCR output, you could order your blocks using one of PdfPig's reading order detectors. Have a look here to know more: https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis#reading-order-detectors
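
The three points above could be sketched roughly as follows. This is only an illustration, not a tested recipe: the class and method names (`PageTextSketch`, `GetOrderedBlocks`, `OcrImage`) are made up for this example, and the choice of `NearestNeighbourWordExtractor`, `DocstrumBoundingBoxes` and `UnsupervisedReadingOrderDetector` is an assumption; other extractors, segmenters or detectors from the wiki pages may suit a given document better. It assumes the PdfPig and Tesseract NuGet packages are referenced, and that the image can be converted to PNG at all.

```csharp
using System.Collections.Generic;
using System.Linq;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.ReadingOrderDetector;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

// Hypothetical helper class, named for this example only.
public static class PageTextSketch
{
    // Points 1 and 3 for plain text: group words into text blocks,
    // then let a reading order detector sort the blocks.
    public static List<TextBlock> GetOrderedBlocks(Page page)
    {
        IEnumerable<Word> words = page.GetWords(NearestNeighbourWordExtractor.Instance);
        IReadOnlyList<TextBlock> blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
        return UnsupervisedReadingOrderDetector.Instance.Get(blocks).ToList();
    }

    // Point 2: run OCR on a single image. Tesseract types are fully
    // qualified because Tesseract also defines a type named Page.
    public static string OcrImage(IPdfImage image, Tesseract.TesseractEngine engine)
    {
        // Not every PDF image can be converted; skip the ones that cannot.
        if (!image.TryGetPng(out byte[] png))
        {
            return string.Empty;
        }

        using (Tesseract.Pix pix = Tesseract.Pix.LoadFromMemory(png))
        using (Tesseract.Page ocrPage = engine.Process(pix))
        {
            return ocrPage.GetText();
        }
    }
}
```

The engine would be created once and reused, for instance with `new Tesseract.TesseractEngine("./tessdata", "eng", Tesseract.EngineMode.Default)`, where the `./tessdata` path and the `eng` language are placeholders. The reading order detector only covers the text blocks here, so the OCR output still has to be merged in using the image bounds.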

nk-alex commented 7 months ago

Thank you for the support @BobLd and sorry for the delay. The information you provided was really useful. I ended up doing this for every single PDF page:

  1. Extract TextBlocks from the current page:

        IEnumerable<Word> words = page.GetWords(NearestNeighbourWordExtractor.Instance);
        IReadOnlyList<TextBlock> text_blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);

  2. Extract images from the current page:

        IEnumerable<IPdfImage> images = page.GetImages();

  3. Extract text from those images using OCR.

  4. Both TextBlock and IPdfImage expose a bounding box, so I just need to order those bounding boxes however I like.
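
The last step could look something like the sketch below, which sorts by y coordinate descending and then by x coordinate ascending, as described earlier in the thread. It is not the code actually used: `PageItem`, `PageOrdering`, `JoinInReadingOrder` and the `lineTolerance` parameter are hypothetical names introduced for illustration, and the block deliberately avoids PdfPig types so it can run on its own. The tolerance buckets nearly equal `Top` values into the same row; it is a crude heuristic, and two items very close together can still land in different buckets at a bucket boundary.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container: one piece of text plus the top-left corner of
// its bounding box, in PDF coordinates (y grows upwards).
public sealed record PageItem(string Text, double Left, double Top);

public static class PageOrdering
{
    // Orders items top to bottom, then left to right, and joins their text.
    public static string JoinInReadingOrder(
        IEnumerable<PageItem> items, double lineTolerance = 2.0)
    {
        var ordered = items
            // Bucket Top so items on roughly the same row compare as equal.
            .OrderByDescending(i => Math.Round(i.Top / lineTolerance))
            // Within a row, the leftmost item comes first.
            .ThenBy(i => i.Left);

        return string.Join(Environment.NewLine, ordered.Select(i => i.Text));
    }
}
```

With PdfPig, a text block would presumably map to `new PageItem(block.Text, block.BoundingBox.Left, block.BoundingBox.Top)` and an OCR result to `new PageItem(ocrText, image.Bounds.Left, image.Bounds.Top)`.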