Closed nk-alex closed 7 months ago
I tried this approach
using (PdfDocument document = PdfDocument.Open(fileContent))
{
    foreach (UglyToad.PdfPig.Content.Page page in document.GetPages())
    {
        IReadOnlyList<MarkedContentElement> markedContentElements = page.GetMarkedContents();
        // OrderBy returns a new sequence rather than sorting in place, so iterate over its result.
        foreach (MarkedContentElement markedContentElement in markedContentElements.OrderBy(mc => mc.MarkedContentIdentifier))
        {
            // Run OCR on every image in this marked content element.
            foreach (var image in markedContentElement.Images)
            {
                pdf_text += ExtractTextFromImage(image.RawBytes.ToArray());
            }
            // Append the plain text letter by letter.
            foreach (var letter in markedContentElement.Letters)
            {
                pdf_text += letter.Value;
            }
        }
    }
}
But the text is still not quite in sequential order. By sequential I mean sorted by y coordinate descending, then by x coordinate ascending.
@nk-alex First, I'm not sure why you're looking into the marked content of the page, but feel free to add details.
It seems your issue is threefold:
To be able to order your plain text later, a good approach could be to get each text block in the page. Have a look here: https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis#page-segmenters
PdfPig can't extract text from images out of the box, so you will need to run OCR on the images. One way to do OCR in C# is to use https://github.com/charlesw/tesseract
Once you have your text from the two steps above, and with a bit of work on the OCR output, you could order your blocks using one of PdfPig's reading order detectors. To learn more, have a look here: https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis#reading-order-detectors
Thank you for the support @BobLd and sorry for the delay. The information you provided was really useful. I ended up doing this for every single PDF page:
Extract TextBlocks from current page:
IEnumerable<Word> words = page.GetWords(NearestNeighbourWordExtractor.Instance);
IReadOnlyList<TextBlock> text_blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
Extract images from current page:
IEnumerable<IPdfImage> images = page.GetImages();
Extract text from those images using OCR.
Both TextBlock and IPdfImage have information about the bounding box. I just need to order those bounding boxes as I like.
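For anyone landing here later, the ordering step could look roughly like the sketch below. It is not taken from my actual code: PageItem, ReadingOrder and the tolerance parameter are hypothetical names made up for illustration, and it assumes you have already copied the text and the top-left corner of each bounding box (from the text blocks and from the OCR'd images) into one common list. PDF coordinates grow upward, so a larger Top value means higher on the page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container: one piece of page content (text block or OCR'd image)
// plus the top-left corner of its bounding box, in PDF points.
public record PageItem(string Text, double Left, double Top);

public static class ReadingOrder
{
    // Orders items top-to-bottom, then left-to-right within a row.
    // Items whose Top lies within `tolerance` points of the first item
    // of the current row are treated as belonging to that row.
    public static List<PageItem> Sort(IEnumerable<PageItem> items, double tolerance = 2.0)
    {
        var byTop = items.OrderByDescending(i => i.Top).ToList();
        var result = new List<PageItem>();
        var row = new List<PageItem>();

        foreach (var item in byTop)
        {
            // Start a new row when this item sits clearly below the current row.
            if (row.Count > 0 && row[0].Top - item.Top > tolerance)
            {
                result.AddRange(row.OrderBy(i => i.Left));
                row.Clear();
            }
            row.Add(item);
        }

        // Flush the last row.
        result.AddRange(row.OrderBy(i => i.Left));
        return result;
    }
}
```

The row tolerance is there because two boxes on the same visual line rarely share exactly the same y coordinate, so a strict sort on y alone can interleave them. The default of 2 points is a guess and will probably need tuning per document.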
Basically, I have a PDF that contains a bunch of images with text in them, as well as plain text.
I would like to iterate sequentially over the elements in my PDF so I can extract all the text (from images and from plain text) in order.
Is this possible with the current implementation?
Greetings