UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0
1.57k stars 225 forks source link

XYLeaf.GetLines collect lines not robust enough #844

Open hjy1210 opened 3 weeks ago

hjy1210 commented 3 weeks ago

I tryed to extract text from history-2-3.pdf.

Part image of the history-2-3.pdf is history-2-3-item1 It is a question item with item label "1." , item label followed by question body "新北市金瓜石的國際終戰和平紀念園區...".

But the extracted text are:

新北市金瓜石的國際終戰和平紀念園區,立有一面紀念牆,上面刻有第二次世界大戰時拘
1.
留在此的盟軍戰俘姓名與國籍。這些戰俘中,許多是紐西蘭及加拿大軍人。他們的國家並
未遭到日軍攻擊,卻仍參戰,其原因為何?
(A)受雇到海外作戰的傭兵 (B)由大英國協動員的軍隊
(C)反法西斯主義的志願軍 (D)聯合國維和部隊的成員

Note: the item label 1. is followed the first line 新北市金瓜石的國際終戰和平紀念園區,立有一面紀念牆,上面刻有第二次世界大戰時拘 of the question content. This phenomenon will cause trouble when searching text.

Following is my code:

            string sourcePdfPath = "history-2-3.pdf";
            string outputPdfPath = "history-2-3-pigboxed.pdf";
            string outputTextPath = "history-2-3-pigboxed.txt";

            using (var document = PdfDocument.Open(sourcePdfPath))
            {
                var builder = new PdfDocumentBuilder { };
                PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);
                StringBuilder sb = new StringBuilder();
                for (var pageNumber = 1; pageNumber <= document.NumberOfPages; pageNumber++)
                {
                    var pageBuilder = builder.AddPage(document, pageNumber);
                    pageBuilder.SetStrokeColor(128, 128, 128);
                    var page = document.GetPage(pageNumber);

                    var letters = page.Letters; // no preprocessing

                    // 1. Extract words
                    var wordExtractor = NearestNeighbourWordExtractor.Instance;

                    var words = wordExtractor.GetWords(letters);

                    // 2. Segment page
                    var recursiveXYCut = new RecursiveXYCut(new RecursiveXYCut.RecursiveXYCutOptions()
                    {
                        MinimumWidth = page.Width 
                    });
                    var textBlocks = recursiveXYCut.GetBlocks(words);
                    // 3. Postprocessing
                    var readingOrder = UnsupervisedReadingOrderDetector.Instance;
                    var orderedTextBlocks = readingOrder.Get(textBlocks);
                    // 4. Add debug info - Bounding boxes and reading order
                    foreach (var block in orderedTextBlocks)
                    {
                        var bbox = block.BoundingBox;
                        pageBuilder.DrawRectangle(bbox.BottomLeft, bbox.Width, bbox.Height);
                        pageBuilder.AddText($"{pageNumber}-{block.ReadingOrder.ToString()}", 8, bbox.TopLeft, font);
                        sb.AppendLine(block.Text);
                    }
                }

                // 5. Write result to a file
                byte[] fileBytes = builder.Build();
                if (!string.IsNullOrEmpty(outputPdfPath))
                    File.WriteAllBytes(outputPdfPath, fileBytes); // save to file
                if (!string.IsNullOrEmpty(outputTextPath))
                    File.WriteAllText(outputTextPath, sb.ToString());
            }

I guess this issue is due to: XYLeaf.GetLines use Words.Groupy(x => x.BoundingBox.Bottom) is not robust if words baselines/bottoms have some tiny differences.

davebrokit commented 3 weeks ago

Looks like it would be reading order detector causing that. That is what orders the text.

Have you tried using reading order detector with a higher T?

var readingOrder = new UnsupervisedReadingOrderDetector(10, UnsupervisedReadingOrderDetector.SpatialReasoningRules.RowWise);

See https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis#unsupervised-reading-order-detector

hjy1210 commented 3 weeks ago

@davebrokit The code use RecursiveXYCut with MinimumWidth = page.Width, so there is single TextBlock in each page. I guess this issue is noting to do with reading order of TextBlocks. If, after statement var letters = page.Letters; add

                    foreach (var letter in letters)
                    {
                        Console.WriteLine($"{letter}");
                    }

We can get following informations on console:

...
1 (x:63.864, y:672.67) TimesNewRomanPSMT 11.04
. (x:69.624, y:672.67) TimesNewRomanPSMT 11.04
  (x:72.744, y:672.67) TimesNewRomanPSMT 11.04
新 (x:82.344, y:673.15) BCDHEE+PMingLiU 11.04
北 (x:93.84767999999998, y:673.15) BCDHEE+PMingLiU 11.04
市 (x:105.35135999999999, y:673.15) BCDHEE+PMingLiU 11.04
...

we can see the y coordinate of label 1 is 672.67, but the y coordinate of first letter of item content is 673.15. Because 672.67 is 0.48 points lower than 673.15, so the TextLine of 1. is followed after TextLine of 新北市... when XYLeaf.GetLines collect lines.

I guess may be this issue is caused by different fonts: the font of 1. is TimesNewRomanPSMT, the font of 新北市... is BCDHEE+PMingLiU That is why I use the title XYLeaf.GetLines collect lines not robust enough

davebrokit commented 2 weeks ago

Ah understood. Your code will only produce one block.

It's not the responsibilty of RecursiveXYCut to interpret the reading order of the text. It reads text in effectively a rendering order so depending on how your pdf is rendered it's results will vary. Aka it's not reliable for determing text order.

However, it is the responsibilty of Reading Order Detector to interpert the reading order of the text. Unfortunately it can't do that when there is just one block for the page.

What you need to do is chop up the page into smaller text blocks and then use reading order detector to order the blocks correctly. The code you have here:

                    // 3. Postprocessing
                    var readingOrder = UnsupervisedReadingOrderDetector.Instance;
                    var orderedTextBlocks = readingOrder.Get(textBlocks);

Example of choping up into smaller blocks:

            var recursiveXYCut = new RecursiveXYCut(new RecursiveXYCut.RecursiveXYCutOptions()
            {
                DominantFontWidthFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Width) * 2,
                DominantFontHeightFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Height)/2
            });

Result image

You're doing Document layout analysis. The wiki here has a good guide about it: https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis

davebrokit commented 2 weeks ago

Please note there was a fix to UnsupervisedReadingOrderDetector recently so you need to use the nightly build

see #842

hjy1210 commented 2 weeks ago

@davebrokit Even use your options to choping into small text blocks as below history-2-3-item1-2 Now the item label 1. and first line of item conetent 新北市... are in the 9-th block of page 1. the extracted text is STILL as follows:

新北市金瓜石的國際終戰和平紀念園區,立有一面紀念牆,上面刻有第二次世界大戰時拘
1.
留在此的盟軍戰俘姓名與國籍。這些戰俘中,許多是紐西蘭及加拿大軍人。他們的國家並
未遭到日軍攻擊,卻仍參戰,其原因為何?

Note: The label 1. is still followed the first line of item content 新北市.... PS: I use PdfPig(0.1.9-alpha-20240612-d2cae)

I think UnsupervisedReadingOrderDetector is about order between-textblocks, but this issue is about within-textblocks.

davebrokit commented 2 weeks ago

Try cutting into smaller blocks so the 1. is in a block by itself.

This might work: DominantFontWidthFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Width) or try l.GlyphRectangle.Width/2

var recursiveXYCut = new RecursiveXYCut(new RecursiveXYCut.RecursiveXYCutOptions()
{
    DominantFontWidthFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Width),
    DominantFontHeightFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Height)/2
});
hjy1210 commented 2 weeks ago

@davebrokit cutting into smaller blocks so the 1. is in a block by itself. will make full page with too many small textblocks. Every text in Textblock ends with newline, causing 1. and 新北市...` NOT in the same line, then the shape of extracted text in one page VERY not Similar to the original shape of pdf. I want keep the shape of extracted text similar to the original pdf.

davebrokit commented 1 week ago

@BobLd Please chime in on this discussion. He wrote the RecursiveXYCut code

Solving for your specific case

Your use of RecursiveXYCut was confusing the issue for me. As you are only getting one block it doesn't really make sense to use it in your example.

Here is the mimimum code I got to get your example working. RecursiveXYCut is not used and I do GroupBy(x => (int) x.BoundingBox.Bottom/5) which works in this case but is not very robust around the boundaries of the groups (aka you can still have your problem where there is a 0.001 difference between 2 coordinates and they fall into two different buckets)

var letters = page.Letters; // no preprocessing

// 1. Extract words
var wordExtractor = NearestNeighbourWordExtractor.Instance;

var words = wordExtractor.GetWords(letters);

// 2. Segment page into lines
var lines = words.GroupBy(x => (int) x.BoundingBox.Bottom/5)
    .Select(x => new TextLine(x.OrderByReadingOrder()))
    .OrderByReadingOrder();
var cleanedTextBlocks = new List<TextBlock>() { new TextBlock(lines) };

// 3. Postprocessing. Not sure it's necessary
var orderedTextBlocks = readingOrder.Get(cleanedTextBlocks);

Now with the general case:

RecursiveXYCut is a general algorithm designed to cut the page into blocks. Order within the block is not guaranteed. A naive and fast implementation of ordering into lines within a Block has been implemented for our convience. This works with the general case.

words.GroupBy(x => (int) x.BoundingBox.Bottom/tolerance) where tolerance is a parameter passed through. But you still have issues with lines where the points used fall on the boundaries and end up in two different buckets.

var letters = page.Letters; // no preprocessing

// 1. Extract words
var wordExtractor = NearestNeighbourWordExtractor.Instance;
var words = wordExtractor.GetWords(letters);

// 2. Segment page into blocks
var pageRecursiveXYCut = new RecursiveXYCut(new RecursiveXYCut.RecursiveXYCutOptions()
{
    MinimumWidth = page.Width
});
var textBlocks = pageRecursiveXYCut.GetBlocks(words);

// 2. Segment blocks into lines
var lineRecursiveXYCut = new RecursiveXYCut(new RecursiveXYCut.RecursiveXYCutOptions()
{
    DominantFontHeightFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Height) / 6,
    DominantFontWidthFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Width) * 60,
    //MinimumWidth = page.Width - This does work in this case. We have to use Dominant Font Width to handle it
});

var cleanedTextBlocks = new List<TextBlock>();
foreach (var block in textBlocks)
{
    // 2.1 Segmenet into lines
    var resLines = new List<TextLine>();
    var wordsInBlock = block.TextLines.SelectMany(x => x.Words);
    var lines = lineRecursiveXYCut.GetBlocks(wordsInBlock);

    // 2.2 Reorder text in lines
    foreach (var line in lines)
    {
        var blockWords = line.TextLines.SelectMany(x => x.Words).ToArray();
        var orderedWordsInLine = new TextLine(blockWords.OrderByReadingOrder());
        resLines.Add(orderedWordsInLine);
    }

    var newBlock = new TextBlock(resLines.OrderByReadingOrder());
    cleanedTextBlocks.Add(newBlock);
}

// 3. Postprocessing
var orderedTextBlocks = readingOrder.Get(cleanedTextBlocks);

Please let me know if you have other solutions to the problem

Conclusion

I would argue that there is value in doing

If @BobLd is in agreement do you fancy raising a PR for that :). You would be a legend!

Please let me know if I've missed something or you have a solution to the problem :)

p.s. We might also need more general/user friendly functions such as a line segmenter. The library has the building blocks but right now you need to stitch them together yourself when it falls outside of the general cases.

hjy1210 commented 1 week ago

Yes, BobLd's solution "A robust implementation would require using RecursiveXYCut on the blocks to cut into single lines. Then reordering the words within the line. " is what I want.

davebrokit commented 1 week ago

I don't think that's a good solution due to it's inefficency and you're effectively using RecursiveXYCut on it's self. It's also something you can implement yourself for your specific case but it would be overkill for most people in most cases.

But we'll wait for wiser heads like @BobLd to chime in :)

BobLd commented 1 week ago

@davebrokit I agree with you, RecursiveXYCut should not be used on itself and should only do the segmentation at block level as intented.

In order to fix the issue raised here, I think we need to give the possibility to end users to provide their own OrderByReadingOrder() - which I think is the culprit here (since even when the line is correctly build, the digit is not ordered correctly).

I did not allow for that when first wrote these classes as to keep to development simple - but now is a good time to think about how to expend.

On a different note, adding an ILineSegmenter class might not be as easy as you'd think because some algos like Docstrum are bottom up, and other like RecursiveXYCut are top down. But more than happy to discuss that on a different issue if you have ideas you'd wnat to share

davebrokit commented 1 week ago

@BobLd Thanks for the response

I agree with the need to give the possibility to end users to provide their own OrderByReadingOrder() but that would not fix the issue raised here as the problem comes from the line segmenter words.GroupBy(x => x.BoundingBox.Bottom). In this issue BoundingBox.Bottom of the 1. and the rest of the words on the line have a slight difference which means they are treated as 2 different lines.

And totally agree that adding a general ILineSegmenter is not easy. I've tried :). What works tends to be specific to the pdf document.

What I propose we do is:

  1. Allow users to pass their own reading order detector as you suggested

  2. Add a int line segmentation tolerance parameter to RecursiveXYCutOptions. When provided we use this line segmenter instead: words.GroupBy(x => (int) x.BoundingBox.Bottom/tolerance) (Note the use of integer rounding to allow minor variances in BoundingBox.Bottom to be ignored. It won't break existing code and tests as it only gets used when specified).

  3. Allow users to pass their own line segmenter function to use (would default to 1., this allows you to override it if you need more control). Func<IEnumerable<Word>, IEnumerable<TextLine>>. This is optional so happy to leave out if you don't agree.

I can make those code changes if you're happy with that

hjy1210 commented 1 week ago

Instead of words.GroupBy(x => x.BoundingBox.Bottom), may be we can

  1. first order words by BoundingBox.Bottom
  2. then group word based on criterion:range length of BoundingBox.Bottom of word in group <= T, where T is a nonnegative tolerance parameter.

Note: T=0 is the original implementation

davebrokit commented 1 week ago

@hjy1210 Thanks for that suggestion.

Step 2. is actually hard to implement so if you have code that can do that please share. You effectively need to cluster together words and that is what RecursiveXYCut and Docstrum algorithms try to solve :).

BobLd commented 1 week ago

@davebrokit happy for you to create a PR for

Add a int line segmentation tolerance parameter to RecursiveXYCutOptions.

hjy1210 commented 1 week ago

@davebrokit I use IReadOnlyList<TextLine> GetLines(IReadOnlyList<Word> words, string wordSeparator, double tolerance=0) to construct lines in TextBlock to demonstrate how to modify XYLeaf's GetLines. Because TextBlock have no property Words as XYLeaf, I use IReadOnlyList<Word> GetWordsInBlock(TextBlock block, IEnumerable<Word> words) to get words contained in TextBlock.

Following code is my demo:

        [Test]
        public void TestPigTextLineIssue6()
        {
            /// https://github.com/UglyToad/PdfPig/issues/844
            string sourcePdfPath = "history-2-3.pdf";
            string outputPdfPath = "history-2-3-pigline6boxed.pdf";
            string outputTextPath = "history-2-3-pigline6boxed.txt";

            using (var document = PdfDocument.Open(sourcePdfPath))
            {
                var builder = new PdfDocumentBuilder { };
                PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);
                StringBuilder sb = new StringBuilder();
                for (var pageNumber = 1; pageNumber <= document.NumberOfPages; pageNumber++)
                {
                    var pageBuilder = builder.AddPage(document, pageNumber);
                    pageBuilder.SetStrokeColor(128, 128, 128);
                    var page = document.GetPage(pageNumber);

                    var letters = page.Letters; // no preprocessing

                    // 1. Extract words
                    var wordExtractor = NearestNeighbourWordExtractor.Instance;

                    var words = wordExtractor.GetWords(letters);  // words may contains word with single space
                    foreach (var word in words)
                    {
                        pageBuilder.DrawRectangle(word.BoundingBox.BottomLeft, word.BoundingBox.Width, word.BoundingBox.Height);
                    }

                    // 2. Segment page
                    var recursiveXYCut = new RecursiveXYCut(new RecursiveXYCut.RecursiveXYCutOptions()
                    {
                        MinimumWidth = page.Width
                    });
                    var textBlocks = recursiveXYCut.GetBlocks(words);
                    // 3. Postprocessing
                    var readingOrder = UnsupervisedReadingOrderDetector.Instance;
                    var orderedTextBlocks = readingOrder.Get(textBlocks);
                    // 4. Add debug info - Bounding boxes and get text from textblock
                    foreach (var block in orderedTextBlocks)
                    {
                        var bbox = block.BoundingBox;
                        pageBuilder.DrawRectangle(bbox.BottomLeft, bbox.Width, bbox.Height);
                        pageBuilder.AddText($"{pageNumber}-{block.ReadingOrder.ToString()}", 8, bbox.TopLeft, font);
                        // following two line are the main code
                        var ws = GetWordsInBlock(block, words);
                        var lines = GetLines(ws, " ");
                        foreach (var line in lines)
                        {
                            sb.AppendLine(line.Text);
                        }
                    }
                }

                // 5. Write result to a file
                byte[] fileBytes = builder.Build();
                if (!string.IsNullOrEmpty(outputPdfPath))
                    File.WriteAllBytes(outputPdfPath, fileBytes); // save to file
                if (!string.IsNullOrEmpty(outputTextPath))
                    File.WriteAllText(outputTextPath, sb.ToString());
            }

            IReadOnlyList<Word> GetWordsInBlock(TextBlock block, IEnumerable<Word> words)
            {
                var rect = block.BoundingBox;
                // enlarge rect to prevent mathmatical computing rounding error problem
                var bbx = new PdfRectangle(rect.Left - 1, rect.Bottom - 1, rect.Right + 1, rect.Top + 1);
                return words.Where(w=>bbx.Contains(w.BoundingBox)).ToList();
            }

            IReadOnlyList<TextLine> GetLines(IReadOnlyList<Word> words, string wordSeparator, double tolerance=0)
            {
                var lines = new List<TextLine>();
                if (words == null || words.Count == 0) return lines;
                // order words by BoundingBox.Top descending
                var orderedWords = words.Where(w=>!string.IsNullOrEmpty(w.Text)).OrderByDescending(w=>w.BoundingBox.Top).ToList();

                // cluster words if words BoundingBox is overlaping in Y projection
                double lineBottom = 0;
                // use lineWords to collect words as a line cluster
                var lineWords = new List<Word>();
                lineBottom = orderedWords[0].BoundingBox.Bottom;
                lineWords.Add(orderedWords[0]);
                for (int i = 1; i < orderedWords.Count; i++)
                {
                    var bbx = orderedWords[i].BoundingBox;
                        if (bbx.Top >= lineBottom + tolerance)  // user use tolerance to control overlapping status
                        {
                            lineWords.Add(orderedWords[i]);
                            lineBottom = Math.Min(lineBottom, bbx.Bottom);
                        }
                        else
                        {
                            if (lineWords.Count>0 && !string.IsNullOrEmpty(lineWords[0].Text.Trim()))
                                lines.Add(new TextLine(lineWords.OrderByReadingOrder(), wordSeparator));
                            lineWords.Clear();
                            lineWords.Add(orderedWords[i]);
                            lineBottom = bbx.Bottom;
                        }
                }
                if (lineWords.Count > 0 && !string.IsNullOrEmpty(lineWords[0].Text.Trim()))
                    lines.Add(new TextLine(lineWords.OrderByReadingOrder(), wordSeparator));
                return lines;
            }
        }
davebrokit commented 6 days ago

Ah cool. You're checking overlap in Y projection.

It's got some similarity with UnsupervisedReadingOrderDetector. That uses spatial reasoning using Allen’s interval relations to determine a reading order. That is complex to handle more situations and doesn't split into lines.

https://github.com/UglyToad/PdfPig/blob/14e7024545d19dbf2b293f3d8d00ee507869961e/src/UglyToad.PdfPig.DocumentLayoutAnalysis/ReadingOrderDetector/UnsupervisedReadingOrderDetector.cs#L72

I'll have a think about it and get back to you.

p.s. Just for clarity I'm not proposing we change UnsupervisedReadingOrderDetector.

davebrokit commented 4 days ago

@hjy1210 Please review the PR #862

I decided against implementing your solution as it's a more complex implementation and it feels if you need it the tools like Docstrum and UnsupervisedReadingOrderDetector are probably where you should be heading. The tolerance parameter fixes for this case and there is only a few edge cases that it won't cover but it's implementation is very simple and doesn't impact the API. Nothing against your implementation. It's a pretty good implementation.

I have however added a ILineSegmenter interface and a line segmenter function you can pass as part of the options to RecurisveXYCut and DefaultPageSegmenter to control your own segmentation