Open hjy1210 opened 3 weeks ago
Looks like it would be reading order detector causing that. That is what orders the text.
Have you tried using reading order detector with a higher T?
var readingOrder = new UnsupervisedReadingOrderDetector(10, UnsupervisedReadingOrderDetector.SpatialReasoningRules.RowWise);
See https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis#unsupervised-reading-order-detector
@davebrokit
The code use RecursiveXYCut with MinimumWidth = page.Width, so there is single TextBlock in each page. I guess this issue is noting to do with reading order of TextBlocks.
If, after statement var letters = page.Letters;
add
foreach (var letter in letters)
{
Console.WriteLine($"{letter}");
}
We can get following informations on console:
...
1 (x:63.864, y:672.67) TimesNewRomanPSMT 11.04
. (x:69.624, y:672.67) TimesNewRomanPSMT 11.04
(x:72.744, y:672.67) TimesNewRomanPSMT 11.04
新 (x:82.344, y:673.15) BCDHEE+PMingLiU 11.04
北 (x:93.84767999999998, y:673.15) BCDHEE+PMingLiU 11.04
市 (x:105.35135999999999, y:673.15) BCDHEE+PMingLiU 11.04
...
we can see the y coordinate of label 1
is 672.67, but the y coordinate of first letter of item content 新
is 673.15.
Because 672.67 is 0.48 points lower than 673.15, so the TextLine of 1.
is followed after TextLine of 新北市...
when XYLeaf.GetLines collect lines.
I guess may be this issue is caused by different fonts:
the font of 1.
is TimesNewRomanPSMT, the font of 新北市...
is BCDHEE+PMingLiU
That is why I use the title XYLeaf.GetLines collect lines not robust enough
Ah understood. Your code will only produce one block.
It's not the responsibilty of RecursiveXYCut to interpret the reading order of the text. It reads text in effectively a rendering order so depending on how your pdf is rendered it's results will vary. Aka it's not reliable for determing text order.
However, it is the responsibilty of Reading Order Detector to interpert the reading order of the text. Unfortunately it can't do that when there is just one block for the page.
What you need to do is chop up the page into smaller text blocks and then use reading order detector to order the blocks correctly. The code you have here:
// 3. Postprocessing
var readingOrder = UnsupervisedReadingOrderDetector.Instance;
var orderedTextBlocks = readingOrder.Get(textBlocks);
Example of choping up into smaller blocks:
var recursiveXYCut = new RecursiveXYCut(new RecursiveXYCut.RecursiveXYCutOptions()
{
DominantFontWidthFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Width) * 2,
DominantFontHeightFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Height)/2
});
Result
You're doing Document layout analysis. The wiki here has a good guide about it: https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis
Please note there was a fix to UnsupervisedReadingOrderDetector recently so you need to use the nightly build
see #842
@davebrokit
Even use your options to choping into small text blocks as below
Now the item label
1.
and first line of item conetent 新北市...
are in the 9-th block of page 1. the extracted text is STILL as follows:
新北市金瓜石的國際終戰和平紀念園區,立有一面紀念牆,上面刻有第二次世界大戰時拘
1.
留在此的盟軍戰俘姓名與國籍。這些戰俘中,許多是紐西蘭及加拿大軍人。他們的國家並
未遭到日軍攻擊,卻仍參戰,其原因為何?
Note: The label 1.
is still followed the first line of item content 新北市...
.
PS: I use PdfPig(0.1.9-alpha-20240612-d2cae)
I think UnsupervisedReadingOrderDetector is about order between-textblocks, but this issue is about within-textblocks.
Try cutting into smaller blocks so the 1.
is in a block by itself.
This might work: DominantFontWidthFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Width)
or try l.GlyphRectangle.Width/2
var recursiveXYCut = new RecursiveXYCut(new RecursiveXYCut.RecursiveXYCutOptions()
{
DominantFontWidthFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Width),
DominantFontHeightFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Height)/2
});
@davebrokit
cutting into smaller blocks so the 1. is in a block by itself.
will make full page with too many small textblocks. Every text in Textblock ends with newline, causing 1.
and 新北市...` NOT in the same line, then the shape of extracted text in one page VERY not Similar to the original shape of pdf. I want keep the shape of extracted text similar to the original pdf.
@BobLd Please chime in on this discussion. He wrote the RecursiveXYCut
code
Your use of RecursiveXYCut
was confusing the issue for me. As you are only getting one block it doesn't really make sense to use it in your example.
Here is the mimimum code I got to get your example working. RecursiveXYCut
is not used and I do GroupBy(x => (int) x.BoundingBox.Bottom/5)
which works in this case but is not very robust around the boundaries of the groups (aka you can still have your problem where there is a 0.001 difference between 2 coordinates and they fall into two different buckets)
var letters = page.Letters; // no preprocessing
// 1. Extract words
var wordExtractor = NearestNeighbourWordExtractor.Instance;
var words = wordExtractor.GetWords(letters);
// 2. Segment page into lines
var lines = words.GroupBy(x => (int) x.BoundingBox.Bottom/5)
.Select(x => new TextLine(x.OrderByReadingOrder()))
.OrderByReadingOrder();
var cleanedTextBlocks = new List<TextBlock>() { new TextBlock(lines) };
// 3. Postprocessing. Not sure it's necessary
var orderedTextBlocks = readingOrder.Get(cleanedTextBlocks);
RecursiveXYCut
is a general algorithm designed to cut the page into blocks. Order within the block is not guaranteed. A naive and fast implementation of ordering into lines within a Block has been implemented for our convience. This works with the general case.
words.GroupBy(x => (int) x.BoundingBox.Bottom/tolerance)
where tolerance
is a parameter passed through. But you still have issues with lines where the points used fall on the boundaries and end up in two different buckets.
RecursiveXYCut
on the blocks to cut into single lines. Then reordering the words within the line. But at which point your using RecursiveXYCut
on the results of a RecursiveXYCut
. I would argue that this is too robust and inefficent for the general use casevar letters = page.Letters; // no preprocessing
// 1. Extract words
var wordExtractor = NearestNeighbourWordExtractor.Instance;
var words = wordExtractor.GetWords(letters);
// 2. Segment page into blocks
var pageRecursiveXYCut = new RecursiveXYCut(new RecursiveXYCut.RecursiveXYCutOptions()
{
MinimumWidth = page.Width
});
var textBlocks = pageRecursiveXYCut.GetBlocks(words);
// 2. Segment blocks into lines
var lineRecursiveXYCut = new RecursiveXYCut(new RecursiveXYCut.RecursiveXYCutOptions()
{
DominantFontHeightFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Height) / 6,
DominantFontWidthFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Width) * 60,
//MinimumWidth = page.Width - This does work in this case. We have to use Dominant Font Width to handle it
});
var cleanedTextBlocks = new List<TextBlock>();
foreach (var block in textBlocks)
{
// 2.1 Segmenet into lines
var resLines = new List<TextLine>();
var wordsInBlock = block.TextLines.SelectMany(x => x.Words);
var lines = lineRecursiveXYCut.GetBlocks(wordsInBlock);
// 2.2 Reorder text in lines
foreach (var line in lines)
{
var blockWords = line.TextLines.SelectMany(x => x.Words).ToArray();
var orderedWordsInLine = new TextLine(blockWords.OrderByReadingOrder());
resLines.Add(orderedWordsInLine);
}
var newBlock = new TextBlock(resLines.OrderByReadingOrder());
cleanedTextBlocks.Add(newBlock);
}
// 3. Postprocessing
var orderedTextBlocks = readingOrder.Get(cleanedTextBlocks);
Please let me know if you have other solutions to the problem
I would argue that there is value in doing
words.GroupBy(x => (int) x.BoundingBox.Bottom/tolerance)
where tolerance
is a parameter passed through from RecursiveXYCutOptions
If @BobLd is in agreement do you fancy raising a PR for that :). You would be a legend!
Please let me know if I've missed something or you have a solution to the problem :)
p.s. We might also need more general/user friendly functions such as a line segmenter. The library has the building blocks but right now you need to stitch them together yourself when it falls outside of the general cases.
Yes, BobLd's solution "A robust implementation would require using RecursiveXYCut on the blocks to cut into single lines. Then reordering the words within the line.
" is what I want.
I don't think that's a good solution due to it's inefficency and you're effectively using RecursiveXYCut on it's self. It's also something you can implement yourself for your specific case but it would be overkill for most people in most cases.
But we'll wait for wiser heads like @BobLd to chime in :)
@davebrokit I agree with you, RecursiveXYCut
should not be used on itself and should only do the segmentation at block level as intented.
In order to fix the issue raised here, I think we need to give the possibility to end users to provide their own OrderByReadingOrder()
- which I think is the culprit here (since even when the line is correctly build, the digit is not ordered correctly).
I did not allow for that when first wrote these classes as to keep to development simple - but now is a good time to think about how to expend.
On a different note, adding an ILineSegmenter
class might not be as easy as you'd think because some algos like Docstrum
are bottom up, and other like RecursiveXYCut
are top down. But more than happy to discuss that on a different issue if you have ideas you'd wnat to share
@BobLd Thanks for the response
I agree with the need to give the possibility to end users to provide their own OrderByReadingOrder()
but that would not fix the issue raised here as the problem comes from the line segmenter words.GroupBy(x => x.BoundingBox.Bottom)
. In this issue BoundingBox.Bottom
of the 1.
and the rest of the words on the line have a slight difference which means they are treated as 2 different lines.
And totally agree that adding a general ILineSegmenter
is not easy. I've tried :). What works tends to be specific to the pdf document.
What I propose we do is:
Allow users to pass their own reading order detector as you suggested
Add a int line segmentation tolerance
parameter to RecursiveXYCutOptions
. When provided we use this line segmenter instead: words.GroupBy(x => (int) x.BoundingBox.Bottom/tolerance)
(Note the use of integer rounding to allow minor variances in BoundingBox.Bottom
to be ignored. It won't break existing code and tests as it only gets used when specified).
Allow users to pass their own line segmenter function to use (would default to 1., this allows you to override it if you need more control). Func<IEnumerable<Word>, IEnumerable<TextLine>>
. This is optional so happy to leave out if you don't agree.
I can make those code changes if you're happy with that
Instead of words.GroupBy(x => x.BoundingBox.Bottom)
, may be we can
criterion:range length of BoundingBox.Bottom of word in group <= T
, where T is a nonnegative tolerance parameter.Note: T=0 is the original implementation
@hjy1210 Thanks for that suggestion.
Step 2. is actually hard to implement so if you have code that can do that please share. You effectively need to cluster together words and that is what RecursiveXYCut
and Docstrum
algorithms try to solve :).
@davebrokit happy for you to create a PR for
Add a int line segmentation tolerance parameter to RecursiveXYCutOptions.
@davebrokit
I use IReadOnlyList<TextLine> GetLines(IReadOnlyList<Word> words, string wordSeparator, double tolerance=0)
to construct lines in TextBlock to demonstrate how to modify XYLeaf's GetLines. Because TextBlock have no property Words
as XYLeaf, I use IReadOnlyList<Word> GetWordsInBlock(TextBlock block, IEnumerable<Word> words)
to get words contained in TextBlock.
Following code is my demo:
[Test]
public void TestPigTextLineIssue6()
{
/// https://github.com/UglyToad/PdfPig/issues/844
string sourcePdfPath = "history-2-3.pdf";
string outputPdfPath = "history-2-3-pigline6boxed.pdf";
string outputTextPath = "history-2-3-pigline6boxed.txt";
using (var document = PdfDocument.Open(sourcePdfPath))
{
var builder = new PdfDocumentBuilder { };
PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);
StringBuilder sb = new StringBuilder();
for (var pageNumber = 1; pageNumber <= document.NumberOfPages; pageNumber++)
{
var pageBuilder = builder.AddPage(document, pageNumber);
pageBuilder.SetStrokeColor(128, 128, 128);
var page = document.GetPage(pageNumber);
var letters = page.Letters; // no preprocessing
// 1. Extract words
var wordExtractor = NearestNeighbourWordExtractor.Instance;
var words = wordExtractor.GetWords(letters); // words may contains word with single space
foreach (var word in words)
{
pageBuilder.DrawRectangle(word.BoundingBox.BottomLeft, word.BoundingBox.Width, word.BoundingBox.Height);
}
// 2. Segment page
var recursiveXYCut = new RecursiveXYCut(new RecursiveXYCut.RecursiveXYCutOptions()
{
MinimumWidth = page.Width
});
var textBlocks = recursiveXYCut.GetBlocks(words);
// 3. Postprocessing
var readingOrder = UnsupervisedReadingOrderDetector.Instance;
var orderedTextBlocks = readingOrder.Get(textBlocks);
// 4. Add debug info - Bounding boxes and get text from textblock
foreach (var block in orderedTextBlocks)
{
var bbox = block.BoundingBox;
pageBuilder.DrawRectangle(bbox.BottomLeft, bbox.Width, bbox.Height);
pageBuilder.AddText($"{pageNumber}-{block.ReadingOrder.ToString()}", 8, bbox.TopLeft, font);
// following two line are the main code
var ws = GetWordsInBlock(block, words);
var lines = GetLines(ws, " ");
foreach (var line in lines)
{
sb.AppendLine(line.Text);
}
}
}
// 5. Write result to a file
byte[] fileBytes = builder.Build();
if (!string.IsNullOrEmpty(outputPdfPath))
File.WriteAllBytes(outputPdfPath, fileBytes); // save to file
if (!string.IsNullOrEmpty(outputTextPath))
File.WriteAllText(outputTextPath, sb.ToString());
}
IReadOnlyList<Word> GetWordsInBlock(TextBlock block, IEnumerable<Word> words)
{
var rect = block.BoundingBox;
// enlarge rect to prevent mathmatical computing rounding error problem
var bbx = new PdfRectangle(rect.Left - 1, rect.Bottom - 1, rect.Right + 1, rect.Top + 1);
return words.Where(w=>bbx.Contains(w.BoundingBox)).ToList();
}
IReadOnlyList<TextLine> GetLines(IReadOnlyList<Word> words, string wordSeparator, double tolerance=0)
{
var lines = new List<TextLine>();
if (words == null || words.Count == 0) return lines;
// order words by BoundingBox.Top descending
var orderedWords = words.Where(w=>!string.IsNullOrEmpty(w.Text)).OrderByDescending(w=>w.BoundingBox.Top).ToList();
// cluster words if words BoundingBox is overlaping in Y projection
double lineBottom = 0;
// use lineWords to collect words as a line cluster
var lineWords = new List<Word>();
lineBottom = orderedWords[0].BoundingBox.Bottom;
lineWords.Add(orderedWords[0]);
for (int i = 1; i < orderedWords.Count; i++)
{
var bbx = orderedWords[i].BoundingBox;
if (bbx.Top >= lineBottom + tolerance) // user use tolerance to control overlapping status
{
lineWords.Add(orderedWords[i]);
lineBottom = Math.Min(lineBottom, bbx.Bottom);
}
else
{
if (lineWords.Count>0 && !string.IsNullOrEmpty(lineWords[0].Text.Trim()))
lines.Add(new TextLine(lineWords.OrderByReadingOrder(), wordSeparator));
lineWords.Clear();
lineWords.Add(orderedWords[i]);
lineBottom = bbx.Bottom;
}
}
if (lineWords.Count > 0 && !string.IsNullOrEmpty(lineWords[0].Text.Trim()))
lines.Add(new TextLine(lineWords.OrderByReadingOrder(), wordSeparator));
return lines;
}
}
Ah cool. You're checking overlap in Y projection.
It's got some similarity with UnsupervisedReadingOrderDetector
. That uses spatial reasoning using Allen’s interval relations to determine a reading order. That is complex to handle more situations and doesn't split into lines.
I'll have a think about it and get back to you.
p.s. Just for clarity I'm not proposing we change UnsupervisedReadingOrderDetector
.
@hjy1210 Please review the PR #862
I decided against implementing your solution as it's a more complex implementation and it feels if you need it the tools like Docstrum
and UnsupervisedReadingOrderDetector
are probably where you should be heading. The tolerance parameter fixes for this case and there is only a few edge cases that it won't cover but it's implementation is very simple and doesn't impact the API. Nothing against your implementation. It's a pretty good implementation.
I have however added a ILineSegmenter
interface and a line segmenter function you can pass as part of the options to RecurisveXYCut
and DefaultPageSegmenter
to control your own segmentation
I tryed to extract text from history-2-3.pdf.
Part image of the history-2-3.pdf is
It is a question item with item label "1." , item label followed by question body "新北市金瓜石的國際終戰和平紀念園區...".
But the extracted text are:
Note: the item label
1.
is followed the first line新北市金瓜石的國際終戰和平紀念園區,立有一面紀念牆,上面刻有第二次世界大戰時拘
of the question content. This phenomenon will cause trouble when searching text.Following is my code:
I guess this issue is due to:
XYLeaf.GetLines
useWords.Groupy(x => x.BoundingBox.Bottom)
is not robust if words baselines/bottoms have some tiny differences.