UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

Results of the textblock extracted from a PDF vary depending on the operating system. #694

Open ggaebee opened 1 year ago

ggaebee commented 1 year ago

I built a project that includes the following source code in the Windows .NET 7.0 environment.

using (PdfDocument doc = PdfDocument.Open(bytes))
{
    IEnumerable<Page> pages = doc.GetPages();
    for (int pageNo = StartIndex > 1 ? StartIndex : 1; pageNo <= doc.NumberOfPages; pageNo++)
    {
        Page page = doc.GetPage(pageNo);
        IEnumerable<Word> words = page.GetWords();
        RecursiveXYCut.RecursiveXYCutOptions recursiveXYOpt = new RecursiveXYCut.RecursiveXYCutOptions();
        RecursiveXYCut recursiveXYCut = new RecursiveXYCut(recursiveXYOpt);
        IReadOnlyList<TextBlock> textBlocks = recursiveXYCut.GetBlocks(words);

        foreach (TextBlock textBlock in textBlocks)
        {
            TextBlock2Json(textBlock);
        }
    }
}

Also, I built it for the linux-x64 environment using the command:

Command
> dotnet publish -r linux-x64

Here is the version information.

PdfPig version : 0.1.8
Windows .NET SDK version : 7.0
Linux .NET SDK version : 7.0

I ran both builds on the same input PDF file and compared the results.

Windows Results
...
{
  "PAGE": 1,
  "SENTENCE": "Page 1 of 26",
  "WIDTH": 5.144824218749989,
  "HEIGHT": 6.753515625000006
}
...
Linux Results
...
{
  "PAGE": 1,
  "SENTENCE": "Page 1 of 26",
  "WIDTH": 26.193600000000004,
  "HEIGHT": 9.271799999999999
}
...

WIDTH and HEIGHT are taken from the first letter of the first word in the block's first text line: WIDTH is textBlock.TextLines.First().Words.First().Letters.First().GlyphRectangle.Width and HEIGHT is textBlock.TextLines.First().Words.First().Letters.First().GlyphRectangle.Height.

The results differ regardless of which PDF file is used as input. Why do Windows and Linux produce different values?
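To make the comparison easier to reproduce, the per-letter values described above could be dumped directly, without the layout step. The following is only a minimal sketch: the GlyphDump class, the command-line argument, and the choice of the first 20 letters of page 1 are illustrative assumptions and not part of the original project. It prints each letter with its font name, point size, and GlyphRectangle dimensions using invariant-culture formatting, so the output of the Windows and Linux builds can be diffed line by line.

```csharp
using System;
using System.IO;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

// Hypothetical diagnostic helper (not from the original project).
static class GlyphDump
{
    static void Main(string[] args)
    {
        // Assumption: the path of the PDF under test is passed as the first argument.
        byte[] bytes = File.ReadAllBytes(args[0]);

        using (PdfDocument doc = PdfDocument.Open(bytes))
        {
            Page page = doc.GetPage(1);

            // Print the first 20 letters of page 1, one per line.
            foreach (Letter letter in page.Letters.Take(20))
            {
                // Invariant culture avoids locale-dependent decimal separators,
                // so the two outputs can be compared as plain text.
                Console.WriteLine(FormattableString.Invariant(
                    $"{letter.Value}\t{letter.FontName}\t{letter.PointSize:R}\t{letter.GlyphRectangle.Width:R}\t{letter.GlyphRectangle.Height:R}"));
            }
        }
    }
}
```

Running the same build of this sketch on both systems and diffing the two outputs would show whether the discrepancy affects every letter or only letters from particular fonts.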

BobLd commented 1 year ago

@ggaebee can you test with the latest pre-release package and check the issue is still there?

#686 might have changed the behaviour.

ggaebee commented 1 year ago

@BobLd After testing with the latest pre-release package, the extreme discrepancy in the values is gone, but the values still differ between the two systems. Could you look into this?

Windows Results
{
  "PAGE": 1,
  "SENTENCE": "Page 1 of 26",
  "WIDTH": 5.144824218749989,
  "HEIGHT": 6.753515625000006
}
Linux Results
{
  "PAGE": 1,
  "SENTENCE": "Page 1 of 26",
  "WIDTH": 4.970507812499989,
  "HEIGHT": 6.678808593750006
}