UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

Text extracted from a PDF region varies depending on the operating system #840

Open fpisarello-dawa opened 3 months ago

fpisarello-dawa commented 3 months ago

Hi, I have a problem when I extract the text inside a rectangular area of a page with the code below.

When I run this script on Windows 10 (LINQPad):

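using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Core;      // PdfPoint, PdfRectangle
using UglyToad.PdfPig.Geometry;  // IntersectsWith extension method
using UglyToad.PdfPig.Util;      // DefaultWordExtractor

using (var document = PdfDocument.Open("FACTURA_AFIP.pdf"))
{
    // Pages are 1-indexed; this loop reads every fourth page.
    for (var i = 1; i <= document.NumberOfPages; i += 4)
    {
        // For PDF coordinates the y-axis runs from the bottom of the page up
        var bottomLeft = new PdfPoint(479, 149);
        var topRight = new PdfPoint(559, 159);
        var square = new PdfRectangle(bottomLeft, topRight);
        var page = document.GetPage(i);

        // Keep only the letters whose glyph bounding box touches the region.
        var letters = page.Letters.Where(x => square.IntersectsWith(x.GlyphRectangle)).ToList();

        var wordsInRegion = DefaultWordExtractor.Instance.GetWords(letters);
        var textInRegion = string.Join(" ", wordsInRegion.Select(x => x.Text).ToList());

        // Dump() is LINQPad's output helper.
        textInRegion.Dump();
    }
}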

Result: 72515176735833

But the same script on Linux, Ubuntu 20.04 (dotnet-script), gives: N°: 72515176735833

Why do Windows and Linux show different results?

I have uploaded the PDF file for more detail: FACTURA_AFIP.pdf

BobLd commented 3 months ago

@fpisarello-dawa I'm guessing this comes from different default fonts being used on different operating systems. I'd expect the fonts in your documents are not embedded, and PdfPig uses the OS ones to get the bounding boxes. These will differ by OS.

@EliotJones this is not the first time we have had this kind of question. I think we should try to ship default fonts like other PDF readers do, so that PdfPig always uses the same ones.

Doing so will also make it easier to write unit tests across different operating systems, as people will expect consistent results on all of them. Let me know what you think.

Also see https://askubuntu.com/questions/599915/what-is-the-closest-font-to-helvetica-available-on-ubuntu

And https://stackoverflow.com/questions/6383511/font-metrics-for-the-base-14-fonts-in-the-pdf-specification#6506818
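To check whether the explanation above applies to a given file, it can help to print the font name and glyph bounding box of each letter in the region on both machines and compare the output. The following is only a minimal sketch: it reuses the file name and coordinates from the original report, looks at page 1 only, and assumes the `Value`, `FontName` and `GlyphRectangle` properties of `Letter` as found in recent PdfPig versions.

```csharp
using System;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Core;      // PdfPoint, PdfRectangle
using UglyToad.PdfPig.Geometry;  // IntersectsWith extension method

using (var document = PdfDocument.Open("FACTURA_AFIP.pdf"))
{
    // Assumption: the region of interest is on the first page.
    var page = document.GetPage(1);

    // Same region as in the original report.
    var square = new PdfRectangle(new PdfPoint(479, 149), new PdfPoint(559, 159));

    foreach (var letter in page.Letters.Where(x => square.IntersectsWith(x.GlyphRectangle)))
    {
        var box = letter.GlyphRectangle;

        // One line per letter: its text, its font and its bounding box.
        Console.WriteLine(
            $"'{letter.Value}' font={letter.FontName} " +
            $"left={box.Left:F2} bottom={box.Bottom:F2} right={box.Right:F2} top={box.Top:F2}");
    }
}
```

If the bounding boxes of the same letters differ between Windows and Linux, the letters selected by `IntersectsWith` will differ too, which would be consistent with the extra "N°:" seen on Ubuntu.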

EliotJones commented 3 months ago

@BobLd it's a reasonable suggestion, I'm just not sure what the licensing situation for that looks like. I'd expect you need some kind of payment to redistribute most fonts from foundries.

BobLd commented 3 months ago

@EliotJones you nailed the main issue with fonts... I'll get back to you with fonts that have a license compatible with the project. Let's see then what's doable.

BobLd commented 3 months ago

Looking at the table below, we have open source equivalents (table from https://wiki.archlinux.org/title/Metric-compatible_fonts).

[image: table of metric-compatible fonts]

Liberation fonts are available under the SIL Open Font License, Version 1.1, which, from what I understand, is about as open source as you can get for a font; see https://github.com/liberationfonts/liberation-fonts/tree/main/src

Using Liberation fonts, we cover 12 of the 14 base fonts (we are missing Symbol and ZapfDingbats). I'll look into the rest. Also, the Liberation fonts are already referenced in the SystemFontFinder class.

BobLd commented 3 months ago

Symbol font: https://github.com/powerline/fonts/tree/master/SymbolNeu (Apache License, Version 2.0)

fpisarello-dawa commented 2 months ago

@BobLd thanks for the response. I installed the Helvetica font on the Linux server and got the same behavior. Do I need to install another font on the server to get the same result as on Windows?