UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0
1.72k stars 241 forks source link

PdfDocumentBuilder creates broken annotation elements when copying pages from specific source PDFs #878

Closed cremor closed 1 month ago

cremor commented 2 months ago

I have an application that uses PdfPig to merge multiple PDF files. I'm using PdfDocumentBuilder.AddPage for this. The users of the application have reported a few cases where the created (merged) PDF has missing elements (mostly annotations).

Depending on the PDF viewer application the elements are either:

Sample input file: Input.pdf

Sample code:

string inputFile = @"C:\Data\Input.pdf";
string outputFile = @"C:\Data\Output.pdf";

using var targetStream = File.Open(outputFile, FileMode.Create, FileAccess.Write);
using var outputDocument = new PdfDocumentBuilder(targetStream);
using var inputDocument = PdfDocument.Open(inputFile);

for (int i = 1; i <= inputDocument.NumberOfPages; i++)
{
    outputDocument.AddPage(inputDocument, i);
}

I've tested multiple versions of PdfPig and here are the results:

Input: grafik

Output when shown in Acrobat Reader: grafik

Ouput when shown in Microsoft Edge: grafik

cremor commented 2 months ago

@BobLd I saw that you provided a few fixes for the UglyToad.PdfPig.Writer namespace in the past. Could you please take a look at this issue? Is there anything else I can provide?

cremor commented 1 month ago

@BobLd @EliotJones I really hate to nag you again with this, but my client is asking repeatedly for a solution. Do you have any insight in this issue?

I'd debug it myself, but since I've never worked with the PdfPig source code before and I also have no idea about the PDF spec I'm lost on where to start. If you don't have time to look into this, do you maybe instead have a pointer for me in which classes I could check for changes that happened after 0.1.8 which could have broken this?

BobLd commented 1 month ago

@cremor I understand the frustration here but only have limited time in the coming weeks to work on that.

I did a quick investigation and I think I found the root cause (you might want to continue the investigation to understand where in the code the fix needs to be done).

I manually fixed the document by replacing the scientific notation number 1E-08 by 0.00000001 - see diff below using WinMerge: image

The fix should be located where the matrix is transformed into text

<</Type /XObject /Subtype /Form /BBox [ 636.334961 397.849976 680.334961 441.849976 ] /Resources <<>>/Matrix [ **1E-08** -1 1 **1E-08** 238.484985 1078.18494 ] /Length 277 /Filter /FlateDecode >>

cremor commented 1 month ago

@BobLd Thanks! I think this might have been broken by your PR #766 which changed some decimal values to double. I think that PR might be the cause because - according to the documentation - the "G" format specifier uses:

Fixed-point notation is used if the exponent that would result from expressing the number in scientific notation is greater than -5 and less than the precision specifier; otherwise, scientific notation is used. [...] However, if the number is a Decimal and the precision specifier is omitted, fixed-point notation is always used and trailing zeros are preserved.

I thought that the problem is the method OperationWriteHelper.WriteDouble because writing those former decimal, now double, values uses (used, see below) the "G" format specifier. But when I modify OperationWriteHelper.WriteDouble to cast value to decimal before formatting it with the old code then the bug still happens. Isn't this the place where this number is written to the stream?

The current code in OperationWriteHelper.WriteDouble doesn't use the "G" format specifier directly any more, but uses Utf8Formatter.TryFormat instead. There is an overload that takes a StandardFormat, but with "F" that also needs a precision to actually write all needed decimal places. And then it also needs a bigger buffer. I don't know what the maximum could be for both precision and buffer length.

Btw, a similar bug might happen in SetFontAndSize.ToString because your PR also changed Size in that class from decimal to double and it also uses the "G" format specifier. There might be more possible bugs in changed code, but that's one I noticed.

BobLd commented 1 month ago

@cremor thanks a lot for the feedback. I'm trying to understand where the change exactly, and it's not very obvious. I tried StandardFormat with 'G' and it still returns "1E-08"

I'll continue to have a look but feel free to give more feedback

EDIT: issue seems to come from Utf8Formatter.TryFormat(...) rather than the switch from decimal to double

EDIT2: https://stackoverflow.com/questions/66309150/how-do-i-write-a-float-like-82-0-with-the-0-intact-using-utf8jsonwriter

EDIT3: It seem in fact decimal to double is the reason, using Utf8Formatter.TryFormat(Convert.ToDecimal(value), buffer, out int bytesWritten); works

BobLd commented 1 month ago

@cremor if you want to have a first look, https://github.com/BobLd/PdfPig/tree/issue-878

cremor commented 1 month ago

@BobLd It looks like this fixes the issue, thanks!

A small comment about the code: I don't know if it matters, but 512 bytes for the span seems a bit much for normal use cases. The documentation says "If the method fails, iteratively increase the size of the buffer and retry until it succeeds.". Maybe start with the old value of 32 (then the comment would also still be correct) and only increase if necessary?

BobLd commented 1 month ago

@cremor that's great! Yes 512 is too big for what we do indeed. Can you share where in the documentation they advise that?

cremor commented 1 month ago

At the bottom of the remarks section of https://learn.microsoft.com/en-us/dotnet/api/system.buffers.text.utf8formatter.tryformat?view=net-8.0#system-buffers-text-utf8formatter-tryformat(system-double-system-span((system-byte))-system-int32@-system-buffers-standardformat)

cremor commented 1 month ago

@BobLd I really like your fix in https://github.com/BobLd/PdfPig/commit/1f0224c2affaf84d1b0014a8384412c207cba76a!

About the remaining TODO: How about falling back to the "slow" version of stream.WriteText(value.ToString("F17", CultureInfo.InvariantCulture));?

BobLd commented 1 month ago

@cremor thanks a lot! 100% agreed on the TODO, this is what I'm going to add.

Issue with my fix is that other tests now fail... we have too much precision compared to what we had before. I need to look into that