empira / PDFsharp

PDFsharp and MigraDoc Foundation for .NET 6 and .NET Framework
https://docs.pdfsharp.net/

Proposal: Incremental saving #112

Open packdat opened 5 months ago

packdat commented 5 months ago

I was recently tasked with evaluating whether existing documents can be "stamped". A "stamp" is literally an image of an actual stamp that should be added to specific pages of a document. The problem is that the document may be signed, so the stamp has to be added non-destructively to keep the signature intact.

I started to hack around and was able to come up with something that seems to work.

The idea was to just track changes to arrays (PdfArray) and dictionaries (PdfDictionary). All other objects are basically immutable so this approach should work in theory. Also, a new PdfDocumentOpenMode was added, namely Append. When a document is opened in this mode, it starts to track changes to arrays and dictionaries. When saving the document, only changed/added objects are saved; the changes are appended to the existing document.
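
A minimal sketch of the tracking idea, written against hypothetical stand-in types rather than PDFsharp's own classes: `TrackedDictionary` and `ChangeTracker` are names invented for illustration, and the actual change may be structured quite differently. The point it shows is that tracking stays off while the file is being parsed and only real changes made afterwards end up in the set of objects to append.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a PDF dictionary; NOT PDFsharp's PdfDictionary.
public sealed class TrackedDictionary
{
    private readonly Dictionary<string, string> _items = new();
    private readonly Action<TrackedDictionary> _onModified;

    public TrackedDictionary(Action<TrackedDictionary> onModified) => _onModified = onModified;

    public string? Get(string key) => _items.TryGetValue(key, out var value) ? value : null;

    public void Set(string key, string value)
    {
        // Writing the value that is already stored is not a modification.
        if (_items.TryGetValue(key, out var old) && old == value)
            return;
        _items[key] = value;
        _onModified(this);
    }
}

// Hypothetical tracker that collects the objects an incremental save would write.
public sealed class ChangeTracker
{
    private readonly HashSet<TrackedDictionary> _modified = new();

    // Assumed to be switched on only after the document was opened in Append mode.
    public bool Enabled { get; set; }

    public IReadOnlyCollection<TrackedDictionary> Modified => _modified;

    public void MarkModified(TrackedDictionary obj)
    {
        if (Enabled)
            _modified.Add(obj);
    }
}
```

With this shape, a reader would populate the dictionaries while `Enabled` is false, set it to true once parsing has finished, and at save time write only the members of `Modified`.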

Basic code (taken from the included test-case):

    // The stream must be opened with ReadWrite access.
    using var fs = File.Open(targetFile, FileMode.Open, FileAccess.ReadWrite);
    var doc = PdfReader.Open(fs, PdfDocumentOpenMode.Append);

    // Modify the document, e.g. add content.
    var page = doc.Pages[0];
    using var gfx = XGraphics.FromPdfPage(page);
    gfx.DrawString("I was added", new XFont("Arial", 16), new XSolidBrush(XColors.Red), 40, 40);

    // Append the changes to the document.
    doc.Save(fs, true);

More may be needed for this to work consistently (e.g. I haven't tested with encrypted documents yet, as I was told the documents I have to work with will not be encrypted). This change also does not handle the case where objects were deleted from a document. These objects would need to be tracked separately, as they need special entries in the new XREF table.
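
For context on those special entries: in the classic cross-reference table format described by the PDF specification, each entry is exactly 20 bytes long, and a deleted object is written as a free entry (keyword `f`) instead of an in-use entry (keyword `n`). The sketch below only formats such entries. `XrefSketch` and its methods are hypothetical helpers, not PDFsharp API, and it says nothing about how this proposal writes its table.

```csharp
using System.Globalization;

// Hypothetical helper for illustration; NOT part of PDFsharp.
public static class XrefSketch
{
    // In-use entry: 10-digit byte offset, 5-digit generation number, keyword 'n'.
    public static string InUse(long byteOffset, int generation) =>
        string.Format(CultureInfo.InvariantCulture, "{0:D10} {1:D5} n \n", byteOffset, generation);

    // Free entry: 10-digit object number of the next free object,
    // 5-digit generation number to use if this object number is reused, keyword 'f'.
    public static string Free(int nextFreeObjectNumber, int nextGeneration) =>
        string.Format(CultureInfo.InvariantCulture, "{0:D10} {1:D5} f \n", nextFreeObjectNumber, nextGeneration);
}
```

As far as the specification goes, deleting an object with generation 0 would produce a free entry with generation 1, linked into the free list that starts at object 0. The trailer of the appended section then points back to the previous table through its `/Prev` key.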

One potential issue I spotted is that the library modifies the document when certain properties are merely read, which accidentally marks those objects as "modified" even though nothing was changed. One example is the *Box properties of PdfPage (e.g. TrimBox, CropBox, ...): if you read one of these properties and the document does not already contain a value for it, a new value is added to the underlying dictionary.

I haven't looked too deeply, but I expect there are more cases like that. I have changed the *Box properties to return PdfRectangle.Empty when there is no value, instead of adding a new value.
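
A minimal sketch of that getter pattern, using hypothetical stand-in types (`Rect` and `PageSketch` are invented here and are not PDFsharp's PdfRectangle or PdfPage): the getter falls back to an empty value without touching the backing store, so only the setter counts as a write.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for a rectangle value; NOT PDFsharp's PdfRectangle.
public readonly record struct Rect(double X, double Y, double Width, double Height)
{
    public static readonly Rect Empty = new(0, 0, 0, 0);

    public bool IsEmpty => Width == 0 && Height == 0;
}

// Hypothetical stand-in for a page; NOT PDFsharp's PdfPage.
public sealed class PageSketch
{
    private readonly Dictionary<string, Rect> _elements = new();

    // Counts writes to the backing dictionary, as a change tracker would observe them.
    public int WriteCount { get; private set; }

    public Rect CropBox
    {
        // Reading never inserts a default entry into the dictionary.
        get => _elements.TryGetValue("/CropBox", out var box) ? box : Rect.Empty;
        set
        {
            _elements["/CropBox"] = value;
            WriteCount++;
        }
    }
}
```

Callers that previously relied on the getter creating a default entry would have to check `IsEmpty` and assign a value explicitly.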

There is also the case with type-transformations (e.g. exchanging a PdfDictionary with a more specific type like PdfPage). These transformations happen "under the hood" and would normally also cause objects to be marked as modified. I tried to prevent that by temporarily ignoring changes while doing the type-transformations by using the new method

     PdfCrossReferenceTable.IgnoreModify(Action action)

This is quite hackish. Maybe you have better ideas on how to tackle this?
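
One possible shape for such a suppression scope, as a sketch only: `ModificationGate` is a hypothetical class, and the real PdfCrossReferenceTable.IgnoreModify may be implemented differently. It uses a depth counter so that nested calls work, restores the state in a `finally` block so an exception inside the action cannot leave tracking switched off, and adds a disposable variant as an alternative to passing a delegate.

```csharp
using System;

// Hypothetical sketch; NOT the implementation used in the proposal.
public sealed class ModificationGate
{
    private int _ignoreDepth;

    // Change tracking is active only while no suppression scope is open.
    public bool IsTracking => _ignoreDepth == 0;

    // Action-based variant, mirroring the signature shown above.
    public void IgnoreModify(Action action)
    {
        _ignoreDepth++;
        try
        {
            action();
        }
        finally
        {
            _ignoreDepth--;
        }
    }

    // Disposable variant, intended for `using (gate.Suppress()) { ... }`.
    public IDisposable Suppress()
    {
        _ignoreDepth++;
        return new Scope(this);
    }

    private sealed class Scope : IDisposable
    {
        private ModificationGate? _gate;

        public Scope(ModificationGate gate) => _gate = gate;

        public void Dispose()
        {
            // Guard against double disposal so the counter cannot go negative.
            if (_gate is null)
                return;
            _gate._ignoreDepth--;
            _gate = null;
        }
    }
}
```

A type transformation would then run inside `IgnoreModify(...)` or a `using` scope, and the code that marks objects as modified would check `IsTracking` first.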

packdat commented 2 months ago

Deleted objects are now handled when saving incrementally.

Note: Still untested with encrypted documents due to a lack of time.

TM-Atharva commented 1 day ago

Hey @packdat, can you share example code for stamping multiple pages and for adding multiple signatures/stamps to the same page?

Or please guide me on how to achieve that.