empira / PDFsharp

PDFsharp and MigraDoc Foundation for .NET 6 and .NET Framework
https://docs.pdfsharp.net/

Deleted page not "really" deleted #141

Closed by packdat 2 months ago

packdat commented 2 months ago

While working on incremental updates (see #112) and adding support for deleted objects, I encountered a behavior that may or may not be intended.

When I delete a page from a document, it gets removed from the pages-array as expected. When checking the output file, however, I observed that only the page reference was removed from the pages-array; the page itself and all objects it references (e.g. content streams) are still present in the file.

If I understand correctly, the method PdfCrossReferenceTable.Compact() is intended to clean up these objects. Is that true? At least, it would remove the page and the objects referenced by that page if the pages-array were the only place where the page is referenced. But a page could be referenced from multiple locations, some places that come to mind:

In my case, the page that was not deleted was referenced by (at least) 3 different outlines.

Simple test-case (add it to PdfSharp.Tests.IO.WriterTests):

[Fact]
public void Deleted_Page_Not_Really_Deleted()
{
    var sourceFile = IOUtility.GetAssetsPath("archives/grammar-by-example/GBE/ReferencePDFs/WPF 1.31/Table-Layout.pdf")!;
    var targetFile = Path.Combine(Path.GetTempPath(), "AA-Original.pdf");
    File.Copy(sourceFile, targetFile, true);

    using var fs = File.Open(targetFile, FileMode.Open, FileAccess.Read);
    using var doc = PdfReader.Open(fs, PdfDocumentOpenMode.Modify);
    doc.Pages.RemoveAt(0);

    targetFile = Path.Combine(Path.GetTempPath(), "AA-Deleted.pdf");
    doc.Save(targetFile);
}

Open the file AA-Deleted.pdf and observe that the page and its contents are still present.

Question: Is this the intended behavior? Are there other clean-up methods I'm not aware of?

IMHO the methods to remove pages are "high level" methods and the library should take care of the "low level" stuff, including cleaning up after itself to maintain the integrity of the document.

I do understand, however, that this might not be an easy issue to solve. In theory, the library has to scan the whole document to find references to deleted pages and then has to decide, based on the context in which each reference is found, how to deal with it.

StLange commented 2 months ago

Your observations are correct. The situation is difficult. You can remove a page, add some new pages, and then re-add the removed page at a different position. The only point in time when an object clean-up makes sense is immediately before saving the document. PDFsharp starts with the catalog table and calculates the transitive closure of all referenced objects. Unreferenced objects are removed. This approach works, but it is too simple. If, for example, a font and an image are used only in the content stream of a particular page and you delete the page, both the font and the image remain in the resource tables of the document.
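The clean-up described above is essentially a reachability pass over the object graph. The following is a minimal sketch of that idea, assuming a toy model in which each object ID maps to the IDs it references; the names ReachabilitySketch, Reachable, and Unreachable are hypothetical and are not PDFsharp's internal API.

```csharp
using System.Collections.Generic;

// Toy model, NOT PDFsharp internals: each object ID maps to the IDs it references.
public static class ReachabilitySketch
{
    // Returns every object ID reachable from rootId by following references
    // (the transitive closure starting at the root).
    public static HashSet<int> Reachable(
        IReadOnlyDictionary<int, int[]> references, int rootId)
    {
        var visited = new HashSet<int>();
        var pending = new Stack<int>();
        pending.Push(rootId);

        while (pending.Count > 0)
        {
            int id = pending.Pop();
            if (!visited.Add(id))
                continue; // already processed, also guards against cycles

            if (references.TryGetValue(id, out var targets))
            {
                foreach (int target in targets)
                    pending.Push(target);
            }
        }
        return visited;
    }

    // Returns the IDs a clean-up pass would drop: everything not reachable
    // from the root, in ascending order.
    public static List<int> Unreachable(
        IReadOnlyDictionary<int, int[]> references, int rootId)
    {
        var live = Reachable(references, rootId);
        var dead = new List<int>();
        foreach (int id in references.Keys)
        {
            if (!live.Contains(id))
                dead.Add(id);
        }
        dead.Sort();
        return dead;
    }
}
```

In this model, a page that has been removed from the pages-array but is still the target of an outline entry stays in the reachable set, together with its content stream. Only objects that nothing references any more are reported by Unreachable, which matches the behavior observed in the test case.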

As you said, it would be possible to analyze a document on a detailed level and remove all resources that are not used in at least one content stream. Or remove all pages that cannot be reached from the Pages dictionary, and then remove all references to these pages from, for example, annotations. But what do we do with an outline entry that points to a removed page?

The benefit of the clean-up is (only) a smaller PDF file; the visual result stays the same. If a developer deletes or reorders pages of an existing document with outlines, they have a reason for doing so. Therefore, they must also write code that fixes the outline tree.
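What "fixing the outline tree" means depends on the application. As one possible policy, the sketch below drops entries that point at a deleted page and have no surviving children, and keeps entries that still have children as plain headings without a destination. OutlineNode and OutlinePruneSketch are hypothetical stand-ins and not PDFsharp types, and pages are assumed to be identified by a stable ID rather than by their index, which shifts when pages are removed.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for an outline entry; NOT a PDFsharp type.
public sealed class OutlineNode
{
    public string Title { get; }
    public int? TargetPage { get; set; } // stable page ID; null means "no destination"
    public List<OutlineNode> Children { get; } = new List<OutlineNode>();

    public OutlineNode(string title, int? targetPage)
    {
        Title = title;
        TargetPage = targetPage;
    }
}

public static class OutlinePruneSketch
{
    // Removes entries that point at a deleted page and have no surviving children.
    // An entry with surviving children is kept, but its destination is cleared.
    public static void Prune(List<OutlineNode> entries, ISet<int> deletedPages)
    {
        // Iterate backwards so RemoveAt does not disturb unvisited indices.
        for (int i = entries.Count - 1; i >= 0; i--)
        {
            var entry = entries[i];
            Prune(entry.Children, deletedPages); // fix the subtree first

            bool dangling = entry.TargetPage.HasValue
                            && deletedPages.Contains(entry.TargetPage.Value);
            if (!dangling)
                continue;

            if (entry.Children.Count == 0)
                entries.RemoveAt(i);     // nothing left worth keeping
            else
                entry.TargetPage = null; // keep as a heading for its children
        }
    }
}
```

Other policies are equally defensible, such as retargeting the entry to the next surviving page or promoting the children one level up, which is part of why a library cannot easily make this decision on the developer's behalf.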

Because there are far more important features we can implement, we do not plan to change the current behavior in the near future.

packdat commented 2 months ago

I agree, thanks for the answer! Glad I wasn't overlooking something obvious.