ajrcarey / pdfium-render

A high-level idiomatic Rust wrapper around Pdfium, the C++ PDF library used by the Google Chromium project.
https://crates.io/crates/pdfium-render
Other
364 stars 59 forks source link

Remove PdfPages::delete_page_at_index() in favour of PdfPage::delete(). #67

Closed ajrcarey closed 1 year ago

ajrcarey commented 1 year ago

If a reference to a PdfPage is retrieved using PdfPages::get(), then that page is deleted from the document using PdfPages::delete_page_at_index(), we still possess a (now invalid) PdfPage reference. This is effectively a use-after-free violation. Avoid this by removing the PdfPages::delete_page_at_index() function and instead use a PdfPage::delete() function that consumes mut self. This ensures the caller cannot hold an invalid reference to a page after it is deleted.

ajrcarey commented 1 year ago

Added PdfPages::delete_page_at_index() to deprecated items list as part of #36.

ajrcarey commented 1 year ago

Added PdfPages::delete_page_range() to deprecated items list as part of #36.

ajrcarey commented 1 year ago

Actually, if PdfDocument::pages() were to return a &PdfPages / &mut PdfPages reference (rather than the owned PdfPages instance it returns at the moment), and if PdfPages::get() were to return a &PdfPage / &mut PdfPage reference (rather than the owned PdfPage instance it returns at the moment), then it might be safe to reinstate these functions, as Rust would be able to manage the lifetimes safely. This could be done as part of #47. For now, though, deprecating the functions is the safe move.

ajrcarey commented 1 year ago

Deprecated functions in PdfPages. Added PdfPage::delete() function, taking advantage of new PdfPageIndexCache introduced in #60. Removed references to deprecated functions from examples/concat.rs and examples/copy_objects.rs. Removed reference to deprecated function from PdfPageIndexCache unit tests. Updated documentation. Package for inclusion in release 0.7.30.

mara004 commented 1 year ago

As PDFium only provides a document-level page deleter, how did you implement this? How can you be sure to get the right index for a PdfPage?

ajrcarey commented 1 year ago

Page indices of each FPDF_PAGE in each open FPDF_DOCUMENT are cached as the page references are instantiated by Pdfium, either when a caller retrieves a page or creates a new page. This cache of page indices is used from within PdfPage::delete(). The implementation is at https://github.com/ajrcarey/pdfium-render/blob/master/src/page_index_cache.rs.

mara004 commented 1 year ago

I see. We can't do this in pypdfium2, as we also expose the raw pdfium API, so the caller might do operations that change page indices without helper classes knowing. In general, I would find it safer if pdfium could just provide something like your get_index_for_page() natively to reduce the risk of a page index tracker getting out of sync.

ajrcarey commented 1 year ago

I suspect it would be difficult to manage in a garbage-collected language because the point at which references are dropped is likely non-deterministic, making it difficult to know when to update any cache. Rust's memory model proves advantageous in this case.

mara004 commented 1 year ago

Sorry, I don't understand yet. What has this got to do with object references? Apart from that, attaching code to be run on garbage collection of an object is possible in Python.

ajrcarey commented 1 year ago

When a caller calls FPDF_ClosePage() (or whatever the function is called), any reference to that FPDF_PAGEobject that is still held by any other caller is no longer valid (a "dangling" reference). Garbage collection is not an ideal way to determine when to tidy up these dangling references, because there will be a non-deterministic delay between invalidating the reference and dropping any held copies.

However, like you say, you expose the raw Pdfium API, so the point is somewhat moot as I imagine you probably don't have a way of intercepting when callers are creating and releasing FPDF_* object references.

mara004 commented 1 year ago

Yes, the non-deterministic nature of garbage collection is a problem, see this section of pypdfium2's readme.

However, I think we're kind of drifting away from the initial point. What I meant was just that a page object -> index mapping can be slightly dangerous if not implemented natively in PDFium. If there is any change to page order the mapping fails to track, then page.delete() might remove a different page than intended. It will be OK if your code is correct and you do not expose the raw API, though. (Except that it also adds additional, hidden complexity/expense on top of pdfium's API.)

In pypdfium2 this is something we cannot do, as callers are encouraged to use helpers in conjunction with the raw API, and the raw API may potentially be used to do arbitrary modifications to page order.

There's not much point in attempting to intercept raw API calls, because (a) we want to make the raw API available to the caller as-is (it would be confusing if calling a raw function would implicitly trigger something) (b) new APIs that change page order might be added to pdfium, so a page cache would imply the same risks, and (c) the purpose does not justify the complexity.