Closed ajrcarey closed 1 year ago
Added PdfPages::delete_page_at_index()
to deprecated items list as part of #36.
Added PdfPages::delete_page_range()
to deprecated items list as part of #36.
Actually, if PdfDocument::pages()
were to return a &PdfPages
/ &mut PdfPages
reference (rather than the owned PdfPages
instance it returns at the moment), and if PdfPages::get()
were to return a &PdfPage
/ &mut PdfPage
reference (rather than the owned PdfPage
instance it returns at the moment), then it might be safe to reinstate these functions, as Rust would be able to manage the lifetimes safely. This could be done as part of #47. For now, though, deprecating the functions is the safe move.
Deprecated functions in PdfPages
. Added PdfPage::delete()
function, taking advantage of new PdfPageIndexCache
introduced in #60. Removed references to deprecated functions from examples/concat.rs
and examples/copy_objects.rs
. Removed reference to deprecated function from PdfPageIndexCache
unit tests. Updated documentation. Package for inclusion in release 0.7.30.
As PDFium only provides a document-level page deleter, how did you implement this?
How can you be sure to get the right index for a PdfPage
?
Page indices of each FPDF_PAGE
in each open FPDF_DOCUMENT
are cached as the page references are instantiated by Pdfium, either when a caller retrieves a page or creates a new page. This cache of page indices is used from within PdfPage::delete()
. The implementation is at https://github.com/ajrcarey/pdfium-render/blob/master/src/page_index_cache.rs.
I see. We can't do this in pypdfium2, as we also expose the raw pdfium API, so the caller might do operations that change page indices without helper classes knowing.
In general, I would find it safer if pdfium could just provide something like your get_index_for_page()
natively to reduce the risk of a page index tracker getting out of sync.
I suspect it would be difficult to manage in a garbage-collected language because the point at which references are dropped is likely non-deterministic, making it difficult to know when to update any cache. Rust's memory model proves advantageous in this case.
Sorry, I don't understand yet. What has this got to do with object references? Apart from that, attaching code to be run on garbage collection of an object is possible in Python.
When a caller calls FPDF_ClosePage()
(or whatever the function is called), any reference to that FPDF_PAGE
object that is still held by any other caller is no longer valid (a "dangling" reference). Garbage collection is not an ideal way to determine when to tidy up these dangling references, because there will be a non-deterministic delay between invalidating the reference and dropping any held copies.
However, like you say, you expose the raw Pdfium API, so the point is somewhat moot as I imagine you probably don't have a way of intercepting when callers are creating and releasing FPDF_* object references.
Yes, the non-deterministic nature of garbage collection is a problem, see this section of pypdfium2's readme.
However, I think we're kind of drifting away from the initial point. What I meant was just that a page object -> index mapping can be slightly dangerous if not implemented natively in PDFium. If there is any change to page order the mapping fails to track, then page.delete()
might remove a different page than intended. It will be OK if your code is correct and you do not expose the raw API, though. (Except that it also adds additional, hidden complexity/expense on top of pdfium's API.)
In pypdfium2 this is something we cannot do, as callers are encouraged to use helpers in conjunction with the raw API, and the raw API may potentially be used to do arbitrary modifications to page order.
There's not much point in attempting to intercept raw API calls, because (a) we want to make the raw API available to the caller as-is (it would be confusing if calling a raw function would implicitly trigger something) (b) new APIs that change page order might be added to pdfium, so a page cache would imply the same risks, and (c) the purpose does not justify the complexity.
If a reference to a
PdfPage
is retrieved usingPdfPages::get()
, then that page is deleted from the document usingPdfPages::delete_page_at_index()
, we still possess a (now invalid)PdfPage
reference. This is effectively a use-after-free violation. Avoid this by removing thePdfPages::delete_page_at_index()
function and instead use aPdfPage::delete()
function that consumesmut self
. This ensures the caller cannot hold an invalid reference to a page after it is deleted.