galkahana / HummusJS

Node.js module for high performance creation, modification and parsing of PDF files and streams
http://www.pdfhummus.com

Delete object from pdf #485

Open zzemchik opened 1 year ago

zzemchik commented 1 year ago

Hello! I'm trying to delete an object in a copied PDF file. The object is deleted visually, but still remains in memory. How do I completely remove an object from a file? I use:

inPDFWriter.GetObjectsContext().StartModifiedIndirectObject(xobjectID);
objectsRegistry.DeleteObject(xobjectID);

galkahana commented 1 year ago

removing an object from a file is not really an option with PDF Modifications. You may mark it as deleted with objectsRegistry.DeleteObject(xobjectID); which means that a reader application ignores its content.

I'm fairly sure inPDFWriter.GetObjectsContext().StartModifiedIndirectObject(xobjectID); is not required, and delete is sufficient.

Depending on how exactly you are copying the file, you can avoid copying the object in the first place, or replace it (via ReplaceSourceObjects) with an object whose content is null (just create a new object, make its content the null PDF keyword and finish it; now you have a null object).

zzemchik commented 1 year ago

Oh, really hard. How can I copy objects step by step while replacing old ones? I'm trying to do it something like this:

inPDFWriter.StartPDF("/home/ivan/pdf/test_2_image_modyfy.pdf", ePDFVersion14);
// the file I'm trying to copy
std::shared_ptr<PDFDocumentCopyingContext> copyingContext(inPDFWriter.CreatePDFCopyingContext("/home/ivan/pdf/mini_pdf.pdf"));
// let's imagine that I created objectIDTypeList
copyingContext->CopyNewObjectsForDirectObject(objectIDTypeList);
inPDFWriter.EndPDF();

And when I do this, my PDF is always broken; some objects inside are not completed.

In general, my task is to copy a PDF file with the substitution of some objects (pictures), how can I do this? Sorry to waste your time, I'm just a little short on documentation...

galkahana commented 1 year ago

It's alright, sorry for the docs being short. Oh, and now I realize this actually belongs in PDFWriter... right? The code is C++, so let me at least answer using the C++ names.

OK, so for general copying of a PDF while replacing some of its objects, you want a combination of copying the full PDF and using copyingContext->ReplaceSourceObjects for the objects you want to replace.

To copy the full PDF you can use code similar to the library's RecryptPDF function, which does just that: it copies the full PDF by recursively copying from the root object using copyingContext->CopyObject (which is recursive), then sets the root of the new PDF to the copied object. You can find the code here: https://github.com/galkahana/PDF-Writer/blob/master/PDFWriter/PDFWriter.cpp#L846
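To make the "recursive copy from the root, with some objects swapped out" idea concrete, here is a toy model in plain C++. It is not PDFWriter code and not how the library is implemented: `ObjectID`, `Graph` and `ToyCopier` are made-up names for illustration, and an "object" is reduced to the list of IDs it references. The point it shows is that if a source ID already has a target ID registered before copying starts, the recursive copy uses that target ID and never descends into the original object.

```cpp
#include <map>
#include <utility>
#include <vector>

// Toy stand-ins, NOT PDFWriter types: an "object" is just the list of
// object IDs it references.
using ObjectID = unsigned long;
using Graph = std::map<ObjectID, std::vector<ObjectID>>;

struct ToyCopier {
    const Graph& source;
    Graph target;                                // objects written so far
    std::map<ObjectID, ObjectID> sourceToTarget; // pre-seeded with replacements
    ObjectID nextFreeID;

    ToyCopier(const Graph& src, Graph initialTarget,
              std::map<ObjectID, ObjectID> replacements, ObjectID firstFreeID)
        : source(src), target(std::move(initialTarget)),
          sourceToTarget(std::move(replacements)), nextFreeID(firstFreeID) {}

    // Returns the target ID for a source object, copying it (and everything
    // it references) the first time it is seen.
    ObjectID copyObject(ObjectID sourceID) {
        auto known = sourceToTarget.find(sourceID);
        if (known != sourceToTarget.end())
            return known->second; // already copied, or replaced up front

        ObjectID newID = nextFreeID++;
        sourceToTarget[sourceID] = newID; // register before recursing (cycles)

        std::vector<ObjectID> newRefs;
        for (ObjectID ref : source.at(sourceID))
            newRefs.push_back(copyObject(ref));
        target[newID] = newRefs;
        return newID;
    }
};
```

For example, with a source graph where 1 references 2 and 3, 2 references 3, and 3 references 4, pre-seeding the map with 3 → 100 and calling `copyObject(1)` produces a new root whose references point at the copy of 2 and at 100. Objects 3 and 4 are never copied, which is the effect you want when swapping an image.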

Now, prior to calling CopyObject you'll want to create the replacement images in the target PDF. Alternatively, allocate object IDs for them up front and write the images after copying, if you prefer to do them afterwards. Once you have the new IDs and have collected the IDs of the source images you want to replace, call ReplaceSourceObjects with the mapping:

void ReplaceSourceObjects(const ObjectIDTypeToObjectIDTypeMap& inSourceObjectsToNewTargetObjects);

The map keys are the original image IDs (those in the source document), and the values are the target image IDs.
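As a small illustration of the key/value direction, here is a sketch that builds such a map from two parallel lists of IDs. The two aliases are declared locally so the snippet compiles on its own; they are assumed to match the library's `ObjectIDType` and `ObjectIDTypeToObjectIDTypeMap` (in real code, include the PDFWriter headers instead of redeclaring them). `buildReplacementMap` is a hypothetical helper, not a library function.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <vector>

// Local aliases, ASSUMED to mirror PDFWriter's own typedefs.
using ObjectIDType = unsigned long;
using ObjectIDTypeToObjectIDTypeMap = std::map<ObjectIDType, ObjectIDType>;

// Hypothetical helper: pairs each source image ID (key) with the ID of its
// replacement in the target document (value).
ObjectIDTypeToObjectIDTypeMap buildReplacementMap(
    const std::vector<ObjectIDType>& sourceImageIDs,
    const std::vector<ObjectIDType>& targetImageIDs)
{
    if (sourceImageIDs.size() != targetImageIDs.size())
        throw std::invalid_argument("source and target ID lists differ in length");

    ObjectIDTypeToObjectIDTypeMap result;
    for (std::size_t i = 0; i < sourceImageIDs.size(); ++i)
        result[sourceImageIDs[i]] = targetImageIDs[i];
    return result;
}
```

The resulting map is what you would pass to ReplaceSourceObjects before the first CopyObject call.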

So the code should largely look something like this:

// In advance, create inPDFWriter with the target file. Let's also assume that
// you took care of creating the new images in the file and that you have their
// mapping to source IDs in `ObjectIDTypeToObjectIDTypeMap sourceImagesToTargetImages`.

PDFDocumentCopyingContext* copyingContext = inPDFWriter.CreatePDFCopyingContext("/home/ivan/pdf/mini_pdf.pdf");

// set the replacement map, prior to copying
copyingContext->ReplaceSourceObjects(sourceImagesToTargetImages);

// get its root object ID
PDFObjectCastPtr<PDFIndirectObjectReference> catalogRef(copyingContext->GetSourceDocumentParser()->GetTrailer()->QueryDirectObject("Root"));

// deep-copy the whole pdf through its root - return root object ID copy at new PDF
EStatusCodeAndObjectIDType copyCatalogResult = copyingContext->CopyObject(catalogRef->mObjectID);

// set new root object ID as this document root
inPDFWriter.GetDocumentContext().GetTrailerInformation().SetRoot(copyCatalogResult.second);

delete copyingContext;

// You probably want to end the PDF after that. Given that we set the root object
// directly, the APIs for adding pages and such probably won't function properly;
// this case calls for lower-level treatment.
inPDFWriter.EndPDF();

Note: depending on your overall intent you might want to copy only part of the document, like specific pages. In this case, don't query the original root and set the resulting root on the target document. Rather, query the original object (say, a page) and create a relevant target object (say, a page). We can get into this difference if it matters to you.

zzemchik commented 1 year ago

Yes it works as I wanted, thank you very much! I would like to ask a couple more questions, is CreateImageXObjectFromJPGFile the only way to create an image? I tried CreateXObjectFromJPGFile, but I don’t quite understand how to interact with it so that the image is replaced. And the explanation from the note, do you mean the scenario when I need to copy not the entire PDF, but only individual pages? It’s just that in my case I always need to copy the entire PDF. And as I understand it, there will always be recursive copying.

galkahana commented 1 year ago

First, on your second question: if you need to copy the whole PDF, don't mind my note :).

As for images, CreateImageXObjectFromJPGFile and CreateFormXObjectFromJPGFile are good choices, where the latter will create a form xobject with the native size of the image, instead of a 1x1 image object that you have to scale (well... maybe you'll need to scale the form as well). There are similar methods for PNG images and TIFF images. I never took the time to make a single method for all of those. Maybe something to add at some point.

CreateFormXObjectFromWHATEVERFile gets a file path (or stream) and then embeds the image in the file. Again, CreateImageXObjectFromWHATEVERFile, if available, will provide a 1x1 image that you can size, and CreateFormXObjectFromWHATEVERFile uses the size it reads from the image. If you have your own size (maybe fitting to the box of the original image), you may want to create a wrapper form doing the sizing. Talk to me if you want to know how to do this and can't find an example/doc.

CreateFormXObjectFromWhateverFile returns a PDFFormXObject pointer which you can use to acquire its id with formXObject->GetObjectID(). That id is what you want to use for the "target" value of the source to target map provided for the later ReplaceSourceObjects.

you should also delete that formXObject object once you are done with it.

By the way, if you use CreateImageXObjectFromJPGFile instead, you get back a PDFImageXObject, and you can call its GetImageObjectID to get its ID. You'll probably want to create a form to size it up and then use that form's ID in your map.

There are examples of how to use these functions in the test files of PDFWriter. For example, here's a test using CreateImageXObjectFromJPGFile

Depending on what exactly you end up doing I can provide further help, but let's first see how you want to approach this.

zzemchik commented 1 year ago

I think I managed to figure it out. Now my program works as I wanted. Thank you very much for your help and for the library)