galkahana / HummusJS

Node.js module for high performance creation, modification and parsing of PDF files and streams
http://www.pdfhummus.com

Extract or get information about images in the PDF #222

Open peter-borgstedt opened 6 years ago

peter-borgstedt commented 6 years ago

Is this possible? Do I need to implement the logic for this myself, maybe with the parser? Can I retrieve the image streams and rebuild the images from them? Are there any explanations or tips on how to go about this?

What kind of objects are images when parsed? How do I identify these?

peter-borgstedt commented 6 years ago

https://github.com/galkahana/HummusJS/wiki/Parsing

The following is explained there: "... if you are looking for querying the images, or the text or these kind of high level methods, they are not there. You can build them using the parser, and it will take you some of the way (for instance, you can get a good decoded content stream reader, and you can get easy access and simple object read for dictionaries, strings, numbers etc), but you'll have to understand some PDF."

I have written a parsing routine that collects all the dictionaries for images. I can see the resolution of each one, so that part is now solved.

However, how do I do the last part: getting the stream that holds the actual image data?

I'm interested in the color information of each image. I want to be sure there are no transparent images in the PDFs, because the place where I'm going to use them does not support PDFs containing elements with a transparent background.

Any idea how I can get that information? Do I need the image data, and do I have to recreate the images in order to analyse them (maybe with another library)?
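
For readers following along, the identification step described in this comment comes down to a few standard dictionary keys: in the PDF format an image is an XObject whose dictionary has `/Subtype /Image`, and `/Width` and `/Height` give its size in pixels. The sketch below is only an illustration under stated assumptions: it takes a plain JavaScript object that mirrors the parsed dictionary, and the function names and the `dict` shape are hypothetical, not HummusJS API.

```javascript
// Hypothetical helper: `dict` is a plain-object view of a parsed
// XObject dictionary (names as strings, numbers as numbers).
// /Type is optional for XObjects, so only /Subtype is required here.
function isImageXObject(dict) {
  return dict != null &&
    dict.Subtype === 'Image' &&
    (dict.Type === undefined || dict.Type === 'XObject');
}

// Returns the pixel dimensions stored in the image dictionary,
// or null when the dictionary does not describe an image.
function describeImage(dict) {
  if (!isImageXObject(dict)) return null;
  return {
    width: dict.Width,
    height: dict.Height,
    bitsPerComponent: dict.BitsPerComponent,
  };
}

const sample = { Type: 'XObject', Subtype: 'Image', Width: 640, Height: 480, BitsPerComponent: 8 };
console.log(describeImage(sample));
```

How the parsed dictionary gets converted into such a plain object depends on the parser in use and is left out on purpose.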

peter-borgstedt commented 6 years ago

I found that I can look at the ICCBased color space and read the number of components from it. All of mine had 3, so RGB. It seems that 1 corresponds to /DeviceGray and 4 to CMYK. https://blog.idrsolutions.com/2011/04/understanding-the-pdf-file-format-%E2%80%93-iccbased-colorspaces/
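
The mapping described above can be written down as a small lookup. This is a sketch only: it assumes the component count (the `/N` entry of the ICC profile stream dictionary) has already been read as a JavaScript number, and the function name is made up for illustration. Note that three components usually means RGB but can also be another three-component space such as Lab, so the result is a hint rather than a guarantee.

```javascript
// Hypothetical helper: maps the /N value of an ICCBased color space
// to the color family it most likely represents.
function colorFamilyFromComponents(n) {
  switch (n) {
    case 1: return 'Gray';
    case 3: return 'RGB'; // could also be Lab; check the profile if it matters
    case 4: return 'CMYK';
    default: return 'unknown';
  }
}

console.log(colorFamilyFromComponents(3));
```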

peter-borgstedt commented 6 years ago

I still cannot find out whether an image has transparency.
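
One possible starting point, offered as a sketch rather than a complete answer: the PDF format lets an image dictionary carry a few entries that indicate masking or transparency, namely `/SMask` (a soft mask, i.e. an alpha channel), `/Mask` (a stencil or color-key mask), `/SMaskInData` (for JPEG 2000 data) and `/ImageMask`. The helper below lists which of these are present. It assumes a plain JavaScript object mirroring the image dictionary; the function name and object shape are hypothetical. Transparency can also be set outside the image (for example in the graphics state), which this check does not see.

```javascript
// Hypothetical helper: returns the names of the image-dictionary
// entries that suggest the image is masked or partly transparent.
function transparencyHints(dict) {
  const hints = [];
  if (dict.SMask !== undefined) hints.push('SMask');
  if (dict.Mask !== undefined) hints.push('Mask');
  // /SMaskInData is only meaningful when it is a non-zero integer.
  if (typeof dict.SMaskInData === 'number' && dict.SMaskInData !== 0) {
    hints.push('SMaskInData');
  }
  if (dict.ImageMask === true) hints.push('ImageMask');
  return hints;
}

const opaque = { Subtype: 'Image', Width: 10, Height: 10 };
const masked = { Subtype: 'Image', Width: 10, Height: 10, SMask: { ref: 12 } };
console.log(transparencyHints(opaque).length === 0, transparencyHints(masked));
```

An empty result means none of these entries were found in the dictionary, not that the page as a whole is free of transparency.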

peter-borgstedt commented 6 years ago

And how do I get the actual image data? Is it possible to know what format the image is in, or is it converted into some other kind of format?
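
On the format question, the `/Filter` entry of the image stream dictionary is the usual clue: `DCTDecode` means the stream holds JPEG data, `JPXDecode` JPEG 2000, `CCITTFaxDecode` fax-encoded data and `JBIG2Decode` JBIG2, while filters such as `FlateDecode` only compress raw pixel samples. The following is a rough sketch of that lookup. It assumes the filter has already been read as a string or an array of strings; the function name is made up for illustration.

```javascript
// Filters that correspond to a recognisable image encoding.
const FORMAT_BY_FILTER = new Map([
  ['DCTDecode', 'JPEG'],
  ['JPXDecode', 'JPEG 2000'],
  ['CCITTFaxDecode', 'CCITT fax'],
  ['JBIG2Decode', 'JBIG2'],
]);

// Hypothetical helper: `filter` is the /Filter value as a name string,
// an array of name strings, or undefined when the stream is unfiltered.
// With several filters, the image codec is normally the last one listed.
function guessImageFormat(filter) {
  const filters = filter === undefined ? [] : [].concat(filter);
  const last = filters[filters.length - 1];
  return FORMAT_BY_FILTER.get(last) || 'raw samples';
}

console.log(guessImageFormat(['ASCII85Decode', 'DCTDecode']));
```

For the "raw samples" case the bytes are not a standalone image file; width, height, bits per component and color space from the dictionary are needed to interpret them.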

surendra-y commented 6 years ago

The image is saved in a separate object (called an XObject), which stores the binary data of the image. That data is not exactly our JPEG or GIF file, but you can extract the image from the PDF. If the image was stored in raw format and is only downscaled for display, then you can probably scale it up to get the higher-quality image.

But you cannot get the metadata like file name, file format, creation date etc. of image inserted.
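
As a rough illustration of pulling the binary data out of such a stream, the sketch below drains a byte reader into a Node.js `Buffer`. It rests on an assumption that is not confirmed in this thread: that the parser hands back a reader object exposing `notEnded()` and `read(count)`, where `read` returns an array of byte values (as I understand the reader returned by `pdfReader.startReadingFromStream(...)` in HummusJS to behave). The function name `readAllBytes` and the chunk size are made up for the example.

```javascript
// Hypothetical helper: collects everything a byte reader yields.
// Assumed reader interface: notEnded() -> boolean,
// read(count) -> array of numbers in the range 0..255.
function readAllBytes(byteReader, chunkSize = 10000) {
  const chunks = [];
  while (byteReader.notEnded()) {
    const bytes = byteReader.read(chunkSize);
    if (bytes.length === 0) break; // guard against a reader that stalls
    chunks.push(Buffer.from(bytes));
  }
  return Buffer.concat(chunks);
}

// Stand-in reader so the sketch runs without a PDF at hand.
function fakeReader(data) {
  let pos = 0;
  return {
    notEnded: () => pos < data.length,
    read: (count) => {
      const part = data.slice(pos, pos + count);
      pos += part.length;
      return part;
    },
  };
}

console.log(readAllBytes(fakeReader([255, 216, 255, 224]), 3).length);
```

Whether the resulting bytes are a usable file on their own depends on how the stream was encoded, so the stream dictionary still has to be consulted.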

For more reference:

I hope it helps.