Hopding / pdf-lib

Create and modify PDF documents in any JavaScript environment
https://pdf-lib.js.org
MIT License

Extract images from a pdf page #83

Closed totorelmatador closed 5 years ago

totorelmatador commented 5 years ago

Hi everyone! I am trying to extract all images from a PDF page. I don't know if it is possible, but I would like to do something like this website does. I am currently manipulating the PDF as follows:

const pdfDoc = PDFDocumentFactory.load('pdf/path');
const pages = pdfDoc.getPages();
const existingPage = pages[0];

Thank you for your answers :)

Hopding commented 5 years ago

Hello @totorelmatador!

There are a couple of ways to go about this. Some more challenging than others. I wrote a Node script that "scans" an existing PDF and finds all the images it contains, and redraws them all on a new page:

redraw-images.zip

You just need to unzip the file and run yarn install (or npm install) and then run node index.js. The script will write the new PDF to modified.pdf. Here's what modified.pdf looks like:

modified.pdf

The script will also log some information about each image in the document, e.g.

Images in PDF:
Name: JfImage0001
  Width: 176
  Height: 157
  Bits Per Component: 1
  Data: Uint8Array(1778)
  Ref: 20 0 R
...
Name: JfImage0036
  Width: 556
  Height: 271
  Bits Per Component: 8
  Data: Uint8Array(461)
  Ref: 58 0 R

Let me know if this is what you're looking for, or if you have any questions!

totorelmatador commented 5 years ago

Hello @Hopding !

Thank you so much for your answer ! This is exactly the kind of thing I was trying to do ! But I still have a question. My final objective is to save these images as separated files. I tried to do so with the following code (added to your file index.js) :

var i = 0;
imagesInDoc.forEach(image => {
  fs.writeFile("./images/out"+i+".png", image.data, 'base64', function(err) {
    console.log(err);
  });
  i+=1;
});

But it doesn't work well. Saved images can't be opened... The funny thing is that the code works for some images. When I try on this document :

existing.pdf

Only one of the two images is saved and ready to be opened. I think that the cause is the transparency of the image, but I would like to know if it is possible to face this issue...

Thank you again for your time !
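A possible explanation, not confirmed in this thread: a stream stored with the DCTDecode filter is already a complete JPEG file, so writing its bytes to disk gives something most viewers can open (even with a .png extension), while a FlateDecode stream holds zlib-compressed pixel samples and is not a standalone image file. Note also that Node's fs.writeFile ignores the encoding argument when the data is a buffer, so 'base64' has no effect here. The sketch below uses a hypothetical helper, sniffImageKind, to tell the two cases apart by their first bytes:

```javascript
// Magic-number prefixes of the two standalone formats discussed in this thread.
const JPEG_MAGIC = [0xff, 0xd8, 0xff];
const PNG_MAGIC = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

// True when `bytes` begins with every value in `magic`.
const startsWith = (bytes, magic) =>
  bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);

// Hypothetical helper: classify an extracted stream by its leading bytes.
const sniffImageKind = (bytes) => {
  if (startsWith(bytes, JPEG_MAGIC)) return 'jpg';
  if (startsWith(bytes, PNG_MAGIC)) return 'png';
  // Most likely compressed pixel samples that still need decoding and
  // re-encoding before any image viewer can open them.
  return 'raw';
};
```

Only streams reported as 'jpg' (or, rarely, 'png') can be written to disk unchanged; anything reported as 'raw' needs the kind of decoding discussed later in this thread.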

Hopding commented 5 years ago

@totorelmatador Sorry for taking so long to respond to this. I've been swamped with work and school lately, so I haven't had a lot of time to devote to this. However, I've made some progress on creating an example script that shows how to do this (though there are some limitations). I'll try to post a more detailed response soon.

totorelmatador commented 5 years ago

Thank you a lot for your time @Hopding! I have observed something. When we add a PNG image to a PDF file, we use the following function:

[imgRef, imgDims] = pdfDoc.embedPNG(PNGimage)

The type of imgRef is PDFIndirectReference, that of imgDims is PNGXObjectFactory, and PNGimage is an image buffer. When we find all the image objects in the PDF, we use the following code:

pdfDoc.index.index.forEach((pdfObject, ref) => {
  objectIdx += 1;

  if (!(pdfObject instanceof PDFRawStream)) return;

  const { lookupMaybe } = pdfDoc.index;
  const { dictionary: dict } = pdfObject;

  const subtype = lookupMaybe(dict.getMaybe('Subtype'));
  const width = lookupMaybe(dict.getMaybe('Width'));
  const height = lookupMaybe(dict.getMaybe('Height'));
  const name = lookupMaybe(dict.getMaybe('Name'));
  const bitsPerComponent = lookupMaybe(dict.getMaybe('BitsPerComponent'));

  if (subtype === PDFName.from('Image')) {
    imagesInDoc.push({
      ref,
      name: name ? name.key : `Object${objectIdx}`,
      width: width.number,
      height: height.number,
      bitsPerComponent: bitsPerComponent.number,
      data: pdfObject.content,
    });
  }
});

where every pdfObject is a PDFRawStream object and every ref is a PDFIndirectReference. Is it possible to extract the image buffer associated with a given pdfObject/ref pair?

Hopding commented 5 years ago

Hello @totorelmatador! I've finally gotten some time to finish up my investigation into this.

First off, it is possible to extract all image types from a PDF using pdf-lib. The question is, how much code will you have to write on top of pdf-lib to do this. It turns out, you'll have to write a fair amount of code if you want to handle all possible images in any type of PDF file.

pdf.js is an open source PDF rendering engine maintained by Mozilla. It's all written in JavaScript. So, of course, this library must be able to extract and render all types of images. This makes it a very good reference to see how this might be done using pdf-lib.

In particular, its PDFImage class is worth looking at. All of this logic would need to be ported over to use pdf-lib in order to handle all possible types of images. This is because the embedded image format outlined in the PDF specification is pretty long and complicated (as are many things in PDF files).


All of that being said, I created a script that extracts the more common image formats from PDF files. Here it is:

extract-images.zip

You just need to unzip the file and run yarn install (or npm install) and then run node index.js existing1.pdf or node index.js existing2.pdf. The script will extract as many embedded images as it can from the PDF into the images/ directory.

Again, this does not extract all possible types of images. Just the more common formats. It could certainly be improved by porting some code from pdf.js.


I did a bit of googling to see if pdf.js has an API to extract images from PDFs. It looks like this may be possible for certain types of images: https://github.com/mozilla/pdf.js/issues/7813 https://github.com/mozilla/pdf.js/issues/7043. But full support doesn't yet seem available.

I think that adding proper support for image extraction would be an interesting feature to implement in pdf-lib. I imagine it would be quite useful to many developers. However, unless somebody from the community decides to work on this, there are several other things I have to work on first. So it'll be a while before this feature lands in pdf-lib.

totorelmatador commented 5 years ago

It does work perfectly, thank you a lot !

mealCode commented 5 years ago

thanks a lot, this helps me 29/07/2019 :)

danielhanford commented 4 years ago

Thank you so much! Exactly what I needed and worked perfectly. Aces!

mcmspark commented 4 years ago

This does not work on scanned pdf. It results in an "unknown compression method error"

mcmspark commented 4 years ago

The page images use Filter = JBIG2Decode
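This error is probably what a zlib inflate call reports when it is handed data that was never zlib-compressed, which is what happens if every non-JPEG stream is treated as FlateDecode. A minimal sketch of a guard, assuming the filter has been read from the image dictionary and converted to a string such as "/JBIG2Decode" (the table and the planForFilter name are hypothetical, and a /Filter given as an array is simply treated as unsupported):

```javascript
// Hypothetical table: how a simple extractor might treat each /Filter value.
const FILTER_HANDLING = new Map([
  ['DCTDecode', 'write-as-jpeg'],         // stream is already a JPEG file
  ['FlateDecode', 'inflate-then-encode'], // zlib-compressed pixel samples
]);

// Map a filter name (with or without the leading slash) to a handling plan.
// Anything not listed, e.g. JBIG2Decode or CCITTFaxDecode, is reported as
// unsupported instead of being passed to an inflate call.
const planForFilter = (filterName) => {
  const key = String(filterName).replace(/^\//, '');
  return FILTER_HANDLING.get(key) ?? 'unsupported';
};
```

With a check like this, scanned PDFs would be skipped with a clear message rather than failing inside the decompressor.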

jowo-io commented 3 years ago

@Hopding thanks for taking the time to post such a useful reply and provide the script. I was wondering if there's an easy way to get the x/y position of the image as well as the width/height?

Swapnil-Kunjir commented 2 years ago

My PDF contains images and tables. I need to remove the images from all pages of the PDF, keep the tables as they are, and save the new document. Is this possible?

hafsa-dmnt commented 2 years ago

Hi, I tried the solution from @Hopding's comment above (the extract-images.zip script).

When saving the PNG files, I noticed that the alphaLayer used is the image itself and not the real alpha layer (the SMask image) that we get.

So I changed it and added image.alphaLayer = smaskimg;

The problem now is that the image doesn't load completely, as if the SMask and the image itself had different dimensions. Has anyone encountered this error before?

Thanks :)

ps : the full image without smask image

the full image when adding smaskimg image
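One possible explanation for this symptom (an assumption, not verified against the script): an RGB buffer has three bytes per pixel while an 8-bit SMask has one, so indexing the alpha buffer with the RGB byte offset runs past the end of the mask after the first third of the image. A minimal sketch of the merge, assuming both buffers are already inflated, both use 8 bits per component, and both have the same pixel dimensions (mergeRgbWithAlpha is a hypothetical name):

```javascript
// Interleave 8-bit RGB samples with an 8-bit soft mask into RGBA.
const mergeRgbWithAlpha = (rgb, alpha) => {
  const pixelCount = alpha.length;
  if (rgb.length !== pixelCount * 3) {
    throw new Error(
      `Size mismatch: ${rgb.length} RGB bytes for ${pixelCount} alpha samples`
    );
  }
  const rgba = new Uint8Array(pixelCount * 4);
  for (let p = 0; p < pixelCount; p++) {
    rgba[p * 4] = rgb[p * 3];
    rgba[p * 4 + 1] = rgb[p * 3 + 1];
    rgba[p * 4 + 2] = rgb[p * 3 + 2];
    rgba[p * 4 + 3] = alpha[p]; // indexed by pixel, not by RGB byte offset
  }
  return rgba;
};
```

The explicit size check also surfaces the case where the SMask really does have different dimensions from the image, which this sketch does not try to handle.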

Dragon3DGraff commented 2 years ago

Hi hafsa110, I solved this problem by simply removing the "- 1" in the savePng function.

yannbertrand commented 2 years ago

Just a small refresh of the proposed code in another context, this one should work as-is in the browser:

<html>
  <head>
    <meta charset="utf-8" />
    <script src="https://unpkg.com/pngjs@6.0.0/browser.js"></script>
    <script src="https://unpkg.com/pdf-lib@1.17.1/dist/pdf-lib.js"></script>
    <script src="https://unpkg.com/pako@2.0.4/dist/pako.js"></script>
  </head>
  <body>
    <input type="file" id="ticket" />
    <div id="images"></div>

    <script>
      const fileInput = document.getElementById('ticket');
      const imagesContainer = document.getElementById('images');
      fileInput.addEventListener('change', async (event) => {
        imagesContainer.innerHTML = '';
        const buffer = await event.target.files[0].arrayBuffer();
        await extractPdfImages(buffer);
      });

      const extractPdfImages = async (pdfBytes) => {
        const pdfDoc = await PDFLib.PDFDocument.load(pdfBytes);
        const enumeratedIndirectObjects =
          pdfDoc.context.enumerateIndirectObjects();
        const imagesInDoc = [];
        let objectIdx = 0;
        // Each entry is a [PDFRef, PDFObject] pair; the forEach index is not a PDF reference
        enumeratedIndirectObjects.forEach(([ref, pdfObject]) => {
          objectIdx += 1;

          if (!(pdfObject instanceof PDFLib.PDFRawStream)) return;

          const { dict } = pdfObject;

          const smaskRef = dict.get(PDFLib.PDFName.of('SMask'));
          const colorSpace = dict.get(PDFLib.PDFName.of('ColorSpace'));
          const subtype = dict.get(PDFLib.PDFName.of('Subtype'));
          const width = dict.get(PDFLib.PDFName.of('Width'));
          const height = dict.get(PDFLib.PDFName.of('Height'));
          const name = dict.get(PDFLib.PDFName.of('Name'));
          const bitsPerComponent = dict.get(
            PDFLib.PDFName.of('BitsPerComponent')
          );
          const filter = dict.get(PDFLib.PDFName.of('Filter'));

          if (subtype == PDFLib.PDFName.of('Image')) {
            imagesInDoc.push({
              ref,
              smaskRef,
              colorSpace,
              name: name ? name.key : `Object${objectIdx}`,
              width: width.numberValue,
              height: height.numberValue,
              bitsPerComponent: bitsPerComponent.numberValue,
              data: pdfObject.contents,
              type: filter === PDFLib.PDFName.of('DCTDecode') ? 'jpg' : 'png',
            });
          }
        });

        // Find and mark SMasks as alpha layers
        // Note: doesn't work in all PDFs, I decided to remove it
        // imagesInDoc.forEach((image) => {
        //   if (image.type === 'png' && image.smaskRef) {
        //     const smaskImg = imagesInDoc.find(
        //       ({ ref }) => ref === image.smaskRef
        //     );
        //     smaskImg.isAlphaLayer = true;
        //     image.alphaLayer = smaskImg;
        //   }
        // });

        // Log info about the images we found in the PDF
        console.log(`===== ${imagesInDoc.length} Images found in PDF =====`);
        imagesInDoc.forEach((image) => {
          console.log(
            'Name:',
            image.name,
            '\n  Type:',
            image.type,
            '\n  Color Space:',
            String(image.colorSpace), // image masks have no ColorSpace entry
            '\n  Has Alpha Layer?',
            image.alphaLayer ? true : false,
            // '\n  Is Alpha Layer?',
            // image.isAlphaLayer || false,
            '\n  Width:',
            image.width,
            '\n  Height:',
            image.height,
            '\n  Bits Per Component:',
            image.bitsPerComponent,
            '\n  Data:',
            `Uint8Array(${image.data.length})`,
            '\n  Ref:',
            image.ref.toString()
          );
        });

        const PngColorTypes = {
          Grayscale: 0,
          Rgb: 2,
          GrayscaleAlpha: 4,
          RgbAlpha: 6,
        };
        const ComponentsPerPixelOfColorType = {
          [PngColorTypes.Rgb]: 3,
          [PngColorTypes.Grayscale]: 1,
          [PngColorTypes.RgbAlpha]: 4,
          [PngColorTypes.GrayscaleAlpha]: 2,
        };

        const readBitAtOffsetOfByte = (byte, bitOffset) => {
          const bit = (byte >> bitOffset) & 1;
          return bit;
        };

        const readBitAtOffsetOfArray = (uint8Array, bitOffsetWithinArray) => {
          const byteOffset = Math.floor(bitOffsetWithinArray / 8);
          const byte = uint8Array[uint8Array.length - byteOffset];
          const bitOffsetWithinByte = Math.floor(bitOffsetWithinArray % 8);
          return readBitAtOffsetOfByte(byte, bitOffsetWithinByte);
        };

        const savePng = (image) =>
          new Promise((resolve, reject) => {
            const isGrayscale =
              image.colorSpace === PDFLib.PDFName.of('DeviceGray');
            const colorPixels = pako.inflate(image.data);
            const alphaPixels = image.alphaLayer
              ? pako.inflate(image.alphaLayer.data)
              : undefined;

            const colorType =
              isGrayscale && alphaPixels
                ? PngColorTypes.GrayscaleAlpha
                : !isGrayscale && alphaPixels
                ? PngColorTypes.RgbAlpha
                : isGrayscale
                ? PngColorTypes.Grayscale
                : PngColorTypes.Rgb;

            const colorByteSize = 1;
            const width = image.width * colorByteSize;
            const height = image.height * colorByteSize;
            const inputHasAlpha = [
              PngColorTypes.RgbAlpha,
              PngColorTypes.GrayscaleAlpha,
            ].includes(colorType);

            const pngData = new png.PNG({
              width,
              height,
              colorType,
              inputColorType: colorType,
              inputHasAlpha,
            });

            const componentsPerPixel = ComponentsPerPixelOfColorType[colorType];
            pngData.data = new Uint8Array(width * height * componentsPerPixel);

            let colorPixelIdx = 0;
            let pixelIdx = 0;

            while (pixelIdx < pngData.data.length) {
              if (colorType === PngColorTypes.Rgb) {
                pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
                pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
                pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
              } else if (colorType === PngColorTypes.RgbAlpha) {
                pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
                pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
                pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
                // One alpha sample per RGB pixel (assumes an 8-bit SMask of the same dimensions)
                pngData.data[pixelIdx++] = alphaPixels[colorPixelIdx / 3 - 1];
              } else if (colorType === PngColorTypes.Grayscale) {
                const bit =
                  readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0
                    ? 0x00
                    : 0xff;
                pngData.data[pngData.data.length - pixelIdx++] = bit;
              } else if (colorType === PngColorTypes.GrayscaleAlpha) {
                const bit =
                  readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0
                    ? 0x00
                    : 0xff;
                pngData.data[pngData.data.length - pixelIdx++] = bit;
                pngData.data[pngData.data.length - pixelIdx++] =
                  alphaPixels[colorPixelIdx - 1];
              } else {
                throw new Error(`Unknown colorType=${colorType}`);
              }
            }

            const buffer = [];
            pngData
              .pack()
              .on('data', (data) => buffer.push(...data))
              .on('end', () => resolve(Uint8Array.from(buffer)))
              .on('error', (err) => reject(err));
          });

        for (const image of imagesInDoc) {
          if (!image.isAlphaLayer) {
            const imageData =
              image.type === 'jpg' ? image.data : await savePng(image);
            const imgElement = document.createElement('img');
            imgElement.setAttribute(
              'src',
              URL.createObjectURL(
                new Blob([imageData], { type: `image/${image.type}` })
              )
            );
            imgElement.setAttribute('width', image.width);
            imgElement.setAttribute('height', image.height);

            imagesContainer.appendChild(imgElement);
          }
        }
      };
    </script>
  </body>
</html>

zivni commented 1 year ago

Is it possible to get the x,y position of the images?
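This question is not answered in the thread. In general, an image XObject is painted into a unit square, and the current transformation matrix in effect at the Do operator decides where it lands: for an unrotated image, the matrix [a b c d e f] gives x = e, y = f, width = a and height = d in page units. The sketch below is a simplified illustration, assuming the page's content stream has already been decoded to text by some other means; findImagePlacements and concat are hypothetical names. It uses a naive whitespace tokenizer, so it does not handle string operands, inline images, or content nested inside Form XObjects, and Do also paints non-image XObjects, so names should be checked against the page's resources.

```javascript
// Multiply two PDF matrices given as [a, b, c, d, e, f]; `m` is applied first.
const concat = (m, ctm) => [
  m[0] * ctm[0] + m[1] * ctm[2],
  m[0] * ctm[1] + m[1] * ctm[3],
  m[2] * ctm[0] + m[3] * ctm[2],
  m[2] * ctm[1] + m[3] * ctm[3],
  m[4] * ctm[0] + m[5] * ctm[2] + ctm[4],
  m[4] * ctm[1] + m[5] * ctm[3] + ctm[5],
];

// Track q/Q/cm while scanning and record the matrix at every Do operator.
const findImagePlacements = (contentStreamText) => {
  const tokens = contentStreamText.trim().split(/\s+/);
  const stack = [];
  const operands = [];
  const placements = [];
  let ctm = [1, 0, 0, 1, 0, 0];
  for (const token of tokens) {
    switch (token) {
      case 'q': // save graphics state
        stack.push(ctm);
        break;
      case 'Q': // restore graphics state
        ctm = stack.pop() ?? [1, 0, 0, 1, 0, 0];
        break;
      case 'cm': // concatenate the six preceding numbers onto the CTM
        ctm = concat(operands.slice(-6).map(Number), ctm);
        break;
      case 'Do': {
        const [a, , , d, e, f] = ctm;
        const name = operands[operands.length - 1];
        placements.push({ name, x: e, y: f, width: a, height: d });
        break;
      }
      default: // not a tracked operator: keep it as a potential operand
        operands.push(token);
        continue;
    }
    operands.length = 0; // operands are consumed by the operator
  }
  return placements;
};
```

For example, the fragment "q 200 0 0 100 50 400 cm /Im0 Do Q" would yield one placement for /Im0 at x = 50, y = 400 with a width of 200 and a height of 100.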

K-R-M commented 1 year ago

In the original extract-images project, this image in existing1.pdf (OriginalExtractImage) gets output in triplicate in /images/out21.png (out21). Does anyone know what causes this? I've got the same issue happening when I extract images from a PDF. I have a feeling it's because this code ignores the /Mask entry and the image's Indexed ColorSpace array, which holds the hival (255) and a reference to the palette, stored in another stream or string (in this case object 37 0 R), like in this image dictionary:

<<
/Type /XObject
/Subtype /Image
/Filter /FlateDecode
/Width 567
/Height 234
/BitsPerComponent 8
/Length 8636
/ColorSpace [ /Indexed /DeviceRGB 255 37 0 R ]
/Mask [ 251 251 ]
>>

Any other ideas? An indexed ColorSpace is described in section 7.6.6.2 of the Acrobat SDK.

Follow-up: I ended up working around this by using the Jimp library, instead of PNGJS, to handle the output for any image that uses a separate color palette, and it works fine.
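The triplication is consistent with that guess: in an Indexed color space with 8 bits per component, each pixel is a single palette index, so decoding the data as three-byte RGB pixels packs three source rows into every output row. A minimal sketch of the expansion, assuming the index data and the palette (the lookup object, 37 0 R above) have already been decoded into byte arrays; expandIndexedToRgba is a hypothetical name, and maskRange stands for a color-key /Mask such as [ 251 251 ]:

```javascript
// Expand 8-bit palette indices to RGBA.
// `palette` holds (hival + 1) RGB triplets from the base DeviceRGB space.
// `maskRange` is an optional [min, max] color-key range: samples inside it
// are made fully transparent.
const expandIndexedToRgba = (indices, palette, hival, maskRange) => {
  const rgba = new Uint8Array(indices.length * 4);
  indices.forEach((sample, p) => {
    const index = Math.min(sample, hival); // clamp out-of-range indices
    rgba[p * 4] = palette[index * 3];
    rgba[p * 4 + 1] = palette[index * 3 + 1];
    rgba[p * 4 + 2] = palette[index * 3 + 2];
    const masked =
      Array.isArray(maskRange) &&
      sample >= maskRange[0] &&
      sample <= maskRange[1];
    rgba[p * 4 + 3] = masked ? 0 : 255;
  });
  return rgba;
};
```

The resulting RGBA buffer has width * height * 4 bytes and can be handed to a PNG encoder as an ordinary RGB-with-alpha image. This sketch covers only 8-bit indices over a DeviceRGB base.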

jappoman commented 11 months ago

Hi, I know this is an old thread but I ran into a similar problem. I'm extracting images from a specific page of the pdf to apply additional exif metadata. Next, I put the image buffers back inside the pdf... Except when I extract them again, the exif metadata is completely gone. I'm sure I applied the metadata correctly because if I try to save the image, the metadata is there. The problem therefore arises from re-insertion into the PDF.

This is my code for putting back the image into the pdf:

const replaceImagesInPdf = async (pdfDoc, currentPage, newImages) => {
  console.log(`Replacing images in page ${currentPage}...`);
  console.time("replaceImagesInPdfForPage" + currentPage);

  for (let newImage of newImages) {
    // Cycle through the images of the single page in the PDF
    const imageData = newImage.data;
    const imageRef = newImage.ref;

    const enumeratedIndirectObjects = pdfDoc.context.enumerateIndirectObjects();
    let objectIdx = 0;
    enumeratedIndirectObjects.forEach(async ([pdfRef, pdfObject], ref) => {
      objectIdx += 1;

      if (!(pdfObject instanceof PDFRawStream)) return;

      const { dict } = pdfObject;
      const subtype = dict.get(PDFName.of("Subtype"));

      if (subtype == PDFName.of("Image") && ref == imageRef) {
        pdfObject.contents = imageData;
      }
    });
  }

  console.log("Replaced images into page " + currentPage + ".");
  console.timeEnd("replaceImagesInPdfForPage" + currentPage);

  return pdfDoc;
};

This is how I extract the image from the PDF:

const indexPDFImages = async (pdfDoc) => {
  const enumeratedIndirectObjects = pdfDoc.context.enumerateIndirectObjects();
  const imagesInDoc = [];
  let objectIdx = 0;

  enumeratedIndirectObjects.forEach(async ([pdfRef, pdfObject], ref) => {
    objectIdx += 1;

    if (!(pdfObject instanceof PDFRawStream)) return;

    const { dict } = pdfObject;

    const subtype = dict.get(PDFName.of("Subtype"));
    if (subtype !== PDFName.of("Image")) return; // If it's not an image, return

    const filter = dict.get(PDFName.of("Filter"));
    let imageType = null;

    switch (filter) {
      case PDFName.of("DCTDecode"):
        imageType = "jpg";
        break;
      case PDFName.of("FlateDecode"):
        imageType = "png";
        break;
      case PDFName.of("JPXDecode"):
        imageType = "jpeg2000"; // JPX is typically used for JPEG2000 in PDFs
        break;
      // ... Add more cases for other PDF stream filters as needed (e.g. CCITTFaxDecode, JBIG2Decode)
      default:
        console.log(
          `Unsupported image format detected for ref: ${pdfRef}. Filter used: ${filter}`
        );
        return; // Unsupported filter: skip this image
    }

    // Extract other image information
    const smaskRef = dict.get(PDFName.of("SMask"));
    const colorSpace = dict.get(PDFName.of("ColorSpace"));
    const width = dict.get(PDFName.of("Width"));
    const height = dict.get(PDFName.of("Height"));
    const name = dict.get(PDFName.of("Name"));
    const bitsPerComponent = dict.get(PDFName.of("BitsPerComponent"));

    imagesInDoc.push({
      ref,
      smaskRef,
      colorSpace,
      name: name ? name.key : `Object${objectIdx}`,
      width: width.numberValue,
      height: height.numberValue,
      pxsize: width.numberValue * height.numberValue,
      bitsPerComponent: bitsPerComponent.numberValue,
      data: pdfObject.contents,
      type: imageType,
    });
  });

  return imagesInDoc;
};

In between, this function puts the new metadata into the image buffer:

async function generateImageMetadataWatermark(
  imageBufferObj,
  currentPage,
  watermark,
) {
  console.log(`Generating ImageMetadataWatermark for page ${currentPage}...`);
  console.time("generateImageMetadataWatermarkForPage" + currentPage);
  try {
    // Extracting image data and reference
    const actualImageBuffer = imageBufferObj.image;
    const imageRef = imageBufferObj.ref;

    //Convert the full image buffer to base 64
    const base64Image =
      "data:image/jpeg;base64," + actualImageBuffer.toString("base64");
    const exifObj = piexifjs.load(base64Image);

    // Add watermark string in the EXIF data. Using "0th" ImageDescription.
    exifObj["0th"][piexifjs.ImageIFD.ImageDescription] = watermark;
    // Create new EXIF binary string
    const exifBytes = piexifjs.dump(exifObj);
    // Insert the new EXIF data into the image
    const newImageBase64 = piexifjs.insert(exifBytes, base64Image);
    // Convert base64 image to buffer
    const newImageBuffer = Buffer.from(newImageBase64.split(",")[1], "base64");

    // Returning the modified image
    const modifiedImage = {
      watermarkType: "imageMetadata",
      ref: imageRef,
      data: newImageBuffer,
    };
    console.log(`ImageMetadataWatermark for page ${currentPage} generated.`);
    console.timeEnd("generateImageMetadataWatermarkForPage" + currentPage);
    return modifiedImage;
  } catch (e) {
    throw e;
  }
}

Any solution to this?

mcmspark commented 11 months ago

Exif data is appended to the end of the image file: (1) it is not part of the image, and (2) it makes the file larger.

I am not sure exif tags can be added to embedded images
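One way to narrow this down is to check the JPEG bytes directly, both just before they are written into the PDF and right after they are read back. In JPEG files, Exif metadata is normally carried in an APP1 segment (marker 0xFFE1) whose payload starts with the ASCII identifier "Exif", placed before the compressed scan data. The sketch below is a minimal check under that assumption; hasExifSegment is a hypothetical helper, not part of pdf-lib or piexifjs.

```javascript
// Walk the JPEG header segments and report whether an Exif APP1 segment exists.
const hasExifSegment = (bytes) => {
  if (bytes[0] !== 0xff || bytes[1] !== 0xd8) return false; // no SOI: not a JPEG
  let offset = 2;
  while (offset + 4 <= bytes.length && bytes[offset] === 0xff) {
    const marker = bytes[offset + 1];
    if (marker === 0xda || marker === 0xd9) break; // start of scan / end of image
    // Segment length is big-endian and includes its own two bytes.
    const length = (bytes[offset + 2] << 8) | bytes[offset + 3];
    if (marker === 0xe1) {
      const id = String.fromCharCode(...bytes.slice(offset + 4, offset + 8));
      if (id === 'Exif') return true;
    }
    offset += 2 + length;
  }
  return false;
};
```

If the check passes on the buffer assigned to pdfObject.contents but fails on the bytes read back from the saved PDF, the metadata is being lost somewhere between replacement and extraction; if it passes in both places, the problem is more likely in how the extracted image is inspected afterwards.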


search-acumen commented 7 months ago

@K-R-M Could you elaborate on how you got round this issue as I'm facing the same issue with PNGs. I'm already using the Jimp library for other purposes but can't seem to get around the triple image issue. Thanks

K-R-M commented 7 months ago

@search-acumen, unfortunately, I don't remember exactly how I did it. I got laid off and no longer have access to the source code that handled this correctly.

search-acumen commented 7 months ago

@K-R-M No problem, thanks for replying anyway. Has anyone else managed to solve this issue?

devanshsinghvaluecoders commented 6 months ago

@search-acumen try this package to extract the images https://www.npmjs.com/package/pdf-image-extractor

AhmadrezaHK commented 6 months ago

For those who want to extract only the form field images in the PDF and not all of them, I updated yannbertrand's code in the following way (the returned image is in base64 format):

export async function extractFormImages(pdfDoc, imageFieldNameList) {
  const enumeratedIndirectObjects = pdfDoc.context.enumerateIndirectObjects()
  const imagesInDoc = []
  let objectIdx = 0

  const form = pdfDoc.getForm()
  const imageRefMap = new Map()

  imageFieldNameList.forEach((fName) => {
    const image = form
      .getButton(fName)
      .acroField.getWidgets()[0]
      .getAppearances()?.normal

    const imageRef = [
      ...image.dict
        .get(PDFName.of("Resources"))
        .dict.get(PDFName.of("XObject"))
        .dict.values(),
    ][0]

    imageRefMap.set(imageRef.toString(), fName)
  })

  enumeratedIndirectObjects.forEach(([pdfRef, pdfObject], ref) => {
    objectIdx += 1

    if (!(pdfObject instanceof PDFRawStream)) return

    const { dict } = pdfObject

    const smaskRef = dict.get(PDFName.of("SMask"))
    const colorSpace = dict.get(PDFName.of("ColorSpace"))
    const subtype = dict.get(PDFName.of("Subtype"))
    const width = dict.get(PDFName.of("Width"))
    const height = dict.get(PDFName.of("Height"))
    const name = dict.get(PDFName.of("Name"))
    const bitsPerComponent = dict.get(PDFName.of("BitsPerComponent"))
    const filter = dict.get(PDFName.of("Filter"))

    if (subtype == PDFName.of("Image") && imageRefMap.has(pdfRef.toString())) {
      imagesInDoc.push({
        ref,
        smaskRef,
        colorSpace,
        name: name ? name.key : `Object${objectIdx}`,
        width: width.numberValue,
        height: height.numberValue,
        bitsPerComponent: bitsPerComponent.numberValue,
        data: pdfObject.contents,
        type: filter === PDFName.of("DCTDecode") ? "jpg" : "png",
        fieldName: imageRefMap.get(pdfRef.toString()),
      })
    }
  })

  const PngColorTypes = {
    Grayscale: 0,
    Rgb: 2,
    GrayscaleAlpha: 4,
    RgbAlpha: 6,
  }
  const ComponentsPerPixelOfColorType = {
    [PngColorTypes.Rgb]: 3,
    [PngColorTypes.Grayscale]: 1,
    [PngColorTypes.RgbAlpha]: 4,
    [PngColorTypes.GrayscaleAlpha]: 2,
  }

  const readBitAtOffsetOfByte = (byte, bitOffset) => {
    const bit = (byte >> bitOffset) & 1
    return bit
  }

  const readBitAtOffsetOfArray = (uint8Array, bitOffsetWithinArray) => {
    const byteOffset = Math.floor(bitOffsetWithinArray / 8)
    const byte = uint8Array[uint8Array.length - byteOffset]
    const bitOffsetWithinByte = Math.floor(bitOffsetWithinArray % 8)
    return readBitAtOffsetOfByte(byte, bitOffsetWithinByte)
  }

  const savePng = (image) =>
    new Promise((resolve, reject) => {
      const isGrayscale = image.colorSpace === PDFName.of("DeviceGray")
      const colorPixels = pako.inflate(image.data)
      const alphaPixels = image.alphaLayer
        ? pako.inflate(image.alphaLayer.data)
        : undefined

      const colorType =
        isGrayscale && alphaPixels
          ? PngColorTypes.GrayscaleAlpha
          : !isGrayscale && alphaPixels
          ? PngColorTypes.RgbAlpha
          : isGrayscale
          ? PngColorTypes.Grayscale
          : PngColorTypes.Rgb

      const colorByteSize = 1
      const width = image.width * colorByteSize
      const height = image.height * colorByteSize
      const inputHasAlpha = [
        PngColorTypes.RgbAlpha,
        PngColorTypes.GrayscaleAlpha,
      ].includes(colorType)

      const pngData = new png.PNG({
        width,
        height,
        colorType,
        inputColorType: colorType,
        inputHasAlpha,
      })

      const componentsPerPixel = ComponentsPerPixelOfColorType[colorType]
      pngData.data = new Uint8Array(width * height * componentsPerPixel)

      let colorPixelIdx = 0
      let pixelIdx = 0

      while (pixelIdx < pngData.data.length) {
        if (colorType === PngColorTypes.Rgb) {
          pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
          pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
          pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
        } else if (colorType === PngColorTypes.RgbAlpha) {
          pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
          pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
          pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
          pngData.data[pixelIdx++] = alphaPixels[colorPixelIdx - 1]
        } else if (colorType === PngColorTypes.Grayscale) {
          const bit =
            readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0
              ? 0x00
              : 0xff
          pngData.data[pngData.data.length - pixelIdx++] = bit
        } else if (colorType === PngColorTypes.GrayscaleAlpha) {
          const bit =
            readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0
              ? 0x00
              : 0xff
          pngData.data[pngData.data.length - pixelIdx++] = bit
          pngData.data[pngData.data.length - pixelIdx++] =
            alphaPixels[colorPixelIdx - 1]
        } else {
          throw new Error(`Unknown colorType=${colorType}`)
        }
      }

      const buffer = []
      pngData
        .pack()
        .on("data", (data) => buffer.push(...data))
        .on("end", () => resolve(Uint8Array.from(buffer)))
        .on("error", (err) => reject(err))
    })

  let result = {}
  for (const img of imagesInDoc) {
    if (!img.isAlphaLayer) {
      const imageData = img.type === "jpg" ? img.data : await savePng(img)

      const imageBase64 = await new Promise((resolve, reject) => {
        const reader = new FileReader()
        reader.onloadend = () => resolve(reader.result)
        reader.onerror = reject
        reader.readAsDataURL(
          new Blob([imageData], { type: `image/${img.type}` })
        )
      })
      result[img.fieldName] = imageBase64
    }
  }

  return result
}

The key point here is that I find the PDFRef of each form field's image and use it to recognise the matching PDFObject:

imageFieldNameList.forEach((fName) => {
    const image = form
      .getButton(fName)
      .acroField.getWidgets()[0]
      .getAppearances()?.normal

    const imageRef = [
      ...image.dict
        .get(PDFName.of("Resources"))
        .dict.get(PDFName.of("XObject"))
        .dict.values(),
    ][0]

    imageRefMap.set(imageRef.toString(), fName)
  })

  .
  .
  .

 if (subtype == PDFName.of("Image") && imageRefMap.has(pdfRef.toString())) {

 .
 .
 .
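
The matching step above can be isolated as a small sketch. It assumes each reference stringifies to a stable key such as "20 0 R" (the form shown in the log output at the top of this thread). The plain objects and the `filterByRef` helper are hypothetical stand-ins for illustration, not pdf-lib APIs.

```javascript
// Hypothetical stand-in data: each entry is [refString, object].
// In the real code the key would come from pdfRef.toString().
function filterByRef(entries, refToField) {
  const matches = {};
  for (const [refKey, obj] of entries) {
    // Keep only image objects whose reference was collected from a form field.
    if (obj.subtype === "Image" && refToField.has(refKey)) {
      matches[refToField.get(refKey)] = obj;
    }
  }
  return matches;
}

const refToField = new Map([["20 0 R", "signature"]]);
const entries = [
  ["20 0 R", { subtype: "Image", width: 176 }],
  ["58 0 R", { subtype: "Image", width: 556 }],
  ["7 0 R", { subtype: "Font" }],
];
const found = filterByRef(entries, refToField);
console.log(Object.keys(found)); // only the field-backed image remains
```

Keying the map by the string form keeps the lookup independent of object identity, which is why both sides call toString() in the code above.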

thomaspurk commented 5 months ago

I found this thread very helpful, but unfortunately the code in it did not work for me. I tried many other options for extracting images from PDFs, but none worked. Most options can handle JPG easily but fail on PNG data. Since the technique discussed here at least created PNGs, albeit garbled ones, I decided to debug this solution, which took many hours. I'm sharing what is working for me right now.

The core problem was matching the alpha layer to the raw image layer and indexing it properly. The original code tried to match smaskRef, which is an object, against "ref", which is a number. The solution was to match smaskRef against pdfRef instead. There was also a bug in the original code, called out by hafsa110, where the image's alpha layer was set to the image layer itself instead of the SMask image. Because the alpha layer is a single-band grayscale image rather than a three-band RGB image, once this is corrected the image layer's pixel index can no longer be used to look up pixel data in the alpha layer. To solve this, I created a separate pixel index for the alpha layer. I marked the key changes with comments below.


const fs = require("fs");
const { PDFDocument, PDFRawStream, PDFName } = require("pdf-lib");
const rimraf = require("rimraf");
const { PNG } = require("pngjs");
const pako = require("pako");

async function getImageFromPdf(inPath) {
  const existingPdfBytes = fs.readFileSync(inPath);
  const pdfDoc = await PDFDocument.load(existingPdfBytes);
  const imagesInDoc = [];

  pdfDoc.context
    .enumerateIndirectObjects()
    .forEach(async ([pdfRef, pdfObject], ref) => {
      if (!(pdfObject instanceof PDFRawStream)) {
        return;
      }
      const { dict } = pdfObject;
      const smaskRef = dict.get(PDFName.of("SMask"));
      const colorSpace = dict.get(PDFName.of("ColorSpace"));
      const subtype = dict.get(PDFName.of("Subtype"));
      const width = dict.get(PDFName.of("Width"));
      const height = dict.get(PDFName.of("Height"));
      const name = dict.get(PDFName.of("Name"));
      const bitsPerComponent = dict.get(PDFName.of("BitsPerComponent"));
      const filter = dict.get(PDFName.of("Filter"));

      if (subtype == PDFName.of("Image")) {
        imagesInDoc.push({
          pdfRef, // added, must use pdfRef to locate alpha layers
          ref,
          smaskRef,
          colorSpace,
          name: name ? name.key : `Object${ref}`,
          width: width.numberValue,
          height: height.numberValue,
          bitsPerComponent: bitsPerComponent.numberValue,
          data: pdfObject.contents,
          type: filter === PDFName.of("DCTDecode") ? "jpg" : "png",
        });
      }
    });

  // Log info about the images we found in the PDF
  console.log(`===== ${imagesInDoc.length} Images found in PDF =====`);
  imagesInDoc.forEach((image) => {
    // Find and mark SMasks as alpha layers
    if (image.type === "png" && image.smaskRef) {
      const smaskImg = imagesInDoc.find((sm) => {
        return image.smaskRef == sm.pdfRef; // ref cannot match to smaskRef, must use pdfRef
      });
      if (smaskImg) {
        smaskImg.isAlphaLayer = true;
        //image.alphaLayer = image; // change suggested by hafsa110, but it creates an alpha layer pixel indexing problem (see savePng)
        image.alphaLayer = smaskImg;
      }
    }
  });

  imagesInDoc.forEach((image) => {
    // Log details about each image, including its alpha layer status

    console.log(
      "Name:",
      image.name,
      "\n  Type:",
      image.type,
      "\n  Color Space:",
      image.colorSpace.toString(),
      "\n  Has Alpha Layer?",
      image.alphaLayer ? image.alphaLayer : false,
      "\n  Is Alpha Layer?",
      image.isAlphaLayer, // change, true or undefined
      "\n  SmaskRef:",
      image.smaskRef, // added to debug the smaskRef
      "\n  Width:",
      image.width,
      "\n  Height:",
      image.height,
      "\n  Bits Per Component:",
      image.bitsPerComponent,
      "\n  Data:",
      `Uint8Array(${image.data.length})`,
      "\n  Ref:",
      image.ref.toString()
    );
  });

  // changed to hard code my folder
  rimraf("./pdf2json/test//*.{jpg,png}", async (err) => {
    if (err) console.error(err);
    else {
      for (const img of imagesInDoc) {
        if (!img.isAlphaLayer) {
          const imageData = img.type === "jpg" ? img.data : await savePng(img);
          fs.writeFileSync(`./pdf2json/test/${img.ref}.` + img.type, imageData);
        }
      }
      console.log();
      console.log("Images written to ./pdf2json/test/");
    }
  });

  console.log("done");
}

const PngColorTypes = {
  Grayscale: 0,
  Rgb: 2,
  GrayscaleAlpha: 4,
  RgbAlpha: 6,
};

const ComponentsPerPixelOfColorType = {
  [PngColorTypes.Rgb]: 3,
  [PngColorTypes.Grayscale]: 1,
  [PngColorTypes.RgbAlpha]: 4,
  [PngColorTypes.GrayscaleAlpha]: 2,
};

const readBitAtOffsetOfByte = (byte, bitOffset) => {
  const bit = (byte >> bitOffset) & 1;
  return bit;
};

const readBitAtOffsetOfArray = (uint8Array, bitOffsetWithinArray) => {
  const byteOffset = Math.floor(bitOffsetWithinArray / 8);
  const byte = uint8Array[uint8Array.length - byteOffset];
  const bitOffsetWithinByte = Math.floor(bitOffsetWithinArray % 8);
  return readBitAtOffsetOfByte(byte, bitOffsetWithinByte);
};

const savePng = (image) =>
  new Promise((resolve, reject) => {
    const isGrayscale = image.colorSpace === PDFName.of("DeviceGray");
    const colorPixels = pako.inflate(image.data);
    const alphaPixels = image.alphaLayer
      ? pako.inflate(image.alphaLayer.data)
      : undefined;

    // prettier-ignore
    const colorType =
        isGrayscale  && alphaPixels ? PngColorTypes.GrayscaleAlpha
      : !isGrayscale && alphaPixels ? PngColorTypes.RgbAlpha
      : isGrayscale                 ? PngColorTypes.Grayscale
      : PngColorTypes.Rgb;

    const colorByteSize = 1;
    const width = image.width * colorByteSize;
    const height = image.height * colorByteSize;
    const inputHasAlpha = [
      PngColorTypes.RgbAlpha,
      PngColorTypes.GrayscaleAlpha,
    ].includes(colorType);

    const png = new PNG({
      width,
      height,
      colorType,
      inputColorType: colorType,
      inputHasAlpha,
    });

    const componentsPerPixel = ComponentsPerPixelOfColorType[colorType];
    png.data = new Uint8Array(width * height * componentsPerPixel);

    let colorPixelIdx = 0;
    let alphaPixelIdx = 0; // added: new index tracker for the alpha layer
    let pixelIdx = 0;
    // prettier-ignore
    while (pixelIdx < png.data.length) {
      if (colorType === PngColorTypes.Rgb) {
        png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
        png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
        png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
      } 
      else if (colorType === PngColorTypes.RgbAlpha) {
        png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
        png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
        png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
        //png.data[pixelIdx++] = alphaPixels[colorPixelIdx - 1]; // must reference alpha layer pixel index here
        png.data[pixelIdx++] = alphaPixels[alphaPixelIdx++]; // post-increment already yields the current alpha index, so no "- 1"

      } 
      else if (colorType === PngColorTypes.Grayscale) {
        const bit = readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0 
          ? 0x00 
          : 0xff;
        png.data[png.data.length - (pixelIdx++)] = bit
      } 
      else if (colorType === PngColorTypes.GrayscaleAlpha) {
        const bit =
          readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0
            ? 0x00
            : 0xff;
        png.data[png.data.length - pixelIdx++] = bit;
        //png.data[png.data.length - pixelIdx++] = alphaPixels[colorPixelIdx - 1]; // must reference alpha layer pixel index here
        png.data[png.data.length - pixelIdx++] = alphaPixels[alphaPixelIdx++]; // post-increment already yields the current alpha index, so no "- 1"
      } 
      else {
        throw new Error(`Unknown colorType=${colorType}`);
      }
    }

    const buffer = [];
    png
      .pack()
      .on("data", (data) => buffer.push(...data))
      .on("end", () => resolve(Buffer.from(buffer)))
      .on("error", (err) => reject(err));
  });

const pdfSource = "./documents/1960782.pdf";
getImageFromPdf(pdfSource);
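
The indexing fix described above can be shown in isolation. This is a minimal sketch that assumes 8 bits per component, an RGB color layer (3 bytes per pixel) and a single-channel alpha layer (1 byte per pixel). `interleaveRgba` is a hypothetical helper name, not part of pdf-lib or pngjs.

```javascript
// Merge an RGB buffer (3 bytes/pixel) with an alpha buffer (1 byte/pixel)
// into one RGBA buffer (4 bytes/pixel), using a separate index per layer.
function interleaveRgba(colorPixels, alphaPixels) {
  const pixelCount = alphaPixels.length;
  if (colorPixels.length !== pixelCount * 3) {
    throw new Error("Color and alpha layers describe different pixel counts");
  }
  const out = new Uint8Array(pixelCount * 4);
  let colorIdx = 0; // advances 3 bytes per pixel
  let alphaIdx = 0; // advances 1 byte per pixel
  let outIdx = 0;
  while (outIdx < out.length) {
    out[outIdx++] = colorPixels[colorIdx++]; // R
    out[outIdx++] = colorPixels[colorIdx++]; // G
    out[outIdx++] = colorPixels[colorIdx++]; // B
    out[outIdx++] = alphaPixels[alphaIdx++]; // A
  }
  return out;
}

const rgb = Uint8Array.from([255, 0, 0, 0, 255, 0]); // two pixels: red, green
const alpha = Uint8Array.from([128, 64]);
console.log(Array.from(interleaveRgba(rgb, alpha)));
// → [255, 0, 0, 128, 0, 255, 0, 64]
```

Reusing the color index for the alpha lookup would read alpha bytes three positions apart, which is one way the output ends up garbled.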

devanshsingh7727 commented 5 months ago

@thomaspurk try this package to get PNG and JPEG images from a PDF file: https://www.npmjs.com/package/pdf-image-extractor

For an implementation, check out the codesandbox.

thomaspurk commented 5 months ago

@thomaspurk try this package to get PNG and JPEG images from a PDF file: https://www.npmjs.com/package/pdf-image-extractor

For an implementation, check out the codesandbox.

I did try pdf-image-extractor, among several others. This module was able to handle the JPEGs in my PDFs but threw an error on the PNGs.

I just verified this again using the code sandbox link you provided, and saw the same behavior as in my own test: it gets the JPGs but not the PNGs.
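
When an extracted file will not open, one quick check is whether its bytes start with the expected file signature. The sketch below assumes the standard PNG and JPEG signatures; `sniffImageType` and `startsWith` are hypothetical helper names, not part of any library mentioned in this thread.

```javascript
// Standard file signatures (magic bytes) for the two formats discussed here.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
const JPEG_SIGNATURE = [0xff, 0xd8, 0xff];

const startsWith = (bytes, signature) =>
  bytes.length >= signature.length &&
  signature.every((value, i) => bytes[i] === value);

// Report which container format the bytes appear to be, based on the header only.
function sniffImageType(bytes) {
  if (startsWith(bytes, PNG_SIGNATURE)) return "png";
  if (startsWith(bytes, JPEG_SIGNATURE)) return "jpg";
  return "unknown";
}

console.log(sniffImageType(Uint8Array.from([0xff, 0xd8, 0xff, 0xe0])));
console.log(sniffImageType(Uint8Array.from([0x78, 0x9c, 0x01, 0x02])));
```

Bytes that report "unknown" are not a complete PNG or JPEG file, which would be consistent with stream contents that still need to be decoded and repacked, as savePng does above.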

Alexufo commented 5 months ago

@thomaspurk have you solved it?

thomaspurk commented 5 months ago

@thomaspurk have you solved it?

Absolutely. The code I posted above is working well for me!

hanifanggawi commented 5 months ago

@thomaspurk have you solved it?

Absolutely. The code I posted above is working well for me!

@thomaspurk what version of pdf-lib is your code using? I am running this on Node.js 20 with pdf-lib version 0.6.1, and I get a TypeError: PDFDocument.load is not a function. The original code that Hopding wrote does not have this issue, since it uses PDFDocumentFactory to load the PDF.

thomaspurk commented 5 months ago

@thomaspurk have you solved it?

Absolutely. The code I posted above is working well for me!

@thomaspurk what version of pdf-lib is your code using? I am running this on Node.js 20 with pdf-lib version 0.6.1, and I get a TypeError: PDFDocument.load is not a function. The original code that Hopding wrote does not have this issue, since it uses PDFDocumentFactory to load the PDF.

Node v20.11.0, pdf-lib 1.17.1

As I recall, Hopding's original code (posted elsewhere, not in this issue) did not work for me. I can only assume the module's class names were refactored somewhere between versions 0.6.1 and 1.17.1. The current documentation uses PDFDocument.load. See the examples here: https://pdf-lib.js.org/
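
To see which loader an installed version exposes, a small feature check can help. This is only a sketch based on the two entry points reported in this thread (PDFDocument.load with 1.17.1, PDFDocumentFactory.load with 0.6.1); `pickLoader` is a hypothetical helper, and `lib` stands for whatever require("pdf-lib") returns in your project.

```javascript
// Return a function that loads a PDF using whichever entry point exists on `lib`.
function pickLoader(lib) {
  if (typeof lib.PDFDocument?.load === "function") {
    // Entry point reported to work with pdf-lib 1.17.1 in this thread.
    return (bytes) => lib.PDFDocument.load(bytes);
  }
  if (typeof lib.PDFDocumentFactory?.load === "function") {
    // Entry point reported to work with pdf-lib 0.6.1 in this thread.
    return (bytes) => lib.PDFDocumentFactory.load(bytes);
  }
  throw new Error("No known PDF loader found on this pdf-lib version");
}

// Usage with stand-in objects (no real pdf-lib needed to try the check):
const fakeNew = { PDFDocument: { load: () => "loaded via PDFDocument" } };
const fakeOld = {
  PDFDocument: {}, // present, but without a static load()
  PDFDocumentFactory: { load: () => "loaded via PDFDocumentFactory" },
};
console.log(pickLoader(fakeNew)(new Uint8Array()));
console.log(pickLoader(fakeOld)(new Uint8Array()));
```

Picking a loader does not make the rest of the extraction script portable across versions, since it also relies on internals such as pdfDoc.context.enumerateIndirectObjects(). Upgrading to the version the code was written against is likely the simpler route.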