Closed totorelmatador closed 5 years ago
Hello @totorelmatador!
There are a couple of ways to go about this. Some more challenging than others. I wrote a Node script that "scans" an existing PDF and finds all the images it contains, and redraws them all on a new page:
You just need to unzip the file and run yarn install (or npm install) and then run node index.js. The script will write the new PDF to modified.pdf. Here's what modified.pdf looks like:
The script will also log some information about each image in the document, e.g.
Images in PDF:
Name: JfImage0001
Width: 176
Height: 157
Bits Per Component: 1
Data: Uint8Array(1778)
Ref: 20 0 R
...
Name: JfImage0036
Width: 556
Height: 271
Bits Per Component: 8
Data: Uint8Array(461)
Ref: 58 0 R
Let me know if this is what you're looking for, or if you have any questions!
Hello @Hopding !
Thank you so much for your answer! This is exactly the kind of thing I was trying to do! But I still have a question. My final objective is to save these images as separate files. I tried to do so with the following code (added to your file index.js):
var i = 0;
imagesInDoc.forEach(image => {
fs.writeFile("./images/out"+i+".png", image.data, 'base64', function(err) {
console.log(err);
});
i+=1;
});
But it doesn't work well: most of the saved images can't be opened. The funny thing is that the code works for some images. When I try on this document:
Only one of the two images is saved in a form that can be opened. I think the cause is the transparency of the image, but I would like to know if it is possible to work around this issue...
Thank you again for your time !
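One likely reason only some files open: later comments in this thread write JPEG streams (filter DCTDecode) to disk unchanged, while other image streams hold compressed raw pixel data that is not a PNG file on its own, whatever the file extension. A minimal sketch, using a hypothetical `sniffImageFormat` helper that is not part of pdf-lib, to check what the bytes actually contain before writing them out:

```javascript
// Hypothetical helper (not part of pdf-lib): inspect the first bytes of an
// image stream to see whether it is already a complete image file.
const sniffImageFormat = (bytes) => {
  const startsWith = (signature) => signature.every((b, i) => bytes[i] === b);
  if (startsWith([0xff, 0xd8, 0xff])) return 'jpg'; // JPEG start-of-image marker
  if (startsWith([0x89, 0x50, 0x4e, 0x47])) return 'png'; // "\x89PNG" signature
  return 'raw'; // anything else: not directly viewable as a file
};

// Illustrative inputs (assumed, not taken from the PDFs in this thread).
console.log(sniffImageFormat(Uint8Array.from([0xff, 0xd8, 0xff, 0xe0]))); // jpg
console.log(sniffImageFormat(Uint8Array.from([0x78, 0x9c, 0x01, 0x02]))); // raw
```

Note also that the 'base64' argument to fs.writeFile should have no effect here, since Node ignores the encoding option when the data is a buffer rather than a string.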
@totorelmatador Sorry for taking so long to respond to this. I've been swamped with work and school lately, so I haven't had a lot of time to devote to this. However, I've made some progress on creating an example script that shows how to do this (though there are some limitations). I'll try to post a more detailed response soon.
Thank you a lot for your time @Hopding !
I have observed something. When we add a png image in a pdf file, we use the following function:
[imgRef, imgDims] = pdfDoc.embedPNG(PNGimage)
The type of imgRef is PDFIndirectReference, the type of imgDims is PNGXObjectFactory, and PNGimage is an image buffer.
When we find all the image objects in the PDF we use the following code:
pdfDoc.index.index.forEach((pdfObject, ref) => {
objectIdx += 1;
if (!(pdfObject instanceof PDFRawStream)) return;
const { lookupMaybe } = pdfDoc.index;
const { dictionary: dict } = pdfObject;
const subtype = lookupMaybe(dict.getMaybe('Subtype'));
const width = lookupMaybe(dict.getMaybe('Width'));
const height = lookupMaybe(dict.getMaybe('Height'));
const name = lookupMaybe(dict.getMaybe('Name'));
const bitsPerComponent = lookupMaybe(dict.getMaybe('BitsPerComponent'));
if (subtype === PDFName.from('Image')) {
imagesInDoc.push({
ref,
name: name ? name.key : `Object${objectIdx}`,
width: width.number,
height: height.number,
bitsPerComponent: bitsPerComponent.number,
data: pdfObject.content,
});
}
});
where every pdfObject is a PDFRawStream object and ref is a PDFIndirectReference. Is it possible to extract the image buffer associated with the pair pdfObject and ref?
Hello @totorelmatador! I've finally gotten some time to finish up my investigation into this.
First off, it is possible to extract all image types from a PDF using pdf-lib. The question is, how much code will you have to write on top of pdf-lib to do this. It turns out, you'll have to write a fair amount of code if you want to handle all possible images in any type of PDF file.
pdf.js is an open source PDF rendering engine maintained by Mozilla. It's all written in JavaScript. So, of course, this library must be able to extract and render all types of images. This makes it a very good reference to see how this might be done using pdf-lib.
In particular, its PDFImage class is worth looking at. All of this logic would need to be ported over to use pdf-lib in order to handle all possible types of images. This is because the embedded image format outlined in the PDF specification is pretty long and complicated (as are many things in PDF files).
All of that being said, I created a script that extracts the more common image formats from PDF files. Here it is:
You just need to unzip the file and run yarn install (or npm install) and then run node index.js existing1.pdf or node index.js existing2.pdf. The script will extract as many embedded images as it can from the PDF into the images/ directory.
Again, this does not extract all possible types of images, just the more common formats. It could certainly be improved by porting some code from pdf.js.
I did a bit of googling to see if pdf.js has an API to extract images from PDFs. It looks like this may be possible for certain types of images: https://github.com/mozilla/pdf.js/issues/7813 and https://github.com/mozilla/pdf.js/issues/7043. But full support doesn't yet seem available.
I think that adding proper support for image extraction would be an interesting feature to implement in pdf-lib. I imagine it would be quite useful to many developers. However, unless somebody from the community decides to work on this, there are several other things I have to work on first. So it'll be a while before this feature lands in pdf-lib.
It does work perfectly, thank you a lot !
thanks a lot, this helps me 29/07/2019 :)
Thank you so much! Exactly what I needed and worked perfectly. Aces!
This does not work on scanned PDFs. It results in an "unknown compression method" error.
The page images use filter = JBIG2Decode.
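The error suggests that a stream which is not Flate-compressed was handed to an inflate routine. A minimal sketch of a guard that skips filters the script has no decoder for; `planExtraction` is a hypothetical helper, and plain strings with a leading slash stand in for the PDF filter names (an assumption about how the names print):

```javascript
// Hypothetical helper: decide how to handle an image stream from its filter
// name. Filter names are passed as plain strings such as '/FlateDecode'.
const planExtraction = (filterName) => {
  switch (filterName) {
    case '/DCTDecode':
      return { supported: true, action: 'write bytes as .jpg' };
    case '/FlateDecode':
      return { supported: true, action: 'inflate, then encode pixels as .png' };
    default:
      // e.g. /JBIG2Decode, /CCITTFaxDecode, /JPXDecode need their own decoders
      return { supported: false, action: `skip (no decoder for ${filterName})` };
  }
};

for (const filter of ['/DCTDecode', '/FlateDecode', '/JBIG2Decode']) {
  console.log(filter, '->', planExtraction(filter).action);
}
```

With a check like this, scanned pages would be reported as unsupported instead of failing inside the decompression step.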
@Hopding thanks for taking the time to post such a useful reply and provide the script. I was wondering if there's an easy way to get the x/y position of the image as well as the width/height?
My PDF contains images and tables. I need to remove the images from all pages of the PDF, keep the tables as they are, and save the result as a new document. Is this possible?
Hi, I tried the solution from @Hopding's comment above (the extract-images script).
When saving the PNG files, I noticed that the alphaLayer being used is the image itself and not the actual alpha layer (the SMask image) that we look up.
So I changed it and added image.alphaLayer = smaskimg;
The problem now is that the image doesn't load completely, as if the SMask and the image itself had different dimensions. Has anyone encountered this error before?
Thanks :)
ps : the full image without smask
the full image when adding smaskimg
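One possible explanation, sketched under the assumption that the color data is 8-bit RGB (three bytes per pixel) and the SMask is 8-bit grayscale (one byte per pixel): reading the alpha bytes with the color byte offset runs past the end of the mask after roughly a third of the image. A hypothetical `mergeRgbAndAlpha` helper that keeps a separate index for each layer:

```javascript
// Hypothetical helper: interleave RGB pixels with a one-byte-per-pixel alpha
// mask. Assumes both layers are 8 bits per component and the same dimensions.
const mergeRgbAndAlpha = (rgb, alpha) => {
  const pixelCount = alpha.length;
  if (rgb.length !== pixelCount * 3) {
    throw new Error(`Expected ${pixelCount * 3} RGB bytes, got ${rgb.length}`);
  }
  const rgba = new Uint8Array(pixelCount * 4);
  let colorIdx = 0; // advances 3 bytes per pixel
  let alphaIdx = 0; // advances 1 byte per pixel
  let outIdx = 0;
  while (alphaIdx < pixelCount) {
    rgba[outIdx++] = rgb[colorIdx++];
    rgba[outIdx++] = rgb[colorIdx++];
    rgba[outIdx++] = rgb[colorIdx++];
    rgba[outIdx++] = alpha[alphaIdx++];
  }
  return rgba;
};

// Two pixels: red at half opacity, blue fully opaque.
const rgba = mergeRgbAndAlpha(
  Uint8Array.from([255, 0, 0, 0, 0, 255]),
  Uint8Array.from([128, 255])
);
console.log(Array.from(rgba)); // [255, 0, 0, 128, 0, 0, 255, 255]
```

The length check makes a genuine size mismatch between the two layers fail loudly instead of producing a partly transparent image.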
Hi @hafsa110, I solved this problem by simply removing the "- 1" in the savePng function.
Just a small refresh of the proposed code in another context, this one should work as-is in the browser:
<html>
<head>
<meta charset="utf-8" />
<script src="https://unpkg.com/pngjs@6.0.0/browser.js"></script>
<script src="https://unpkg.com/pdf-lib@1.17.1/dist/pdf-lib.js"></script>
<script src="https://unpkg.com/pako@2.0.4/dist/pako.js"></script>
</head>
<body>
<input type="file" id="ticket" />
<div id="images"></div>
<script>
const fileInput = document.getElementById('ticket');
const imagesContainer = document.getElementById('images');
fileInput.addEventListener('change', async (event) => {
imagesContainer.innerHTML = '';
const buffer = await event.target.files[0].arrayBuffer();
await extractPdfImages(buffer);
});
const extractPdfImages = async (pdfBytes) => {
const pdfDoc = await PDFLib.PDFDocument.load(pdfBytes);
const enumeratedIndirectObjects =
pdfDoc.context.enumerateIndirectObjects();
const imagesInDoc = [];
let objectIdx = 0;
enumeratedIndirectObjects.forEach(([pdfRef, pdfObject], ref) => {
objectIdx += 1;
if (!(pdfObject instanceof PDFLib.PDFRawStream)) return;
const { dict } = pdfObject;
const smaskRef = dict.get(PDFLib.PDFName.of('SMask'));
const colorSpace = dict.get(PDFLib.PDFName.of('ColorSpace'));
const subtype = dict.get(PDFLib.PDFName.of('Subtype'));
const width = dict.get(PDFLib.PDFName.of('Width'));
const height = dict.get(PDFLib.PDFName.of('Height'));
const name = dict.get(PDFLib.PDFName.of('Name'));
const bitsPerComponent = dict.get(
PDFLib.PDFName.of('BitsPerComponent')
);
const filter = dict.get(PDFLib.PDFName.of('Filter'));
if (subtype == PDFLib.PDFName.of('Image')) {
imagesInDoc.push({
ref,
smaskRef,
colorSpace,
name: name ? name.key : `Object${objectIdx}`,
width: width.numberValue,
height: height.numberValue,
bitsPerComponent: bitsPerComponent.numberValue,
data: pdfObject.contents,
type: filter === PDFLib.PDFName.of('DCTDecode') ? 'jpg' : 'png',
});
}
});
// Find and mark SMasks as alpha layers
// Note: doesn't work in all PDFs, I decided to remove it
// imagesInDoc.forEach((image) => {
// if (image.type === 'png' && image.smaskRef) {
// const smaskImg = imagesInDoc.find(
// ({ ref }) => ref === image.smaskRef
// );
// smaskImg.isAlphaLayer = true;
// image.alphaLayer = image;
// }
// });
// Log info about the images we found in the PDF
console.log(`===== ${imagesInDoc.length} Images found in PDF =====`);
imagesInDoc.forEach((image) => {
console.log(
'Name:',
image.name,
'\n Type:',
image.type,
'\n Color Space:',
image.colorSpace.toString(),
'\n Has Alpha Layer?',
image.alphaLayer ? true : false,
// '\n Is Alpha Layer?',
// image.isAlphaLayer || false,
'\n Width:',
image.width,
'\n Height:',
image.height,
'\n Bits Per Component:',
image.bitsPerComponent,
'\n Data:',
`Uint8Array(${image.data.length})`,
'\n Ref:',
image.ref.toString()
);
});
const PngColorTypes = {
Grayscale: 0,
Rgb: 2,
GrayscaleAlpha: 4,
RgbAlpha: 6,
};
const ComponentsPerPixelOfColorType = {
[PngColorTypes.Rgb]: 3,
[PngColorTypes.Grayscale]: 1,
[PngColorTypes.RgbAlpha]: 4,
[PngColorTypes.GrayscaleAlpha]: 2,
};
const readBitAtOffsetOfByte = (byte, bitOffset) => {
const bit = (byte >> bitOffset) & 1;
return bit;
};
const readBitAtOffsetOfArray = (uint8Array, bitOffsetWithinArray) => {
const byteOffset = Math.floor(bitOffsetWithinArray / 8);
const byte = uint8Array[uint8Array.length - byteOffset];
const bitOffsetWithinByte = Math.floor(bitOffsetWithinArray % 8);
return readBitAtOffsetOfByte(byte, bitOffsetWithinByte);
};
const savePng = (image) =>
new Promise((resolve, reject) => {
const isGrayscale =
image.colorSpace === PDFLib.PDFName.of('DeviceGray');
const colorPixels = pako.inflate(image.data);
const alphaPixels = image.alphaLayer
? pako.inflate(image.alphaLayer.data)
: undefined;
const colorType =
isGrayscale && alphaPixels
? PngColorTypes.GrayscaleAlpha
: !isGrayscale && alphaPixels
? PngColorTypes.RgbAlpha
: isGrayscale
? PngColorTypes.Grayscale
: PngColorTypes.Rgb;
const colorByteSize = 1;
const width = image.width * colorByteSize;
const height = image.height * colorByteSize;
const inputHasAlpha = [
PngColorTypes.RgbAlpha,
PngColorTypes.GrayscaleAlpha,
].includes(colorType);
const pngData = new png.PNG({
width,
height,
colorType,
inputColorType: colorType,
inputHasAlpha,
});
const componentsPerPixel = ComponentsPerPixelOfColorType[colorType];
pngData.data = new Uint8Array(width * height * componentsPerPixel);
let colorPixelIdx = 0;
let pixelIdx = 0;
while (pixelIdx < pngData.data.length) {
if (colorType === PngColorTypes.Rgb) {
pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
} else if (colorType === PngColorTypes.RgbAlpha) {
pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++];
pngData.data[pixelIdx++] = alphaPixels[colorPixelIdx - 1];
} else if (colorType === PngColorTypes.Grayscale) {
const bit =
readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0
? 0x00
: 0xff;
pngData.data[pngData.data.length - pixelIdx++] = bit;
} else if (colorType === PngColorTypes.GrayscaleAlpha) {
const bit =
readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0
? 0x00
: 0xff;
pngData.data[pngData.data.length - pixelIdx++] = bit;
pngData.data[pngData.data.length - pixelIdx++] =
alphaPixels[colorPixelIdx - 1];
} else {
throw new Error(`Unknown colorType=${colorType}`);
}
}
const buffer = [];
pngData
.pack()
.on('data', (data) => buffer.push(...data))
.on('end', () => resolve(Uint8Array.from(buffer)))
.on('error', (err) => reject(err));
});
for (const image of imagesInDoc) {
if (!image.isAlphaLayer) {
const imageData =
image.type === 'jpg' ? image.data : await savePng(image);
const imgElement = document.createElement('img');
imgElement.setAttribute(
'src',
URL.createObjectURL(
new Blob([imageData], { type: `image/${image.type}` })
)
);
imgElement.setAttribute('width', image.width);
imgElement.setAttribute('height', image.height);
imagesContainer.appendChild(imgElement);
}
}
};
</script>
</body>
</html>
Is it possible to get the x,y position of the images?
In the original extract-image project, this image in existing1.pdf:
gets output in triplicate in /images/out21.png, like so:
Does anyone know what causes this? I've got the same issue happening when I extract images from a PDF. I have a feeling it's because this code ignores the /Mask entry and the array in the image's ColorSpace that points to the hival (255) and another stream or array (in this case identified as object 37 0 R), like in this image dictionary:
<<
/Type /XObject
/Subtype /Image
/Filter /FlateDecode
/Width 567
/Height 234
/BitsPerComponent 8
/Length 8636
/ColorSpace [ /Indexed /DeviceRGB 255 37 0 R ]
/Mask [ 251 251 ]
>>
Any other ideas? An indexed ColorSpace is described in section 7.6.6.2 of the Acrobat SDK.
Follow-up: I ended up working around this by using the Jimp library, instead of PNGJS, to handle the output of any image that uses a separate color palette, and it works fine.
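For context, an /Indexed color space stores one palette index per pixel, and the lookup table (object 37 0 R in the dictionary above) maps each index to a color in the base space. A minimal sketch of that expansion, assuming 8 bits per component, a /DeviceRGB base, and a /Mask range applied to the index values; `expandIndexed` is a hypothetical helper, not part of pdf-lib, PNGJS, or Jimp:

```javascript
// Hypothetical helper: expand palette indices into RGBA pixels.
// `palette` holds 3 bytes (R, G, B) per entry; `maskRange` is the optional
// [min, max] pair from /Mask, marking indices to render as transparent.
const expandIndexed = (indices, palette, maskRange) => {
  const rgba = new Uint8Array(indices.length * 4);
  indices.forEach((index, pixel) => {
    const masked =
      maskRange !== undefined && index >= maskRange[0] && index <= maskRange[1];
    rgba[pixel * 4] = palette[index * 3];
    rgba[pixel * 4 + 1] = palette[index * 3 + 1];
    rgba[pixel * 4 + 2] = palette[index * 3 + 2];
    rgba[pixel * 4 + 3] = masked ? 0 : 255;
  });
  return rgba;
};

// Illustrative two-entry palette: index 0 is red, index 1 is green.
const palette = Uint8Array.from([255, 0, 0, 0, 255, 0]);
const pixels = expandIndexed(Uint8Array.from([1, 0, 1]), palette, [0, 0]);
console.log(Array.from(pixels));
// [0, 255, 0, 255, 255, 0, 0, 0, 0, 255, 0, 255]
```

If the index bytes are instead fed to the encoder as though they were RGB triples, each output row would consume three source rows, which might explain the image appearing three times.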
Hi, I know this is an old thread but I ran into a similar problem. I'm extracting images from a specific page of the pdf to apply additional exif metadata. Next, I put the image buffers back inside the pdf... Except when I extract them again, the exif metadata is completely gone. I'm sure I applied the metadata correctly because if I try to save the image, the metadata is there. The problem therefore arises from re-insertion into the PDF.
This is my code for putting back the image into the pdf:
const replaceImagesInPdf = async (pdfDoc, currentPage, newImages) => {
console.log(`Replacing images in page ${currentPage}...`);
console.time("replaceImagesInPdfForPage" + currentPage);
for (let newImage of newImages) {
// Cycling through the images of the only page in the PDF
const imageData = newImage.data;
const imageRef = newImage.ref;
const enumeratedIndirectObjects = pdfDoc.context.enumerateIndirectObjects();
let objectIdx = 0;
enumeratedIndirectObjects.forEach(async ([pdfRef, pdfObject], ref) => {
objectIdx += 1;
if (!(pdfObject instanceof PDFRawStream)) return;
const { dict } = pdfObject;
const subtype = dict.get(PDFName.of("Subtype"));
if (subtype == PDFName.of("Image") && ref == imageRef) {
pdfObject.contents = imageData;
}
});
}
console.log("Replaced images into page " + currentPage + ".");
console.timeEnd("replaceImagesInPdfForPage" + currentPage);
return pdfDoc;
};
This is how I extract the images from the PDF:
const indexPDFImages = async (pdfDoc) => {
const enumeratedIndirectObjects = pdfDoc.context.enumerateIndirectObjects();
const imagesInDoc = [];
let objectIdx = 0;
enumeratedIndirectObjects.forEach(async ([pdfRef, pdfObject], ref) => {
objectIdx += 1;
if (!(pdfObject instanceof PDFRawStream)) return;
const { dict } = pdfObject;
const subtype = dict.get(PDFName.of("Subtype"));
if (subtype !== PDFName.of("Image")) return; // If it's not an image, return
const filter = dict.get(PDFName.of("Filter"));
let imageType = null;
switch (filter) {
case PDFName.of("DCTDecode"):
imageType = "jpg";
break;
case PDFName.of("FlateDecode"):
imageType = "png";
break;
case PDFName.of("JPXDecode"):
imageType = "jpeg2000"; // JPX is typically used for JPEG2000 in PDFs
break;
// ... Add more cases for other PDF stream filters, e.g. CCITTFaxDecode, JBIG2Decode, LZWDecode or RunLengthDecode.
default:
console.log(
`Unsupported image format detected for ref: ${pdfRef}. Filter used: ${filter}`
);
return; // Unsupported filter: skip this image
}
// Extract other image information
const smaskRef = dict.get(PDFName.of("SMask"));
const colorSpace = dict.get(PDFName.of("ColorSpace"));
const width = dict.get(PDFName.of("Width"));
const height = dict.get(PDFName.of("Height"));
const name = dict.get(PDFName.of("Name"));
const bitsPerComponent = dict.get(PDFName.of("BitsPerComponent"));
imagesInDoc.push({
ref,
smaskRef,
colorSpace,
name: name ? name.key : `Object${objectIdx}`,
width: width.numberValue,
height: height.numberValue,
pxsize: width.numberValue * height.numberValue,
bitsPerComponent: bitsPerComponent.numberValue,
data: pdfObject.contents,
type: imageType,
});
});
return imagesInDoc;
};
In between, there is the function that puts the new metadata into the image buffer:
async function generateImageMetadataWatermark(
imageBufferObj,
currentPage,
watermark,
) {
console.log(`Generating ImageMetadataWatermark for page ${currentPage}...`);
console.time("generateImageMetadataWatermarkForPage" + currentPage);
try {
// Extracting image data and reference
const actualImageBuffer = imageBufferObj.image;
const imageRef = imageBufferObj.ref;
//Convert the full image buffer to base 64
const base64Image =
"data:image/jpeg;base64," + actualImageBuffer.toString("base64");
const exifObj = piexifjs.load(base64Image);
// Add watermark string in the EXIF data. Using "0th" ImageDescription.
exifObj["0th"][piexifjs.ImageIFD.ImageDescription] = watermark;
// Create new EXIF binary string
const exifBytes = piexifjs.dump(exifObj);
// Insert the new EXIF data into the image
const newImageBase64 = piexifjs.insert(exifBytes, base64Image);
// Convert base64 image to buffer
const newImageBuffer = Buffer.from(newImageBase64.split(",")[1], "base64");
// Returning the modified image
const modifiedImage = {
watermarkType: "imageMetadata",
ref: imageRef,
data: newImageBuffer,
};
console.log(`ImageMetadataWatermark for page ${currentPage} generated.`);
console.timeEnd("generateImageMetadataWatermarkForPage" + currentPage);
return modifiedImage;
} catch (e) {
throw e;
}
}
Any solution to this?
Exif data is stored in its own metadata segment of the image file: (1) it is not part of the image data itself, and (2) it makes the file larger.
I am not sure Exif tags can be added to embedded images.
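To narrow down where the metadata disappears, it may help to test the JPEG bytes at each step (after piexifjs.insert, before assigning them to the stream, and after re-extraction). A minimal sketch with a hypothetical `hasExifSegment` helper that walks the JPEG segments and looks for an APP1 segment tagged `Exif`:

```javascript
// Hypothetical diagnostic helper: report whether a JPEG buffer contains an
// APP1 segment whose payload starts with the "Exif" identifier.
const hasExifSegment = (bytes) => {
  if (bytes[0] !== 0xff || bytes[1] !== 0xd8) return false; // not a JPEG
  let offset = 2;
  while (offset + 4 <= bytes.length && bytes[offset] === 0xff) {
    const marker = bytes[offset + 1];
    if (marker === 0xda) break; // start of scan: no more metadata segments
    // Segment length is big-endian and includes the two length bytes.
    const length = (bytes[offset + 2] << 8) | bytes[offset + 3];
    if (marker === 0xe1) {
      const id = String.fromCharCode(...bytes.slice(offset + 4, offset + 8));
      if (id === 'Exif') return true;
    }
    offset += 2 + length;
  }
  return false;
};

// Illustrative buffers (assumed): an APP0 segment, then optionally APP1/Exif.
const withExif = Uint8Array.from([
  0xff, 0xd8, 0xff, 0xe0, 0x00, 0x04, 0x4a, 0x46,
  0xff, 0xe1, 0x00, 0x08, 0x45, 0x78, 0x69, 0x66, 0x00, 0x00,
  0xff, 0xda, 0x00, 0x02,
]);
const withoutExif = Uint8Array.from([
  0xff, 0xd8, 0xff, 0xe0, 0x00, 0x04, 0x4a, 0x46, 0xff, 0xda, 0x00, 0x02,
]);
console.log(hasExifSegment(withExif), hasExifSegment(withoutExif)); // true false
```

If the check passes on the buffer that is written into the PDF but fails on the re-extracted bytes, the loss happens during saving or extraction rather than in the watermarking step.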
@K-R-M Could you elaborate on how you got round this issue as I'm facing the same issue with PNGs. I'm already using the Jimp library for other purposes but can't seem to get around the triple image issue. Thanks
@search-acumen, unfortunately, I don't remember exactly how I did it. I got laid off and no longer have access to the source code that handled this correctly.
@K-R-M No problem, thanks for replying anyway. Has anyone else managed to solve this issue?
@search-acumen try this package to extract the images https://www.npmjs.com/package/pdf-image-extractor
For those who want to extract only the form field images in the PDF and not all of them, I updated yannbertrand's code in the following way (the returned images are in base64 format):
export async function extractFormImages(pdfDoc, imageFieldNameList) {
const enumeratedIndirectObjects = pdfDoc.context.enumerateIndirectObjects()
const imagesInDoc = []
let objectIdx = 0
const form = pdfDoc.getForm()
const imageRefMap = new Map()
imageFieldNameList.forEach((fName) => {
const image = form
.getButton(fName)
.acroField.getWidgets()[0]
.getAppearances()?.normal
const imageRef = [
...image.dict
.get(PDFName.of("Resources"))
.dict.get(PDFName.of("XObject"))
.dict.values(),
][0]
imageRefMap.set(imageRef.toString(), fName)
})
enumeratedIndirectObjects.forEach(([pdfRef, pdfObject], ref) => {
objectIdx += 1
if (!(pdfObject instanceof PDFRawStream)) return
const { dict } = pdfObject
const smaskRef = dict.get(PDFName.of("SMask"))
const colorSpace = dict.get(PDFName.of("ColorSpace"))
const subtype = dict.get(PDFName.of("Subtype"))
const width = dict.get(PDFName.of("Width"))
const height = dict.get(PDFName.of("Height"))
const name = dict.get(PDFName.of("Name"))
const bitsPerComponent = dict.get(PDFName.of("BitsPerComponent"))
const filter = dict.get(PDFName.of("Filter"))
if (subtype == PDFName.of("Image") && imageRefMap.has(pdfRef.toString())) {
imagesInDoc.push({
ref,
smaskRef,
colorSpace,
name: name ? name.key : `Object${objectIdx}`,
width: width.numberValue,
height: height.numberValue,
bitsPerComponent: bitsPerComponent.numberValue,
data: pdfObject.contents,
type: filter === PDFName.of("DCTDecode") ? "jpg" : "png",
fieldName: imageRefMap.get(pdfRef.toString()),
})
}
})
const PngColorTypes = {
Grayscale: 0,
Rgb: 2,
GrayscaleAlpha: 4,
RgbAlpha: 6,
}
const ComponentsPerPixelOfColorType = {
[PngColorTypes.Rgb]: 3,
[PngColorTypes.Grayscale]: 1,
[PngColorTypes.RgbAlpha]: 4,
[PngColorTypes.GrayscaleAlpha]: 2,
}
const readBitAtOffsetOfByte = (byte, bitOffset) => {
const bit = (byte >> bitOffset) & 1
return bit
}
const readBitAtOffsetOfArray = (uint8Array, bitOffsetWithinArray) => {
const byteOffset = Math.floor(bitOffsetWithinArray / 8)
const byte = uint8Array[uint8Array.length - byteOffset]
const bitOffsetWithinByte = Math.floor(bitOffsetWithinArray % 8)
return readBitAtOffsetOfByte(byte, bitOffsetWithinByte)
}
const savePng = (image) =>
new Promise((resolve, reject) => {
const isGrayscale = image.colorSpace === PDFName.of("DeviceGray")
const colorPixels = pako.inflate(image.data)
const alphaPixels = image.alphaLayer
? pako.inflate(image.alphaLayer.data)
: undefined
const colorType =
isGrayscale && alphaPixels
? PngColorTypes.GrayscaleAlpha
: !isGrayscale && alphaPixels
? PngColorTypes.RgbAlpha
: isGrayscale
? PngColorTypes.Grayscale
: PngColorTypes.Rgb
const colorByteSize = 1
const width = image.width * colorByteSize
const height = image.height * colorByteSize
const inputHasAlpha = [
PngColorTypes.RgbAlpha,
PngColorTypes.GrayscaleAlpha,
].includes(colorType)
const pngData = new png.PNG({
width,
height,
colorType,
inputColorType: colorType,
inputHasAlpha,
})
const componentsPerPixel = ComponentsPerPixelOfColorType[colorType]
pngData.data = new Uint8Array(width * height * componentsPerPixel)
let colorPixelIdx = 0
let pixelIdx = 0
while (pixelIdx < pngData.data.length) {
if (colorType === PngColorTypes.Rgb) {
pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
} else if (colorType === PngColorTypes.RgbAlpha) {
pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
pngData.data[pixelIdx++] = colorPixels[colorPixelIdx++]
pngData.data[pixelIdx++] = alphaPixels[colorPixelIdx - 1]
} else if (colorType === PngColorTypes.Grayscale) {
const bit =
readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0
? 0x00
: 0xff
pngData.data[pngData.data.length - pixelIdx++] = bit
} else if (colorType === PngColorTypes.GrayscaleAlpha) {
const bit =
readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0
? 0x00
: 0xff
pngData.data[pngData.data.length - pixelIdx++] = bit
pngData.data[pngData.data.length - pixelIdx++] =
alphaPixels[colorPixelIdx - 1]
} else {
throw new Error(`Unknown colorType=${colorType}`)
}
}
const buffer = []
pngData
.pack()
.on("data", (data) => buffer.push(...data))
.on("end", () => resolve(Uint8Array.from(buffer)))
.on("error", (err) => reject(err))
})
let result = {}
for (const img of imagesInDoc) {
if (!img.isAlphaLayer) {
const imageData = img.type === "jpg" ? img.data : await savePng(img)
const imageBase64 = await new Promise((resolve, reject) => {
const reader = new FileReader()
reader.onloadend = () => resolve(reader.result)
reader.onerror = reject
reader.readAsDataURL(
new Blob([imageData], { type: `image/${img.type}` })
)
})
result[img.fieldName] = imageBase64
}
}
return result
}
The key point here is that I find the PDFRef related to the image of the form field and use it to recognise the related PDFObject:
imageFieldNameList.forEach((fName) => {
const image = form
.getButton(fName)
.acroField.getWidgets()[0]
.getAppearances()?.normal
const imageRef = [
...image.dict
.get(PDFName.of("Resources"))
.dict.get(PDFName.of("XObject"))
.dict.values(),
][0]
imageRefMap.set(imageRef.toString(), fName)
})
...
if (subtype == PDFName.of("Image") && imageRefMap.has(pdfRef.toString())) {
...
I found this thread very helpful, but unfortunately the code did not work for me. I tried many other options for extracting images from PDFs, but none worked. Most options can handle JPG easily but fail on PNG data. Since the technique discussed here at least created PNGs, albeit garbled, I decided to debug this solution, which took many hours. So I'm sharing what is working for me right now.
The core problem was matching the alpha layer to the raw image layer and indexing it properly. The original code relied on "ref", which is a number, to match against smaskRef, which is an object. The solution was to match pdfRef against smaskRef instead. There was also a bug in the original code, called out by hafsa110, where the image itself was assigned as its own alpha layer instead of the SMask image. Because the alpha layer is a single-band grayscale image and not a three-band RGB image, after making this correction we can no longer use the image layer's pixel index to read pixel data from the alpha layer. To solve this, I created a separate pixel index for the alpha layer. I marked key changes with comments below.
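The ref versus pdfRef mix-up can be reproduced without pdf-lib. A minimal sketch, assuming enumerateIndirectObjects() returns an array of [reference, object] pairs as the destructuring in this thread suggests, with plain objects standing in for PDF references:

```javascript
// Stand-ins for PDF references and image dictionaries (illustrative only).
const maskRef = { tag: '21 0 R' };
const entries = [
  [{ tag: '20 0 R' }, { kind: 'image', smaskRef: maskRef }],
  [maskRef, { kind: 'mask' }],
];

const images = [];
// The second callback argument of Array.prototype.forEach is the array index,
// not the reference; the reference is the first element of each pair.
entries.forEach(([pdfRef, pdfObject], ref) => {
  images.push({ pdfRef, ref, ...pdfObject });
});

const image = images[0];
const byIndex = images.find((candidate) => candidate.ref === image.smaskRef);
const byPdfRef = images.find((candidate) => candidate.pdfRef === image.smaskRef);

console.log(typeof image.ref); // number
console.log(byIndex); // undefined: an index never equals a reference object
console.log(byPdfRef.kind); // mask
```

The same reasoning applies to the full script that follows, where the lookup compares image.smaskRef with the stored pdfRef.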
const fs = require("fs");
const { PDFDocument, PDFRawStream, PDFName } = require("pdf-lib");
const rimraf = require("rimraf");
const { PNG } = require("pngjs");
const pako = require("pako");
async function getImageFromPdf(inPath) {
const existingPdfBytes = fs.readFileSync(inPath);
const pdfDoc = await PDFDocument.load(existingPdfBytes);
const imagesInDoc = [];
pdfDoc.context
.enumerateIndirectObjects()
.forEach(async ([pdfRef, pdfObject], ref) => {
if (!(pdfObject instanceof PDFRawStream)) {
return;
}
const { dict } = pdfObject;
const smaskRef = dict.get(PDFName.of("SMask"));
const colorSpace = dict.get(PDFName.of("ColorSpace"));
const subtype = dict.get(PDFName.of("Subtype"));
const width = dict.get(PDFName.of("Width"));
const height = dict.get(PDFName.of("Height"));
const name = dict.get(PDFName.of("Name"));
const bitsPerComponent = dict.get(PDFName.of("BitsPerComponent"));
const filter = dict.get(PDFName.of("Filter"));
if (subtype == PDFName.of("Image")) {
imagesInDoc.push({
pdfRef, // added, must use pdfRef to locate alpha layers
ref,
smaskRef,
colorSpace,
name: name ? name.key : `Object${ref}`,
width: width.numberValue,
height: height.numberValue,
bitsPerComponent: bitsPerComponent.numberValue,
data: pdfObject.contents,
type: filter === PDFName.of("DCTDecode") ? "jpg" : "png",
});
}
});
// Log info about the images we found in the PDF
console.log(`===== ${imagesInDoc.length} Images found in PDF =====`);
imagesInDoc.forEach((image) => {
// Find and mark SMasks as alpha layers
if (image.type === "png" && image.smaskRef) {
const smaskImg = imagesInDoc.find((sm) => {
return image.smaskRef == sm.pdfRef; // ref cannot match to smaskRef, must use pdfRef
});
if (smaskImg) {
smaskImg.isAlphaLayer = true;
//image.alphaLayer = image; // changed as suggested by hafsa110, but this creates an alpha layer pixel indexing problem (see savePng)
image.alphaLayer = smaskImg;
}
}
});
imagesInDoc.forEach((image) => {
// Log the details of each image
console.log(
"Name:",
image.name,
"\n Type:",
image.type,
"\n Color Space:",
image.colorSpace.toString(),
"\n Has Alpha Layer?",
image.alphaLayer ? image.alphaLayer : false,
"\n Is Alpha Layer?",
image.isAlphaLayer, // change, true or undefined
"\n SmaskRef:",
image.smaskRef, // added to debug the smaskRef
"\n Width:",
image.width,
"\n Height:",
image.height,
"\n Bits Per Component:",
image.bitsPerComponent,
"\n Data:",
`Uint8Array(${image.data.length})`,
"\n Ref:",
image.ref.toString()
);
});
// changed to hard code my folder
rimraf("./pdf2json/test/*.{jpg,png}", async (err) => {
if (err) console.error(err);
else {
for (const img of imagesInDoc) {
if (!img.isAlphaLayer) {
const imageData = img.type === "jpg" ? img.data : await savePng(img);
fs.writeFileSync(`./pdf2json/test/${img.ref}.` + img.type, imageData);
}
}
console.log();
console.log("Images written to ./pdf2json/test/");
}
});
console.log("done");
}
const PngColorTypes = {
Grayscale: 0,
Rgb: 2,
GrayscaleAlpha: 4,
RgbAlpha: 6,
};
const ComponentsPerPixelOfColorType = {
[PngColorTypes.Rgb]: 3,
[PngColorTypes.Grayscale]: 1,
[PngColorTypes.RgbAlpha]: 4,
[PngColorTypes.GrayscaleAlpha]: 2,
};
const readBitAtOffsetOfByte = (byte, bitOffset) => {
const bit = (byte >> bitOffset) & 1;
return bit;
};
const readBitAtOffsetOfArray = (uint8Array, bitOffsetWithinArray) => {
const byteOffset = Math.floor(bitOffsetWithinArray / 8);
const byte = uint8Array[uint8Array.length - byteOffset];
const bitOffsetWithinByte = Math.floor(bitOffsetWithinArray % 8);
return readBitAtOffsetOfByte(byte, bitOffsetWithinByte);
};
const savePng = (image) =>
new Promise((resolve, reject) => {
const isGrayscale = image.colorSpace === PDFName.of("DeviceGray");
const colorPixels = pako.inflate(image.data);
const alphaPixels = image.alphaLayer
? pako.inflate(image.alphaLayer.data)
: undefined;
// prettier-ignore
const colorType =
isGrayscale && alphaPixels ? PngColorTypes.GrayscaleAlpha
: !isGrayscale && alphaPixels ? PngColorTypes.RgbAlpha
: isGrayscale ? PngColorTypes.Grayscale
: PngColorTypes.Rgb;
const colorByteSize = 1;
const width = image.width * colorByteSize;
const height = image.height * colorByteSize;
const inputHasAlpha = [
PngColorTypes.RgbAlpha,
PngColorTypes.GrayscaleAlpha,
].includes(colorType);
const png = new PNG({
width,
height,
colorType,
inputColorType: colorType,
inputHasAlpha,
});
const componentsPerPixel = ComponentsPerPixelOfColorType[colorType];
png.data = new Uint8Array(width * height * componentsPerPixel);
let colorPixelIdx = 0;
let alphaPixelIdx = 0; // added: new index tracker for the alpha layer
let pixelIdx = 0;
// prettier-ignore
while (pixelIdx < png.data.length) {
if (colorType === PngColorTypes.Rgb) {
png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
}
else if (colorType === PngColorTypes.RgbAlpha) {
png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
png.data[pixelIdx++] = colorPixels[colorPixelIdx++];
//png.data[pixelIdx++] = alphaPixels[colorPixelIdx - 1]; // original: indexed the alpha layer with the color index
png.data[pixelIdx++] = alphaPixels[alphaPixelIdx++]; // alpha layer has one byte per pixel, so it needs its own index starting at 0
}
else if (colorType === PngColorTypes.Grayscale) {
const bit = readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0
? 0x00
: 0xff;
png.data[png.data.length - (pixelIdx++)] = bit
}
else if (colorType === PngColorTypes.GrayscaleAlpha) {
const bit =
readBitAtOffsetOfArray(colorPixels, colorPixelIdx++) === 0
? 0x00
: 0xff;
png.data[png.data.length - pixelIdx++] = bit;
//png.data[png.data.length - pixelIdx++] = alphaPixels[colorPixelIdx - 1]; // must reference alpha layer pixel index here
png.data[png.data.length - pixelIdx++] = alphaPixels[alphaPixelIdx++ - 1];
}
else {
throw new Error(`Unknown colorType=${colorType}`);
}
}
const buffer = [];
png
.pack()
.on("data", (data) => buffer.push(...data))
.on("end", () => resolve(Buffer.from(buffer)))
.on("error", (err) => reject(err));
});
const pdfSource = "./documents/1960782.pdf";
getImageFromPdf(pdfSource);
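The reason savePng needs a second index can be seen in isolation: the color data holds three bytes per pixel while the alpha data holds one, so a single counter cannot address both. The sketch below is not part of the script above. interleaveRgbAlpha is a hypothetical helper, and it assumes both inputs are already inflated, use 8 bits per component, and contain no PNG predictor bytes.

```javascript
// Interleave 3-byte RGB pixels with a 1-byte-per-pixel alpha channel.
// Assumes both inputs are already inflated, 8 bits per component,
// and contain no PNG predictor bytes.
function interleaveRgbAlpha(rgb, alpha) {
  const pixelCount = alpha.length;
  if (rgb.length !== pixelCount * 3) {
    throw new Error("RGB data must hold exactly 3 bytes per alpha byte");
  }
  const rgba = new Uint8Array(pixelCount * 4);
  let rgbIdx = 0;   // advances 3 per pixel
  let alphaIdx = 0; // advances 1 per pixel
  let outIdx = 0;   // advances 4 per pixel
  while (alphaIdx < pixelCount) {
    rgba[outIdx++] = rgb[rgbIdx++];
    rgba[outIdx++] = rgb[rgbIdx++];
    rgba[outIdx++] = rgb[rgbIdx++];
    rgba[outIdx++] = alpha[alphaIdx++];
  }
  return rgba;
}

const rgb = Uint8Array.from([255, 0, 0, 0, 255, 0]); // one red pixel, one green pixel
const alpha = Uint8Array.from([128, 255]);           // half transparent, fully opaque
console.log(Array.from(interleaveRgbAlpha(rgb, alpha)));
// [255, 0, 0, 128, 0, 255, 0, 255]
```

The length check makes a mismatched SMask (for example one with different dimensions than the image) fail loudly instead of producing a garbled PNG.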
@thomaspurk try this to get png and jpeg images from pdf file https://www.npmjs.com/package/pdf-image-extractor
For the implementation, check out the codesandbox.
I did try pdf-image-extractor, among several others. This module was able to handle the JPEGs in my PDFs but threw an error on the PNGs.
I just verified this again using the code sandbox link you provided. The behavior was the same as in my test: it gets the JPGs but not the PNGs.
@thomaspurk have you solved it?
Absolutely. The code I posted above is working well for me!
@thomaspurk what version of pdf-lib is your code using? I am running this on Node.js 20 with pdf-lib version 0.6.1, and I got a TypeError: PDFDocument.load is not a function. The original code that Hopding wrote does not have this issue, since it uses PDFDocumentFactory to load the PDF.
Node v20.11.0, pdf-lib 1.17.1
As I recall, Hopding's original code (posted elsewhere, not in this issue) did not work for me. I can only assume the module's class names were refactored somewhere between versions 0.6.1 and 1.17.1. The current documentation references the use of PDFDocument.load. See the examples here: https://pdf-lib.js.org/
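For anyone who has to support both versions mentioned here, one option is to feature-detect the loader instead of hard-coding it. This is only a sketch: loadPdfDocument is a hypothetical helper, and it assumes the only relevant difference between the two versions is which class exposes load, which I have not verified beyond what is reported in this thread.

```javascript
// Load a PDF with whichever entry point the installed pdf-lib exposes.
// `pdfLib` is the module object, e.g. require("pdf-lib").
// `pdfBytes` is the file content, e.g. from fs.readFileSync(path).
async function loadPdfDocument(pdfLib, pdfBytes) {
  if (pdfLib.PDFDocument && typeof pdfLib.PDFDocument.load === "function") {
    // Reported to work with pdf-lib 1.17.1 in this thread.
    return pdfLib.PDFDocument.load(pdfBytes);
  }
  if (
    pdfLib.PDFDocumentFactory &&
    typeof pdfLib.PDFDocumentFactory.load === "function"
  ) {
    // Reported to work with pdf-lib 0.6.1 in this thread.
    return pdfLib.PDFDocumentFactory.load(pdfBytes);
  }
  throw new Error("No known pdf-lib load function found");
}

// Usage (inside an async function, with pdf-lib installed):
// const pdfDoc = await loadPdfDocument(
//   require("pdf-lib"),
//   require("fs").readFileSync("./input.pdf") // hypothetical path
// );
```

Because the function is async, the caller can await the result the same way whether the underlying loader returns a promise or a plain value.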
Hi everyone! I am trying to extract all images from a PDF page. I don't know if it is possible, but I would like to do something like this website does. I am currently manipulating the PDF as follows:
const pdfDoc = PDFDocumentFactory.load('pdf/path');
const pages = pdfDoc.getPages();
const existingPage = pages[0];
Thank you for your answers :)