galkahana / HummusJS

Node.js module for high performance creation, modification and parsing of PDF files and streams
http://www.pdfhummus.com

Streams - 269kb pdf becomes 10Mb #247

Closed richard-kurtosys closed 6 years ago

richard-kurtosys commented 6 years ago

Hi,

I have a 269 KB PDF on S3 that I'm retrieving into a buffer using s3.getObject. I'm using the following code to create the input and output streams:

var inStream = new PDFRStreamForBuffer(document.file.Body); //document.file.Body contains the buffer
var outStream = new hummus.PDFStreamForResponse(res); //res is my expressjs res
const pdfWriter = hummus.createWriterToModify(inStream, outStream,);

After the call to hummus.createWriterToModify, inStream has inStream.fileSize = 268816 and inStream.rposition = 253966.

The problem is that outStream.position is 10485760.

The PDF gets manipulated and displays correctly, but it is now a 10 MB file.

I went through the source and found the following at https://github.com/galkahana/HummusJS/blob/master/src/deps/PDFWriter/PDFWriter.cpp#L710:

EStatusCode status = traits.CopyToOutputStream(inModifiedSourceStream);

CopyToOutputStream is here: https://github.com/galkahana/HummusJS/blob/master/src/deps/PDFWriter/OutputStreamTraits.cpp#L36

Is there a way to copy just the buffer size and not create a 10 MB object? Or is there a different solution to this?

Thank you for the assistance!

richard-kurtosys commented 6 years ago

If I save the S3 buffer to a file on disk and then use:

var inStream = new hummus.PDFRStreamForFile("s3file.pdf");  // Only this changes
var outStream = new hummus.PDFStreamForResponse(res);
const pdfWriter = hummus.createWriterToModify(inStream, outStream,);

then everything works correctly: inStream and outStream both have a size of 268816, and the file downloads as 287 KB (which is correct, as I apply a watermark to it).

pwinkelm commented 6 years ago

Hello Richard, maybe you can help me with a question I had.

My problem is that I receive a PDF from an interface as a base64 string, and I need to append some base64-encoded images to it. So I think I first need to write the PDF base64 string to a file, but I do not know how to do that.

Is the document.file.Body you have a base64 string? How did you save that buffer to a file on disk?

Thank You!

richard-kurtosys commented 6 years ago

Hi @pwinkelm

The document.file.Body I have is an actual buffer - https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#getObject-property

I used:

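const fs = require("fs");

// Promise wrapper around fs.writeFile; the callback only receives an error argument.
function writeFile(filename, buffer) {
    return new Promise((resolve, reject) => {
        fs.writeFile(filename, buffer, (error) => {
            if (error) {
                return reject(error);
            }
            return resolve();
        });
    });
}

// Call this from inside an async function.
await writeFile("filename.pdf", document.file.Body);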

But you should be able to get away with something like: http://www.codeblocq.com/2016/04/Convert-a-base64-string-to-a-file-in-Node/
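
For the base64 case specifically, a minimal sketch using only Node's built-in Buffer, fs, os and path modules might look like the following. The helper name writeBase64ToFile, the sample string, and the temp-file path are made up for illustration; the real input would be the base64 PDF string received from the interface.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Decode a base64 string into a Buffer and write the raw bytes to disk.
// (Hypothetical helper, not part of hummus.)
function writeBase64ToFile(filename, base64String) {
    const buffer = Buffer.from(base64String, "base64");
    fs.writeFileSync(filename, buffer);
    return buffer;
}

// Stand-in input: a short string encoded as base64 instead of a real PDF.
const sampleText = "%PDF-1.4 sample bytes";
const sampleBase64 = Buffer.from(sampleText).toString("base64");

const target = path.join(os.tmpdir(), "from-base64.pdf");
const written = writeBase64ToFile(target, sampleBase64);
console.log(written.length, "bytes written to", target);
```

The decoded Buffer is the same kind of object as document.file.Body, so it may also be possible to skip the file on disk and pass it straight to a buffer-backed input stream.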

richard-kurtosys commented 6 years ago

Back to the original issue: would it make sense to pass the buffer size in to EStatusCode status = traits.CopyToOutputStream(inModifiedSourceStream); here: https://github.com/galkahana/HummusJS/blob/master/src/deps/PDFWriter/PDFWriter.cpp#L710

inModifiedSourceStream should have a fixed size as it would be either a file or a buffer (not a stream). My C isn't good enough to be able to test and try this :(

richard-kurtosys commented 6 years ago

I did some more research into this, trying different buffer sizes and running npm install --build-from-source to generate a new /node_modules/hummus/binding/hummus.node.

I came up with the following:

Input file size: 10618 bytes

Buffer size   Function               Average time (ms)   Output file size (bytes)
1             createWriterToModify   41.58715            29239
1024          createWriterToModify   5.50785             29885
65535         createWriterToModify   20.8952             84157
10485760      createWriterToModify   3168.794            10504384

Input file size: 18299457 bytes

Buffer size   Function               Average time (ms)   Output file size (bytes)
1             createWriterToModify   Runs out of memory  ---
1024          createWriterToModify   6395.38055          20276115
65535         createWriterToModify   6550.6411           20326291
10485760      createWriterToModify   7397.794            22947731

The first set is the time to run createWriterToModify on a 10 KB PDF. The second set is the time to run createWriterToModify on an 18 MB PDF.

The leftmost column is the buffer size I set, and the rightmost column is the final output file size.

The quickest appears to be when the buffer size is set to 1024. This also gives the second smallest filesize.

Based on this, would it be possible to change the buffer size at https://github.com/galkahana/HummusJS/blob/master/src/deps/PDFWriter/OutputStreamTraits.cpp#L36 to 1024 instead of 10*1024*1024?

galkahana commented 6 years ago

Use the 1.0.85 version of PDFRStreamForBuffer. It is the cause of the error, and so it is where the correction is to be made.
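
To illustrate the kind of reader bug that could produce this symptom, here is a toy model. It is not the HummusJS source: the class names, the read/notEnded method pair, and the sizes are assumptions chosen for the illustration. It shows how a chunked copy loop ends up writing a whole number of chunks when the reader hands back the requested amount instead of the amount that remains.

```javascript
// Hypothetical reader that always hands back `amount` entries, even past the end of the data.
class UnclampedReader {
    constructor(buffer) {
        this.buffer = buffer;
        this.position = 0;
    }
    read(amount) {
        const chunk = [];
        for (let i = 0; i < amount; i++) {
            // Indexing past the end yields undefined, which is written out as a 0 byte.
            chunk.push(this.buffer[this.position + i]);
        }
        this.position += amount;
        return chunk;
    }
    notEnded() {
        return this.position < this.buffer.length;
    }
}

// Same reader, but each read is limited to the bytes that remain.
class ClampedReader extends UnclampedReader {
    read(amount) {
        return super.read(Math.min(amount, this.buffer.length - this.position));
    }
}

// Chunked copy loop: read fixed-size chunks until the reader reports that it has ended.
function copyAll(reader, chunkSize) {
    const out = [];
    while (reader.notEnded()) {
        out.push(Buffer.from(reader.read(chunkSize)));
    }
    return Buffer.concat(out);
}

const source = Buffer.alloc(2500, 1); // 2500-byte stand-in for a PDF
const unclampedLength = copyAll(new UnclampedReader(source), 1024).length;
const clampedLength = copyAll(new ClampedReader(source), 1024).length;
console.log(unclampedLength); // 3072, three full 1024-byte chunks
console.log(clampedLength);   // 2500, the original size
```

With a single 10*1024*1024 chunk and a 268816-byte input, the same unclamped behaviour would write exactly 10485760 bytes, which is the outStream.position reported at the top of this thread.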

richard-kurtosys commented 6 years ago

Awesome @galkahana - I'm looking forward to testing and trying this out!

:)

Hopefully I'll have some feedback in a day or two, and I'll paste in some results.

richard-kurtosys commented 6 years ago

Confirmation that this has been fixed in 1.0.86 :) Thank You Very Much @galkahana

For the incoming buffer, be sure to use new hummus.PDFRStreamForBuffer; before, I was using new PDFRStreamForBuffer.

var inStream = new hummus.PDFRStreamForBuffer(document.file.Body); 

Stats for 1.0.86 are:

Input file size: 10618 bytes (1 page)

Buffer size     Function               Average time (ms)   Output file size (bytes)
(not changed)   createWriterToModify   5.38735             29239

Input file size: 18299457 bytes (1310 pages)

Buffer size     Function               Average time (ms)   Output file size (bytes)
(not changed)   createWriterToModify   6727.76175          20275668
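
As a cross-check, the earlier bloated sizes are consistent with the copied source having been padded up to the next multiple of the copy buffer size. This is an observation from the numbers in this thread, not a confirmed description of the library internals. The sketch below takes the 1.0.86 output sizes as the unpadded baseline and assumes the row listed as 65535 was actually run with 65536; the function names are made up for the illustration.

```javascript
// Round a byte count up to the next multiple of the copy buffer size.
function paddedSize(inputSize, bufferSize) {
    return Math.ceil(inputSize / bufferSize) * bufferSize;
}

// Predicted output = output of the fixed version + padding added to the copied source.
function predictedOutput(inputSize, bufferSize, fixedOutputSize) {
    return fixedOutputSize + (paddedSize(inputSize, bufferSize) - inputSize);
}

// Input sizes and 1.0.86 output sizes taken from the tables in this thread.
const files = [
    { input: 10618, fixedOutput: 29239 },
    { input: 18299457, fixedOutput: 20275668 },
];
const bufferSizes = [1024, 65536, 10485760]; // 65536 is an assumption, the table says 65535

for (const file of files) {
    for (const bufferSize of bufferSizes) {
        const predicted = predictedOutput(file.input, bufferSize, file.fixedOutput);
        console.log(file.input, bufferSize, predicted);
    }
}
```

For the 18 MB file this gives 20276115, 20326291 and 22947731, which matches the earlier table. For the 10 KB file it gives 29885 and 84157, and 10504381 for the 10 MB buffer, which is 3 bytes short of the measured 10504384.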