galkahana / HummusJS

Node.js module for high performance creation, modification and parsing of PDF files and streams
http://www.pdfhummus.com

Valid PDF has invalid/blank pages after appending all pages in new document #254

Open tsestrich opened 6 years ago

tsestrich commented 6 years ago

Hi,

I am running into an issue where certain valid PDF files (all pages display properly in Acrobat and in the Chrome browser) produce an erroneous PDF file after processing. All I am trying to do is concatenate PDF files with the appendPDFPagesFromPDF function, as shown below. This works for most documents, but certain files result in blank pages when run through this process. When I open the output in Acrobat, I receive an error prompt when trying to view the blank pages.

See the attached files for the "before" and "after": Test Before.pdf Test After.pdf

I tried to check logs, but no logs are being written (which, as indicated in another ticket here, suggests that Hummus is not throwing any errors).

The code I am using is below:

    try
    {
        // Create the target PDF with logging enabled.
        var pdfWriter = hummus.createWriter(targetFileLocation, {
            version: hummus.ePDFVersion17,
            log: "combineTest.log"
        });

        // Append every page of each source file, in order.
        fileGrouping.Files.forEach(function(file){
            var docUniqueId = file.docUniqueId;

            var tempPDFFileNameIn = tempDirectoryName + "/" + docUniqueId + ".pdf";
            pdfWriter.appendPDFPagesFromPDF(tempPDFFileNameIn);
        });

        // Set the title of the combined document.
        var newInfo = pdfWriter.getDocumentContext().getInfoDictionary();

        newInfo.title = fileGrouping.title;

        pdfWriter.end();
    }
    catch(error)
    {
        logService.LogError(error);
    }

tsestrich commented 6 years ago

I should also say that I am open to suggestions for better ways to simply concatenate PDF files (if I can avoid doing page-by-page).

chunyenHuang commented 6 years ago

You may take a look here: https://github.com/galkahana/HummusJS/wiki/Embedding-pdf

For invalid PDFs, you may use the reader before appending:

    var tempPDFFileNameIn = tempDirectoryName + "/" + docUniqueId + ".pdf";
    var pdfReader = hummus.createReader(tempPDFFileNameIn);
    ...
tsestrich commented 6 years ago

Thank you for the suggestion! I'm experimenting with this now, but I'm not sure what I need to do after reading the file in. I thought I was supposed to stream from the reader into the writer, but I don't think I'm doing it right. The following doesn't work (I get the error "unable to append page, make sure it's fine"):

var tempPDFFileNameIn = tempDirectoryName + "/" + docUniqueId + ".pdf";
var pdfReader = hummus.createReader(tempPDFFileNameIn);

pdfWriter.appendPDFPagesFromPDF(pdfReader.getParserStream());

What are you suggesting I do after reading in the source file?

tsestrich commented 6 years ago

I also tried the following, taking another approach from the "Embedding" guide you linked, and I am still getting the same blank pages:

    var tempPDFFileNameIn = tempDirectoryName + "/" + docUniqueId + ".pdf";

    var cpyCxt = pdfWriter.createPDFCopyingContext(tempPDFFileNameIn);
    var pageCount = cpyCxt.getSourceDocumentParser().getPagesCount();

    for(var i=0; i<pageCount; i++)
    {
        cpyCxt.appendPDFPageFromPDF(i);
    }

@galkahana any chance you could help me figure out what about the original PDF is causing issues when it is being copied/appended? See the original examples in my first post.

galkahana commented 6 years ago

The problem is with the appending algorithm in HummusJS. In PDF, pages may use resources. Those resources may be defined in the page's own resources dictionary or, when the page does not have one, inherited from a parent's resources dictionary. The appending algorithm ignores inherited resources. The solution is either to implement a better mechanism in JS or to fix the problem in the Hummus code. I implemented a correction in the code and will publish it soon.
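
To illustrate the inheritance rule described above, here is a minimal sketch of how a page's effective resources could be looked up by walking up its parent chain. It is not the HummusJS API and not the fix that was published: the plain objects below are hypothetical stand-ins for parsed PDF dictionaries, and `resolveInheritedResources` is a name made up for this example.

```javascript
// Hypothetical in-memory page tree: plain objects stand in for parsed PDF
// dictionaries. This is an illustration only, not the HummusJS API.
function resolveInheritedResources(pageNode) {
    var node = pageNode;
    var visited = new Set(); // guard against malformed, cyclic Parent chains

    while (node && !visited.has(node)) {
        // The nearest Resources entry wins: the page's own, else an ancestor's.
        if (node.Resources) {
            return node.Resources;
        }
        visited.add(node);
        node = node.Parent;
    }
    return null; // no Resources anywhere on the chain
}

// A Pages node that carries the shared resources.
var root = { Type: 'Pages', Resources: { Font: { F1: 'Helvetica' } } };

// One page with its own resources, one page that relies on inheritance.
var pageWithOwn = { Type: 'Page', Parent: root, Resources: { Font: { F2: 'Courier' } } };
var pageInheriting = { Type: 'Page', Parent: root };

console.log(resolveInheritedResources(pageWithOwn));
console.log(resolveInheritedResources(pageInheriting));
```

A copy routine that only looks at the page's own entry would find nothing for the second page, which matches the blank-page symptom reported in this issue.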

galkahana commented 6 years ago

1.0.84 should solve the problem