galkahana / HummusJS

Node.js module for high performance creation, modification and parsing of PDF files and streams
http://www.pdfhummus.com

Unable to modify some pdfs that were generated by scanner #245

Open ziopads opened 6 years ago

ziopads commented 6 years ago

I've run into an issue with modifying certain pdf files. The issue seems to happen only with pdf files that were generated by a scanner. I'm successfully adding serial numbers to the bottom of every page of a document using the code below, which closely follows the documentation. With scanned documents, however, the number does not appear. So far, the missing text has shown up only in pdf files that were generated by a scanner.

const hummus = require('hummus');
const path = require('path');

function writePdf(writeObj) {
  const pdfWriter = hummus.createWriterToModify(writeObj.localLocation, {
    log: path.resolve(__dirname, './hummus.txt'),
    modifiedFilePath: `./tmp${writeObj.projId}/${writeObj.filename}.pdf`,
  });

  const arialFont = pdfWriter.getFontForFile(path.resolve(__dirname, './arial.ttf'));
  const textOptions = {
    font: arialFont,
    size: 12,
    colorspace: 'rgb',
    color: 0x262673,
  };

  for (let i = 0; i < writeObj.pageTotal; i += 1) {
    const pageModifier = new hummus.PDFPageModifier(pdfWriter, i, true);

    pageModifier
      .startContext()
      .getContext()
      .writeText(
        `${writeObj.prefix} ${writeObj.numFirst + i}`,
        275, 10,
        textOptions,
      );

    pageModifier.endContext().writePage();
  }

  pdfWriter.end();
}

module.exports = writePdf;

It's worth noting that I haven't been able to generate a log file; I think I'm following the instructions for generating it, but none of my attempts have produced one. I've added console logging and experimented with try-catch blocks, and I don't see any indication that the .writeText method is failing.

Is it possible that there is something in the graphic context for these pdfs that prevents .writeText from working normally? I suspect that the .writeText method is working just as it does for all non-scanned pdfs but that there is something about the graphic context for the scanned pdfs that is preventing the display of the text. I don't know if there is such a thing as a z-index for pdfs, but it seems like the scanned image is probably "on top" of the text in the pdf. I note the following passage in the documentation for the .endContext method:

Note that you can call this method and call startContext later again to create a new context for the same page. This allows you to stop writing for a page in order to add graphics outside of the graphic context, similar to the usage logic of pdfWriter.pausePageContentContext(cxt)

Is there a way I can create a context other than the graphic context that "superimposes" my text on top of the graphic context that has the scanned image? It's not clear to me how to do that, since .startContext returns the PDFPageModifier instance, and not the context. The .getContext method seems to be the only option, but that just gives me the graphic context that seems to be creating the problem in the first place.

Alternatively, is there another way I can work around this issue, perhaps by copying the scanned pdf into a new pdf, as illustrated in #98?

chunyenHuang commented 6 years ago

Well, it may be easier than you think. If I were you, I would check these things first:

  1. Does the page's media box start at 0,0?
  2. Is the page rotated?
  3. Try using an xObjectForm

You can use the following script to find out the rotation and the media box:

    for (let i = 0; i < writeObj.pageTotal; i += 1) {
        const pageModifier = new hummus.PDFPageModifier(pdfWriter, i, true);

        const pageInfo = pdfWriter.createPDFCopyingContextForModifiedFile()
            .getSourceDocumentParser()
            .parsePage(i);
        console.log(pageInfo.getMediaBox());
        console.log(pageInfo.getRotate());
    }

Usually the media box looks something like [0, 0, width, height], but that's not always true, especially for scanned files (it's a curse lol~). So you will have to modify your code to write the text relative to the "real" bottom-left corner.

    const mediaBox = pageInfo.getMediaBox();
    pageModifier
      .startContext()
      .getContext()
      .writeText(
        `${writeObj.prefix} ${writeObj.numFirst + i}`,
        mediaBox[0] + 275, mediaBox[1] + 10,
        textOptions,
      );
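To see why the offset matters, here is a minimal sketch of the same arithmetic as a standalone function. `offsetFromMediaBox` is a hypothetical helper, not part of HummusJS, and the media box values are made up for illustration; it only assumes the media box is an array of the form [left, bottom, right, top], as printed by the script above.

```javascript
// Hypothetical helper (not a HummusJS API): turn an offset measured from the
// page's bottom-left corner into coordinates in the page's own space.
// Assumes mediaBox is [left, bottom, right, top].
function offsetFromMediaBox(mediaBox, dx, dy) {
  const [left, bottom] = mediaBox;
  return [left + dx, bottom + dy];
}

// Made-up example: a page whose media box does not start at 0,0.
const [x, y] = offsetFromMediaBox([0, -792, 612, 0], 275, 10);
console.log(x, y); // 275 -782
```

With a media box like the one in this example, text written at the unadjusted position (275, 10) would land above the top edge of the page, which would look exactly like "the text is missing".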

For rotation, see #167.
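As a rough illustration of what handling rotation involves (this is a generic sketch based on how PDF page rotation works, not the approach from #167), the two helpers below compute a transformation matrix that places the origin at the corner displayed as bottom-left and turns the axes so text reads upright. `uprightMatrix` and `applyMatrix` are hypothetical names, and the sketch assumes the rotation is a multiple of 90 degrees applied clockwise when the page is displayed, and that the media box is [left, bottom, right, top].

```javascript
// Hypothetical helper (not a HummusJS API): build the six numbers
// [a, b, c, d, e, f] of a PDF transformation matrix that moves the origin to
// the corner shown as bottom-left on a rotated page and counter-rotates the axes.
function uprightMatrix(mediaBox, rotate) {
  const [x0, y0, x1, y1] = mediaBox;
  const r = (((rotate || 0) % 360) + 360) % 360; // normalize, e.g. -90 -> 270
  switch (r) {
    case 90: return [0, 1, -1, 0, x1, y0];   // origin at the bottom-right corner
    case 180: return [-1, 0, 0, -1, x1, y1]; // origin at the top-right corner
    case 270: return [0, -1, 1, 0, x0, y1];  // origin at the top-left corner
    default: return [1, 0, 0, 1, x0, y0];    // unrotated: just the media box offset
  }
}

// Map a point given in "as displayed" coordinates to page coordinates,
// using the PDF convention x' = a*x + c*y + e, y' = b*x + d*y + f.
function applyMatrix([a, b, c, d, e, f], x, y) {
  return [a * x + c * y + e, b * x + d * y + f];
}

// Made-up example: a 612 x 792 page with a rotation of 90.
const m = uprightMatrix([0, 0, 612, 792], 90);
console.log(applyMatrix(m, 275, 10)); // [ 602, 275 ]
```

If the content context exposes the q, cm and Q operators, the matrix could be applied with cm between q and Q, with writeText called at (275, 10) inside that pair; treat that as an assumption to verify against the HummusJS documentation.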

For xObjectForm, see #242.

Good luck!

christiansaiki commented 6 years ago

@ziopads did you manage to solve your issue?