JoshData / pdf-redactor

A general purpose PDF text-layer redaction tool for Python 2/3.
Creative Commons Zero v1.0 Universal

Inline images #9

Closed divergentdave closed 7 years ago

divergentdave commented 7 years ago

This handles inline images in content streams. Each combination of color space, compression filter, and encoding filter will require special handling to determine how long the image data is. For now, this just supports uncompressed 1-bit image masks.
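For illustration, the length of an uncompressed inline image can be computed from its dimensions, since each row of samples is padded to a whole number of bytes. The following is a minimal sketch, not the code from this pull request; the function name `uncompressed_image_length` and its parameters are hypothetical. A 1-bit image mask is the case with one bit per component and one component.

```python
def uncompressed_image_length(width, height, bits_per_component=1, components=1):
    """Return the byte length of uncompressed image data.

    Hypothetical helper for illustration only. Each row is padded
    up to a byte boundary, so the row size is rounded up before
    multiplying by the number of rows.
    """
    bits_per_row = width * bits_per_component * components
    bytes_per_row = (bits_per_row + 7) // 8  # round up to whole bytes
    return bytes_per_row * height


# A 1-bit image mask that is 9 pixels wide needs 2 bytes per row.
mask_length = uncompressed_image_length(9, 2)
```

Compressed or filtered images cannot be sized this way, which is why each filter combination would need its own handling.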

JoshData commented 7 years ago

Wow. Really nice solution.

divergentdave commented 7 years ago

Well, mostly. I'm still getting errors on a small number of CRS report PDFs. Some inline images have more bytes of binary data than their dimensions imply, so the tail of the image data shows up where I was expecting an EI operator. Looking at other implementations, it seems I might have to scan for the correct "whitespace, EI, whitespace, any ASCII" sequence to resynchronize with the content stream.
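A rough sketch of that kind of scan, assuming the content stream is available as a `bytes` object and `start` is the index immediately after the `ID` operator. The names `find_ei` and `PDF_WHITESPACE` are hypothetical, and this is a simplified illustration rather than the code in this pull request or the logic used by other implementations.

```python
# The six whitespace characters defined for PDF syntax.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def find_ei(stream, start=0):
    """Return the index of the "E" in the terminating EI operator, or -1.

    Hypothetical helper. A candidate "EI" is accepted only if it is
    preceded by whitespace, followed by whitespace (or end of stream),
    and the byte after that whitespace, if any, is ASCII.
    """
    pos = start
    while True:
        pos = stream.find(b"EI", pos)
        if pos == -1:
            return -1
        before_ok = pos - 1 >= start and stream[pos - 1] in PDF_WHITESPACE
        after = pos + 2
        after_ok = after == len(stream) or stream[after] in PDF_WHITESPACE
        if before_ok and after_ok:
            nxt = after + 1
            # Binary image data is likely to contain high bytes, while
            # content stream operators and operands are plain ASCII.
            if nxt >= len(stream) or stream[nxt] < 0x80:
                return pos
        pos += 1  # false positive inside the image data; keep scanning


# The first " EI" is followed by "x", so it is treated as image data.
sample = b"ID \x01 EIx\x02\nEI Q"
end = find_ei(sample, start=2)
```

A heuristic like this can still be fooled by image data that happens to contain the full sequence, so it is a fallback for when the computed length does not line up with an EI.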

divergentdave commented 7 years ago

Okay, this is good to go now. There were a total of six CRS reports where inline images didn't have the right length (half longer, half shorter). There was even a 1x1 image with zero bytes of image data! This search algorithm is in line with what Poppler and pdf.js do.

JoshData commented 7 years ago

Huh. Wow. Ok merging!