JoshData / pdf-redactor

A general purpose PDF text-layer redaction tool for Python 2/3.
Creative Commons Zero v1.0 Universal

Arrays and dictionaries in page content may be split across two stream objects #1

divergentdave closed 7 years ago

divergentdave commented 7 years ago

I tried processing a PDF I had lying around and got an IndexError in tokenize_stream. The root cause is that a dictionary in the page content is split across two stream objects, and therefore across two invocations of tokenize_stream, so the << token gets thrown away before the end of the dictionary is parsed. Fixing this will require keeping the stack around between stream objects. I'll take a crack at a PR to do so.
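To illustrate the idea of keeping the stack around between stream objects, here is a rough sketch. It is not the actual tokenize_stream from pdf-redactor: the ContentTokenizer class, its feed method, and the simplified TOKEN_RE pattern are all hypothetical names made up for this example, and the toy tokenizer ignores strings, comments, and inline images.

```python
import re

# Hypothetical, simplified token pattern: dictionary/array delimiters, or any
# run of characters that is neither whitespace nor a delimiter. Real content
# streams also contain strings, comments, and inline images (ignored here).
TOKEN_RE = re.compile(r"<<|>>|\[|\]|[^\s\[\]<>]+")

# Maps each opening delimiter to the closing delimiter it expects.
OPENERS = {"<<": ">>", "[": "]"}


class ContentTokenizer(object):
    """Toy tokenizer whose nesting stack survives across stream objects."""

    def __init__(self):
        # One entry per container that is still open:
        # (expected closing delimiter, items collected so far).
        self.stack = []

    def feed(self, data):
        """Tokenize one stream's data; return top-level objects completed."""
        completed = []
        for token in TOKEN_RE.findall(data):
            if token in OPENERS:
                self.stack.append((OPENERS[token], []))
            elif token in (">>", "]"):
                if not self.stack or self.stack[-1][0] != token:
                    raise ValueError("unbalanced %r" % token)
                _, items = self.stack.pop()
                self._emit(items, completed)
            else:
                self._emit(token, completed)
        return completed

    def _emit(self, obj, completed):
        # Attach to the innermost open container, or report at top level.
        if self.stack:
            self.stack[-1][1].append(obj)
        else:
            completed.append(obj)


tokenizer = ContentTokenizer()
first = tokenizer.feed("/P << /MCID 0")   # dictionary opened, not yet closed
second = tokenizer.feed(">> BDC")         # dictionary closed in the next stream
```

Because the stack lives on the tokenizer object rather than inside a single call, the << seen in the first stream is still there when the >> arrives in the second one. A per-call stack would be empty at that point, which is the kind of situation that could lead to the IndexError.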

Test case:

curl https://www.ncua.gov/About/Pages/inspector-general/audit-reports/Documents/ncua-report-cybersecurity-act-aug-10-2016.pdf > a.pdf
qpdf --stream-data=uncompress --pages a.pdf 1 -- a.pdf b.pdf
python example.py < b.pdf