claird / PyPDF4

A utility to read and write PDFs with Python
obsolete-https://pythonhosted.org/PyPDF2/

Wanted for testing: PDF files with specific features #20

Open acsor opened 5 years ago

acsor commented 5 years ago

Some of the unit tests I have developed rely on PDF files with specific features. I have a collection of 109+ PDF books in Calibre, but for some of these features I haven't found a single matching file. In particular, I'm looking for:

  1. A PDF file with a /ASCIIHexDecode, or equivalently /AHx stream filter.
  2. A PDF file with a /JPXDecode stream filter.
  3. More PDF files containing objects whose /Type is /ObjStm, that is to say files that use object streams and therefore rely on cross-reference streams (PDF 1.5+).
  4. A few more hybrid-reference files, as described in section 7.5.8.4 of ISO 32000: files that keep a cross-reference table for older readers while hiding further objects behind a cross-reference stream that only PDF 1.5+ readers understand.

I am asking because I need these files for the fixture data collection of the project (in tests/fixture_data/ of my current PR #14). PDF files with these characteristics seem to be rare, so I am asking for your help.

I have performed my searches with a simple grep. For example, for case 2 I ran:

grep -RPi --binary-files=text [--exclude-dir=<whatever you want>] "/JPXDecode" <arbitrary path>
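
For anyone without grep at hand, the same search can be sketched in Python. This is only a rough equivalent of the command above: the function name `find_pdfs_with_marker` and its parameters are made up for illustration, and unlike the grep it looks only at files ending in .pdf.

```python
import os
import re


def find_pdfs_with_marker(root, marker, exclude_dirs=()):
    """Yield paths of .pdf files under `root` whose raw bytes contain `marker`.

    The match is case-insensitive, like `grep -i`. Directory names listed in
    `exclude_dirs` are skipped, like `--exclude-dir`.
    """
    pattern = re.compile(re.escape(marker.encode("ascii")), re.IGNORECASE)
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in exclude_dirs]
        for name in filenames:
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(dirpath, name)
            # Read as bytes: PDFs are binary, so no text decoding is attempted.
            with open(path, "rb") as fh:
                if pattern.search(fh.read()):
                    yield path
```

Usage would be something like `list(find_pdfs_with_marker("/path/to/library", "/JPXDecode"))`. Each file is read into memory in full, which is fine for books but may not suit very large files.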

dreua commented 5 years ago

1. Files containing "/ASCIIHexDecode"

I found three of these, all scans of books I got from university that someone else created, but I don't feel comfortable publishing them. I could send them to you over a private channel if that would be any help.

asciihexdecode.pdf is one page that I extracted from one of these documents using pdfarranger with PyPDF2.

2. Files containing "/JPXDecode"

3. /Type equal to /ObjStm

How can I grep for those? The file 00-LV-Info-SS2018-speicheropt-3.pdf contains <</Filter/FlateDecode/First 14/Length 343/N 2/Type/ObjStm>>stream. Is this sufficient?

4. Other hybrid-reference files

How would I identify those if I had them?
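
A possible heuristic for questions 3 and 4, offered as a sketch rather than a definitive answer: scan the raw bytes for the relevant name objects. It assumes that stream dictionaries are stored uncompressed, so /Type /ObjStm and the filter names are visible in the raw file, and that a hybrid-reference file can be recognised by an /XRefStm entry in its trailer dictionary, which is how I read section 7.5.8.4 of ISO 32000. The names `FEATURE_PATTERNS` and `detect_features` are made up for illustration, and names written with #-escapes would be missed.

```python
import re

# Each pattern matches a PDF name object in raw bytes. The negative lookahead
# keeps a name from matching as a prefix of a longer name
# (for example /XRef inside /XRefStm).
_END = rb"(?![A-Za-z0-9])"
FEATURE_PATTERNS = {
    "ASCIIHexDecode": re.compile(rb"/(?:ASCIIHexDecode|AHx)" + _END),
    "JPXDecode": re.compile(rb"/JPXDecode" + _END),
    "ObjStm": re.compile(rb"/Type\s*/ObjStm" + _END),
    "XRefStream": re.compile(rb"/Type\s*/XRef" + _END),
    # Assumption: /XRefStm appears only in trailers of hybrid-reference files.
    "XRefStm": re.compile(rb"/XRefStm" + _END),
}


def detect_features(data):
    """Return the set of feature names whose marker occurs in `data` (bytes)."""
    return {name for name, pat in FEATURE_PATTERNS.items() if pat.search(data)}
```

Calling `detect_features(open(path, "rb").read())` on a candidate file gives a quick first filter; a file reporting "XRefStm" is a candidate hybrid-reference file, but the result should still be confirmed with a real parser.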