GNU General Public License v2.0

Caradoc - a PDF parser and validator

Caradoc is a PDF parser and validator written in OCaml. This is version 0.3 (beta).

Caradoc provides many commands to analyze PDF files, as well as an interactive console user interface.

Caradoc was presented at the third Workshop on Language-Theoretic Security (LangSec) in May 2016. More information is available on the website of the conference.

Dependencies

Along with an OCaml compiler, this program depends on the following libraries: ocamlfind, cryptokit, ounit, menhir and curses.

The preferred way to install the dependencies is via opam, the OCaml package manager. For example, on a Debian-based system:

apt-get install ocaml opam
apt-get install zlib1g-dev libgmp-dev pkg-config m4
opam init
opam install ocamlfind
opam install cryptokit ounit menhir curses

Alternatively, you can install the corresponding Debian packages:

apt-get install ocaml zlib1g-dev ocaml-findlib libcryptokit-ocaml-dev libounit-ocaml-dev libcurses-ocaml-dev menhir

Installation

After installing the dependencies, run make to compile the code. You can then run make test to check that the program works properly on your architecture and with your versions of OCaml and opam.

Examples

Command line

To obtain simple statistics on a PDF file, just type:

caradoc stats path/to/your/input.pdf

To validate a PDF file, check the exit code of:

caradoc stats --strict path/to/your/input.pdf
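
Since validation is reported through the exit code, the check is easy to wrap in a script. The following is a minimal sketch, not part of Caradoc itself: it assumes that an exit status of 0 means the file was accepted and that any other status means it was rejected or could not be processed. The function name validate_pdf and the CARADOC variable (used to override the binary name) are hypothetical helpers introduced for this example.

```shell
# Sketch: report whether a PDF passes strict validation.
# ASSUMPTION: exit status 0 = accepted, non-zero = rejected or error.
# CARADOC is a hypothetical override for the binary name, not a Caradoc feature.
validate_pdf() {
    if "${CARADOC:-caradoc}" stats --strict "$1" >/dev/null 2>&1; then
        echo "valid: $1"
    else
        echo "invalid: $1"
        return 1
    fi
}
```

For instance, a loop such as for f in *.pdf; do validate_pdf "$f"; done would print one verdict per file in the current directory.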

To normalize a PDF file into the strict syntax:

caradoc cleanup path/to/your/input.pdf --out path/to/your/output.pdf

To print the xref table(s):

caradoc xref path/to/your/input.pdf

To print the trailer(s):

caradoc trailer path/to/your/input.pdf

To extract a specific object, given its object number (and optionally its generation number, which defaults to zero):

caradoc object --num 2 path/to/your/input.pdf
caradoc object --num 2 --gen 5 path/to/your/input.pdf

To extract complex data in a single command (xref table, dump of all objects, list of types, graph of references):

caradoc extract --xref <xref output file> --dump <objects output file> --types <types output file> --dot <graph output file> path/to/your/input.pdf
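
The four placeholders above can be filled in by a small wrapper that collects every output in one directory. This is only a sketch: the function name extract_all, the output file names and the CARADOC override variable are illustrative choices, not something Caradoc defines.

```shell
# Sketch: run `caradoc extract` with all four outputs written to one directory.
# The output file names below are arbitrary; CARADOC is a hypothetical
# override for the binary name.
extract_all() {
    input=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    "${CARADOC:-caradoc}" extract \
        --xref "$outdir/xref.txt" \
        --dump "$outdir/objects.txt" \
        --types "$outdir/types.txt" \
        --dot "$outdir/graph.dot" \
        "$input"
}
```

Calling extract_all path/to/your/input.pdf out would then leave the four files under out/. Assuming the graph of references is written in Graphviz DOT format, as the --dot flag name suggests, it could be rendered with dot -Tsvg out/graph.dot -o out/graph.svg.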

To print the list of PDF types handled by this version of Caradoc:

caradoc types

To find all references to an object, given its object number (and optionally its generation number, which defaults to zero):

caradoc findref --num 2 input.pdf
caradoc findref --num 2 --gen 5 --show --highlight input.pdf

To find all occurrences of a PDF name:

caradoc findname --name Page input.pdf
caradoc findname --name Resources --show --highlight input.pdf

Interactive mode

You can also try the interactive console user interface:

caradoc ui path/to/your/input.pdf

Ad-hoc parser options

In the relaxed parser mode, most commands accept an option file as a parameter:

caradoc stats --options path/to/option/file input.pdf

The option file must contain one option per line.

The following options are defined, to cope with common errors produced by various PDF software.

Structure of the code

The source code is organized as follows: