caradoc-org / caradoc

A PDF parser and validator
GNU General Public License v2.0
300 stars 21 forks source link

Save all stream with FlateDecode filter #5

Closed Chimrod closed 8 years ago

Chimrod commented 8 years ago

This commit rewrite all streams which have been succesfully decoded using a FlateDecode filter.

Rewriting all the streams ?

There are many streams allowed in a pdf document, some binary, some of them in plain text. The PDF specs allow the use of any filter for any stream, which can allow to hide malicious scripts inside a document.

Using a single filter in the cleaned document gives some protection with those attacks.

Why FlateDecode ?

Because it is gives a compression with a control checksum, and is non destructive : it can be applied at any place without modification in the document.

Additionnaly stuff

This commit also give a full explanation on stream which has been decoded and those which could'nt : only one represention is keep in memory instead of the both as initially. If the steam could be decode, only the decoded one is available.

This will allow later developement for parsing stream content and check errors inside the stream :

gendx commented 8 years ago

Hello,

While I agree that adding an option to re-encode all streams is a good idea, I see some caveats.

So I think that keeping the raw encoded stream in all cases is still useful for now, but adding an option to manually re-encode streams with /FlateDecode is a good idea (along with an option to simply decompress everything). I also plan to add support for decoding RunLength and ASCII85 filters is the next days.

Regarding content streams, I totally agree and I also started a prototype to parse them, I will push it once it is mature enough.

Also, be careful not to add executable permission to source files with git (e.g. src/data/pdfobject.ml is in "100755" mode in your commit).

Chimrod commented 8 years ago

Thanks for your detailed answer. I was not sure if the project accept external contributions.

Using a filter or not does not change the document's content : it's only an (arbitrary) normalisation choice, in order to help externals tools to analyze the pdf without surprises.

Completely removing filters could be a good solution on paper, but this would produce documents with a size to heavy for a real usage. (I think you just suggest this idea without consider it as a good solution too, I do not develop this point).

The problem this the actual stream api is that the raw content and the decoded raw can be completely different (as seen in some tests). Keeping two differents representation from the same object is not a good practice, and OCaml syntax can help us to ensure that the stream is available for parsing or not :

type stream
| Raw of string
| Decoded of string
| Offset of int

explicitely declare if the stream is the original one or has been decoded by Caradoc. But, if you want to keep a track of original stream, I can make the proposition for this structure :

type stream
| Raw of string
| Decoded of string * (Pdfobject.dic * BoundedInt.t) (* decoded stream, and original Dictionary + offset in the document *)
| Offset of  BoundedInt.t

which does not lost the information about the raw data (without keeping two string in memory) ; this could also answer the point to rewrite a stream which was already this flatedecode filter.

You got the point for the Predicator : in my implementation I have to remove the key from the dictionnary in order to prevent invalid document.

If you are interested in integrating my changes, I can try update the code for keeping a raw stream for image and font stream, and keep a decoded stream for page content only, with the remarks above.

gendx commented 8 years ago

I think that a cleaner design would be to encapsulate the stream type into a module Stream (that would merge with the current src/stream/parsestream.ml). This would allow to cache the two representations internally (to save encoding/decoding time) but the public interface would consistently expose encoded or decoded content.

This requires refactoring a bit the PDFObject module, i.e. separate it into "basic objects" (everything but streams) and "indirect objects". This is also cleaner because arrays or dictionaries can only contain "basic objects" (whereas the current representation does not protect against manually putting a stream in them).

Also, the Offset constructor is only a temporary representation used at parsing stage (in the relaxed parser when the dictionary was parsed but the /Length field was not extracted yet), so we can separate it from the stream type.

Last, regarding memory usage, the decoded representation of a stream should in general be larger than the encoded representation. So contrary to what you propose, I think that keeping both strings in memory should not be a problem (in practice it's dozens of MB for large PDFs). On the other hand, keeping an offset in the file is not very clean IMHO (when do you close the original file? what if the file disappears or is modified in the meantime?).

So we roughly come up with the following types:

type basic_t =
| Null
| Int of BoundedInt.t
| ...

type dict_t = basic_t dict

(* representation not exposed directly to the public interface *)
type stream_t = {
    d : dict_t;
    encoded : string;
    decoded : string option;
}

type indirect_t =
| Basic of basic_t
| Stream of stream_t

type incomplete_t =
| Complete of basic_t
| StreamOffset of dict_t * BoundedInt.t

I am working on this.

Chimrod commented 8 years ago

Althougth it's common usage in a pdf file to define one object for one token, the pdf structure allow to embed any object in a dictionnary or in an array (except for some explicitly declared cases, like the /Root entry in a trailer dictionnary), even a stream may be embeded inside one object. How would you represent such structures ? I think the actual structure for token, which places a stream at the same position than Null or Int is good. And this also allow to encapsulating a stream in it's own module :

(** module pdfobject *)
  type t =
    | Null
    | Bool of …
    | Int of …
    | Real of …
    | String of …
    | Name of …
    | Array of …
    | Dictionary of …
    | Reference of …
    | Stream of Stream.t

Internally, if have no opinion about how represent the stream, as long as the module provide some functions for extracting the decoded stream easily.

I just suggest to use lazy instead of option inside the stream structure :

(* representation not exposed directly to the public interface *)
type stream_t = {
    d : dict_t;
    encoded : string;
    decoded : string lazy;
}

in order to decode only required streams.

If you are working on it, I do not continue my work in this direction and close this pull request. But I'm interested by the project I would be happy to give you some help if you need it :-)

gendx commented 8 years ago

Althougth it's common usage in a pdf file to define one object for one token, the pdf structure allow to embed any object in a dictionnary or in an array (except for some explicitly declared cases, like the /Root entry in a trailer dictionnary), even a stream may be embeded inside one object.

According to the specification (7.3.8.1):

All streams shall be indirect objects

So you cannot embed directly a stream inside e.g. an array (but you can use an indirect reference for this purpose).

How would you represent such structures ?

Since such structures would be ill-formed, we do not want to represent them (and the parsers of Caradoc do not accept them). To the best of my knowledge, no "useful" PDF created by common software contains embedded streams like this. So files with such structures are probably hand-crafted and/or malicious :-) And even though some PDF readers might accept them, there is probably an ambiguity of interpretation at this level !

If you are working on it, I do not continue my work in this direction and close this pull request. But I'm interested by the project I would be happy to give you some help if you need it :-)

Thanks ! Your suggestions and bug reports are already valuable ! Other than that, you could also have a look at the TODOs pending in the sources and the test cases.

gendx commented 8 years ago

Your proposal is implemented in version 0.2, but you can choose which filter you want to use. See this folder for examples : https://github.com/ANSSI-FR/caradoc/tree/master/test_files/cleanup