caradoc-org / caradoc

A PDF parser and validator
GNU General Public License v2.0
300 stars 21 forks source link

Caradoc does not support dictionnary object in DecodeParms #6

Closed Chimrod closed 8 years ago

Chimrod commented 8 years ago

Hello,

This dictionnary does not seems to be supported by caradoc :

23 0 obj
<<
/Subtype /Image
/ImageMask true
/Width 563
/Height 310
/BitsPerComponent 1
/Filter /CCITTFaxDecode
/DecodeParms 32 0 R
/Length 33 0 R
/Type /XObject
>>
stream
…
endstream

32 0 obj
<<
/K -1
/Columns 563
>>
endobj

The cleanup raise this error :

PDF error : Invalid value for stream /DecodeParms for object 23 at offset 112531 [0x1b793] !

Chimrod commented 8 years ago

Edit : the problem is in the parsestream, which expect to get an array.

gendx commented 8 years ago

Hello,

The problem is that indirect objects are not currently supported in this case, because this requires fetching the object in the document, which is more complex and can lead to a loop (what if the indirect object is located inside the stream that we are now parsing?). However, I am thinking of a fix for the legitimate cases (such as yours).

Note that here, the get_array_of function with the parameter ~accept_one:true expects either an array or an object (in which case it is interpreted as an array of length one). This is weird but it was a choice of the PDF specification (see table 5 of section 7.3.8.2).

Then, the ~transform parameter indicates that we expect array elements to be dictionaries. So the problem is not with the array but with the indirect reference. I will clarify the error message to indicate what the problem is.

Chimrod commented 8 years ago

Fetching an object does not need to be recursive, and the pdf does not support multiple indirections. There is no risk to loop indefinitely.

More generally, the only one objects which could refer to external objects are : Array, Indirect Objects, Stream and Dictionnary.

Array : an array cannot refer to itself, this is a structural error. Indirect Objects : This would cause a type error Stream : we can figure that a entry from the dictionnary refer to itself, but this would cause a type error : because the stream itself would not be from the desired type.

From thoses types, only the dictionnary could refer itself without causing an error (only the keys from required type are interpreted and extra keys should be ignored ; /Type is not mandatory)

gendx commented 8 years ago

Fetching an object does not need to be recursive, and the pdf does not support multiple indirections. There is no risk to loop indefinitely.

I think that you missed the object streams (section 7.5.7). It is possible that a stream dictionary refers to an object that is contained inside this stream:

1 0 obj
<<
   /Length 2 0 R
   ...
>>
stream
...
... <- this stream is an object stream and contains object #2
...
endstream

Besides, a loop is not necessarily an object that directly refers to itself, there can be several objects involved. Some common PDF readers have been fooled by these kind of recursive structures.

And what do you exactly mean by structural and type errors? Which part of the specification?


the pdf does not support multiple indirections

By "support", do you mean common software or the specification? Actually, the specification is not very clear about it (7.3.10) :

Except were documented to the contrary any object value may be a direct or an indirect reference; the semantics are equivalent.

This does not forbid an indirect object to consist itself of an indirect reference, although implementations seem to reject double indirection.

Chimrod commented 8 years ago

By type error, I mean two objects which cannot be unified to the same type. Exactly like :

1 0 obj
<<
   /Length [0 0 595.2800292969 841.8900146484]
>>
endobj

could raise an error like (formated in OCaml ways) :

Error: This expression has type int list but an expression was expected of type int

This is not related to a specification, but to an error witch typed data.

By structural error, I mean an error related to the document construction, like

1 0 obj
<<
[1 0 R]
>>
endobj

which is a valid syntax, and a valid type (the object is an array of indirect object of indefined type), but it leads to incorrect structure.


I'm not familiar with PDF>1.4 and took a look in Objects Streams, but the structure does not seem to allow references to itself. The exemple you give is not valid, it should be :

1 0 obj
<<
   /Length 2 0 R
   ...
>>
stream
...
... <- this stream is an object stream and is object #1, object 2 must be describe elsewhere
...
endstream

(and the Length value itself cannot be contained inside an Object Stream)


About the number of indirection, an indirect object can itself contains another one and this is allowed by the specification. Actually, the indirect type can be type with this structure :

  type t =
    | Null
    | Bool of bool
    | Int of int_t
    | Real of real_t
    | String of string
    | Name of string
    | Array of t list
    | Dictionary of t dict
    | Reference of t (* a reference has the type of the data contained *)
    | Stream of t dict * stream_t

The specification says that a type 'X Reference can be unified with type 'X, but that's all ; the Reference itself remains a distinct type (and this give sense to objects which are explicitely required to be Indirect reference, like the Root entry in a trailer).

gendx commented 8 years ago

For the type error, this is actually what Caradoc does in the type checking phase. But the type checker is only run after the parser, and we need the /Length and /DecodeParms during the parsing phase. The current doc does not detail all these validation phases yet, but we will explain them in a few weeks at the LangSec workshop, and update the doc soon.

I'm not familiar with PDF>1.4 and took a look in Objects Streams

It is quite complex but basically allows to embed several objects into another object (which itself is a stream), in order to compress them. So even though it is not sound to embed inside an object stream the value of its /Length or /DecodeParms, an attacker could do it on purpose to fool the parser. This means that the parser has to somehow track objects to avoid infinite recursion. Here is another example, where each stream contains the parameters for the other stream

1 0 obj
<<
    /Length 3 0 R
    ...
>>
stream
...
... <- this object stream here contains object #4
...
endstream
endobj

2 0 obj
<<
    /Length 4 0 R
    ...
>>
stream
...
... <- this object stream here contains object #3
...
endstream
endobj

A naïve parser would first try to decode object 1, for this purpose needs to fetch object 3 to obtain the length of the stream. Then it would see that it's embedded in object 2 so it has to decode it, and for this purpose needs to fetch object 4, which is in object 1. To avoid these kind of problems we need to mark which objects were already traversed (and reject the file when we detect a loop).

This is one of the reasons why we proposed a simplified format (see the grammar), that e.g. rejects all object streams, because the parser is simpler and easier to verify than a feature-complete parser. It does not cover most PDFs, so we also propose a relaxed (and more complex) parser and a normalizer.


Coming back to references, what about the following examples?

1 0 obj
<<
    /foo 2 0 R
    ....
>>
endobj

2 0 obj
    3 0 R
endobj

3 0 obj
    12345
endobj

and with infinite recursion:

1 0 obj
<<
    /foo 2 0 R
    ....
>>
endobj

2 0 obj
    2 0 R
endobj

In practice, common readers accept neither of these.

gendx commented 8 years ago

This should be fixed in version 0.2 (for the non-strict parser only).