caradoc-org / caradoc

A PDF parser and validator
GNU General Public License v2.0

Expected positive integer in object trailer #7

Open GhostRock37 opened 7 years ago

GhostRock37 commented 7 years ago

Hello,

I have a problem with a PDF. An antivirus flags it as malformed, and I would like to know in what way it violates the PDF structure.

I also think your tool will be able to clean it. Can you tell me how? Thanks for your help!

./caradoc cleanup ../PDF_MALFORMED/KO/1/1.pdf --out ../PDF_MALFORMED/KO/1/2.pdf
PDF error : Expected positive integer in object trailer at entry /Prev at offset 1872031 [0x1c909f] in file !

This is the end of the PDF:

<< /Pages 1 0 R /Type /Catalog >>
endobj
xref
1 5
0001871801 00000 n
0000000208 00000 n
0001871655 00000 n
0000000012 00000 n
0001871861 00000 n
trailer << /Prev 0 /Root 5 0 R /Size 6 >>
startxref
1871913
%%EOF

gendx commented 7 years ago

Thank you for your report.

The /Prev field in the trailer of an xref section is supposed to be the byte offset in the file at which the previous xref section starts. As such, it must be a non-negative integer. However, since offset 0 is supposed to contain the PDF magic string starting with %PDF, it cannot be zero either.
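
The rule above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how Caradoc implements the check: `check_prev` is a hypothetical helper that inspects the last `trailer` keyword with a regular expression instead of a real PDF parser.

```python
import re


def check_prev(pdf_bytes: bytes) -> str:
    """Classify the /Prev entry of the last trailer found in pdf_bytes."""
    start = pdf_bytes.rfind(b"trailer")
    if start == -1:
        return "no trailer keyword found"
    # Assumption: a regex is good enough here; a real tool parses the dictionary.
    match = re.search(rb"/Prev\s+(-?\d+)", pdf_bytes[start:])
    if match is None:
        return "no /Prev entry"
    offset = int(match.group(1))
    if offset <= 0:
        # Offset 0 holds the %PDF header, so no xref section can start there.
        return f"invalid /Prev {offset}: expected a positive integer"
    if offset >= len(pdf_bytes):
        return f"invalid /Prev {offset}: beyond end of file"
    return f"/Prev {offset} looks plausible"


# Trailer taken from the report above.
tail = b"trailer << /Prev 0 /Root 5 0 R /Size 6 >> startxref 1871913 %%EOF"
print(check_prev(tail))
```

A plausible offset is not proof that an xref section really starts there; this only reproduces the range check discussed here.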

In your case, it may be that /Prev 0 is meant to say that there is no previous xref table. To clean up the file, it might be worth trying to remove the /Prev field altogether (erase it or replace it with spaces). If this doesn't work, could you provide the first few lines of the file?
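
A minimal Python sketch of the "replace it with spaces" variant, assuming the file is small enough to load into memory. `blank_prev_zero` is a hypothetical helper, not part of Caradoc, and it uses a regular expression rather than a real parser. Overwriting with spaces keeps the file length, and therefore every recorded byte offset, unchanged.

```python
import re


def blank_prev_zero(pdf_bytes: bytes) -> bytes:
    """Overwrite a '/Prev 0' entry in the last trailer with spaces."""
    start = pdf_bytes.rfind(b"trailer")
    if start == -1:
        return pdf_bytes
    # Only match a /Prev whose value is exactly 0 (regex shortcut, an assumption).
    match = re.search(rb"/Prev\s+0(?!\d)", pdf_bytes[start:])
    if match is None:
        return pdf_bytes
    begin, end = start + match.start(), start + match.end()
    # Same number of bytes in and out, so no offset in the file moves.
    return pdf_bytes[:begin] + b" " * (end - begin) + pdf_bytes[end:]


tail = b"trailer << /Prev 0 /Root 5 0 R /Size 6 >> startxref 1871913 %%EOF"
fixed = blank_prev_zero(tail)
print(len(fixed) == len(tail), b"/Prev" in fixed)
```

Work on a copy of the file, and re-run the validator on the result rather than assuming the edit was sufficient.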

We might consider adding some code or a manual option to handle this case in relaxed mode in the future.

GhostRock37 commented 7 years ago

Thank you for your reply!

I tried removing the /Prev field, and I now get another error:

pdfVersion : 1.7
Incremental updates : 0
Neither updates nor object streams nor free objects nor encryption
Object count : 5
Filter : FlateDecode -> 2 times
Type error : Unexpected entry /Type in instance of class content_stream in object 3

Below is the output of an xref dump with caradoc:

trailer << /Root 5 0 R /Size 6 >>

obj(1, 0) << /Count 1 /Type /Pages /Kids [4 0 R] >>

obj(2, 0) << /ColorSpace /DeviceRGB /Filter /FlateDecode /Type /XObject /Width 850 /Height 1170 /BitsPerComponent 8 /Subtype /Image /Length 1824804 >>

stream <encoded stream of length 1824804>

obj(3, 0) << /Filter /FlateDecode /Type /Stream /Length 55 >>

stream <encoded stream of length 55>

obj(4, 0) << /Contents 3 0 R /Rotate 0 /CropBox [0.0 0.0 850.0 1170.0] /Type /Page /Resources << /Font << >> /XObject << /Im1 2 0 R >> >> /MediaBox [0.0 0.0 850.0 1170.0] /Parent 1 0 R >>

obj(5, 0) << /Pages 1 0 R /Type /Catalog >>

And here is the beginning of the file:

%PDF-1.7
4 0 obj
<< /Contents 3 0 R /CropBox [0.0 0.0 850.0 1170.0] /MediaBox [0.0 0.0 850.0 1170.0] /Parent 1 0 R /Resources << /Font << >> /XObject << /Im1 2 0 R >> >> /Rotate 0 /Type /Page >>
endobj
2 0 obj
<< /BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter /FlateDecode /Height 1170 /Length 1824804 /Subtype /Image /Type /XObject /Width 850 >>
stream

Another question: I am trying to find a way to convert malformed PDF files into a correct PDF format. We receive a lot of malformed PDFs, and we think we could convert them into well-formed ones. Do you know of any tool or method that could achieve this conversion?

I think there are two techniques to do that.

The first one: export the malformed PDF to a correct PDF. I ran a simple test with PDFCreator: when a malformed PDF is printed to a PDF that conforms to the PDF/A standard, the resulting file is in a correct format. What seems interesting with this technique is that polyglot files become plain PDFs, which is very interesting from a security point of view.

The second: parse the malformed PDF file, correct the errors, then export.

What do you think? Do you know of this type of tool? Can it be used in a web environment (for example, converting a PDF during upload)?

From a security point of view, should files that have been converted to the PDF/A format be clean and no longer pose a virus threat?

Thanks!

gendx commented 7 years ago

It looks like the first explanation was correct in your case (i.e. there should not be a /Prev field because there is no previous xref table).

> I tried removing the /Prev field, and I now get another error:

You now have a type error in an object of type "content_stream" (object 3, which carries /Type /Stream in your dump). The error seems legitimate because the specification does not define a "/Type" field for this type. Also, bear in mind that Caradoc aims to be a strict validator (e.g. to avoid any ambiguities), but that a lot of PDF-producing software is not so strict, and type errors and inaccuracies are not uncommon.
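
To illustrate what a strict check of this kind amounts to, here is a minimal Python sketch. It is not Caradoc's type system: `ALLOWED_STREAM_KEYS` and `unexpected_entries` are hypothetical names, and the key set is assumed to be the list of entries common to all stream dictionaries in the PDF specification.

```python
# Assumed whitelist: entries common to all stream dictionaries.
ALLOWED_STREAM_KEYS = {
    "Length", "Filter", "DecodeParms", "F", "FFilter", "FDecodeParms", "DL",
}


def unexpected_entries(stream_dict: dict) -> list:
    """Return the keys of stream_dict that the whitelist does not define."""
    return sorted(key for key in stream_dict if key not in ALLOWED_STREAM_KEYS)


# Dictionary of object 3 as shown in the dump above.
obj3 = {"Filter": "FlateDecode", "Type": "Stream", "Length": 55}
for key in unexpected_entries(obj3):
    print(f"Unexpected entry /{key} in content stream dictionary")
```

A lenient reader would simply ignore the extra key, whereas a strict validator reports it.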

Besides, this is still a beta version, i.e. the type system does not yet implement all of the 700+ pages of the PDF specification, which requires a large amount of work: the specification describes everything in a natural language (English text) and we have to convert it into a formal language. Even though the most common types are already implemented, you will probably end up with a type error/warning if your PDF input is a bit complex.

> Another question: I am trying to find a way to convert malformed PDF files into a correct PDF format. We receive a lot of malformed PDFs, and we think we could convert them into well-formed ones. Do you know of any tool or method that could achieve this conversion?

Caradoc is a good start to clean up the syntax. However, we do not modify the higher-level content (at least for now), in order to preserve the semantics of the file and avoid inadvertently destroying legitimate features. So yes, another converter (e.g. "printing" to PDF/A) can complement it by removing all kinds of features.

One day we might implement in Caradoc a more thorough converter that only keeps the core graphical content (similarly to the "print" feature that you mention).

Also, bear in mind that some errors are ambiguous, i.e. they are interpreted differently by distinct PDF readers. In that case, the choice made by Caradoc is to reject the file as "unrecoverable".

> I think there are two techniques to do that.
>
> The first one: export the malformed PDF to a correct PDF. I ran a simple test with PDFCreator: when a malformed PDF is printed to a PDF that conforms to the PDF/A standard, the resulting file is in a correct format.

If you trust PDFCreator to be robust against malformed files, it's also a good start.

> What seems interesting with this technique is that polyglot files become plain PDFs, which is very interesting from a security point of view.

In principle, "caradoc cleanup" gets rid of polyglot files by converting the low-level syntax. But the original polyglot needs to be close enough to a PDF file for the normalizer to work. It depends on whether you want broad coverage and accept weird polyglots, or prefer to be stricter about the inputs you get.
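
As a rough illustration of the strict end of that trade-off, the hypothetical Python helpers below only accept inputs whose `%PDF-` header sits at offset 0. Some readers tolerate leading bytes before the header, which is one way polyglots are built. The 1024-byte search window is an assumption made for this sketch, and neither function is part of Caradoc.

```python
from typing import Optional


def header_offset(data: bytes, window: int = 1024) -> Optional[int]:
    """Return the offset of the first %PDF- marker within the window, if any."""
    pos = data.find(b"%PDF-", 0, window)
    return None if pos == -1 else pos


def accept_strict(data: bytes) -> bool:
    """Strict policy: the header must be the very first bytes of the file."""
    return header_offset(data) == 0


# A file that starts like a GIF but also carries a PDF header.
sample = b"GIF89a" + b"%PDF-1.7\n"
print(header_offset(sample), accept_strict(sample))
```

This check alone does not detect every polyglot; it only rejects the ones that rely on data placed before the header.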

> The second: parse the malformed PDF file, correct the errors, then export.

I don't really understand what you mean here. Correct the errors manually?

> What do you think? Do you know of this type of tool? Can it be used in a web environment (for example, converting a PDF during upload)?

There's no reason why it shouldn't work in a web environment. But the converter must be robust enough to not become a threat itself.
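
One way to limit the risk on the server side is to run the converter as a separate process with a timeout, on a private temporary copy of the upload. The Python sketch below assumes a `caradoc` binary on the PATH, reuses the `cleanup ... --out ...` command shape shown earlier in this thread, and assumes that the converter signals failure with a non-zero exit status. `convert_upload` and `caradoc_cleanup_argv` are hypothetical names.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


def caradoc_cleanup_argv(src: Path, dst: Path) -> List[str]:
    # Assumption: "caradoc" is installed and reachable through PATH.
    return ["caradoc", "cleanup", str(src), "--out", str(dst)]


def convert_upload(
    data: bytes,
    build_argv: Callable[[Path, Path], List[str]] = caradoc_cleanup_argv,
    timeout: float = 30.0,
) -> Optional[bytes]:
    """Return the converted bytes, or None if the converter fails or hangs."""
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp) / "in.pdf", Path(tmp) / "out.pdf"
        src.write_bytes(data)
        try:
            result = subprocess.run(
                build_argv(src, dst), capture_output=True, timeout=timeout
            )
        except (OSError, subprocess.TimeoutExpired):
            # Missing binary, or a run that exceeded the time limit.
            return None
        # Assumption: a non-zero exit status means the input was rejected.
        if result.returncode != 0 or not dst.exists():
            return None
        return dst.read_bytes()
```

A timeout and a temporary directory do not sandbox the converter; a real deployment would also restrict the privileges and resources of that process.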

> From a security point of view, should files that have been converted to the PDF/A format be clean and no longer pose a virus threat?

PDF/A is a subset of the specification that may be relevant. However, just as Caradoc's input restrictions can be a problem, PDF/A conversion may damage interesting files, depending on your use case and the features you want to support.

Also, PDF/A conversion is somewhat orthogonal to the syntax sanitisation done by Caradoc, as PDF/A cares mostly about higher-level features (e.g. embedding all fonts inside the file). I am not an expert in PDF/A though, as it is yet another quite large specification. So "PDF/A printing" and "caradoc cleanup" are complementary operations.

Thanks again for your feedback!