CIRCL / Circlean

USB key cleaner
https://www.circl.lu/projects/CIRCLean/
BSD 3-Clause "New" or "Revised" License

Sanitize PDF instead of converting to HTML #11

Open moleculezz opened 9 years ago

moleculezz commented 9 years ago

In some cases you really want to have a PDF file in the end, especially when you are dealing with print production files that need to be in PDF format.

Is there a way to choose to sanitize instead of converting to HTML?

Rafiot commented 9 years ago

There is a PDF format called PDF/A [1] that is more sane than the normal one and removes all the active content. My plan is to use it at some point as an intermediary format before the conversion to HTML.

That's probably the one I would be using in your case.

To answer your question, I am not aware of a way to sanitize a PDF document, because there is also no way to know what an "insane" document is. Something worth a try would be to convert known malicious PDF documents to PDF/A and see if the outcome is still malicious.
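A rough way to run that experiment is to compare how many "active content" names appear in a file before and after conversion. The following is a minimal sketch under stated assumptions: the list of names and the function names `count_risky_names` and `still_suspicious` are illustrative, not part of CIRCLean. It only scans raw bytes, so it misses names hidden in compressed object streams or written with hex escapes. The PDF/A conversion itself is left out.

```python
import re

# Illustrative, non-exhaustive list of PDF names tied to active content.
RISKY_NAMES = (b"/JavaScript", b"/JS", b"/OpenAction", b"/AA", b"/Launch")


def count_risky_names(pdf_bytes):
    """Count occurrences of each risky name in the raw bytes of a PDF."""
    counts = {}
    for name in RISKY_NAMES:
        # Reject matches that continue as a longer name, so /JS does not
        # match inside /JSFoo or /JavaScript.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9_.#\-])"
        counts[name.decode()] = len(re.findall(pattern, pdf_bytes))
    return counts


def still_suspicious(pdf_bytes):
    """Return True if any risky name is still present."""
    return any(count_risky_names(pdf_bytes).values())


# Usage with in-memory stand-ins for the file before and after conversion.
before = b"<< /OpenAction 2 0 R >> << /S /JavaScript /JS (app.alert(1)) >>"
after = b"<< /Type /Catalog /Pages 2 0 R >>"
print(count_risky_names(before))
print(still_suspicious(after))
```

A clean count does not prove that a file is harmless. It only shows that the obvious triggers are gone, so this is a first filter rather than a verdict.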

Another way to do it would be to convert the content of the documents to images and put them all together in a PDF file. Correct me if I'm wrong, but that should be fine if you want to print the documents (the reason behind the conversion to HTML is that users can still copy/paste the content as text).

[1] https://en.wikipedia.org/wiki/PDF/A

moshekaplan commented 7 years ago

A while back I wrote a script based on PyPDF2 and Wand to sanitize PDFs by converting them into images and stitching them back together into a single PDF. It is only one of many possibilities: https://github.com/moshekaplan/SafePDF

Rafiot commented 7 years ago

Thank you for the reference, but there is a problem with this approach on the default image: many of the users want to be able to copy text out of the PDFs.

If you have a use case, I strongly recommend that you develop a dedicated script, or give me more details on your use case so we can have a look.

moshekaplan commented 7 years ago

@Rafiot: I address that by extracting the text first and then embedding it inside of the PDF. See https://github.com/moshekaplan/SafePDF/blob/master/SafePDF.py#L53

wrickaz commented 7 years ago

Hello,

What about disarming PDFs with pdfid and copying the disarmed PDF to the clean flash drive instead of marking it as dangerous?
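For context, disarming in this sense usually means rewriting the names that trigger active content so that a PDF reader no longer recognizes them. The sketch below illustrates the idea by swapping the case of a few such names, which keeps the file length (and therefore the cross-reference offsets) unchanged. It is an assumption-laden illustration, not pdfid's actual code: the name list and the function `disarm` are hypothetical, and names inside compressed streams or written with hex escapes are not handled.

```python
import re

# Illustrative list of names to neutralize. The negative lookahead keeps
# longer names such as /JSFoo untouched.
_RISKY = re.compile(
    rb"/(JavaScript|JS|OpenAction|AA|Launch)(?![A-Za-z0-9_.#\-])"
)


def disarm(pdf_bytes):
    """Return (disarmed_bytes, number_of_names_changed).

    Each risky name has its case swapped (for example /JS becomes /js),
    so the output has exactly the same length as the input.
    """
    return _RISKY.subn(lambda m: m.group(0).swapcase(), pdf_bytes)


# Usage on an in-memory fragment.
data = b"<< /OpenAction 2 0 R /JS (app.alert(1)) /JSFoo 1 >>"
disarmed, changed = disarm(data)
print(changed, len(disarmed) == len(data))
```

Returning the number of changed names lets the caller still report that something suspicious was found, rather than cleaning the file silently.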

Rafiot commented 7 years ago

This is a good point, and I'm not sure about it. My approach for now is that if you have files on the USB key and you don't really know what is on it, but one of the PDFs contains something dodgy, I'd rather inform the user about that fact.