Cisco-Talos / clamav

ClamAV - Documentation is here: https://docs.clamav.net
https://www.clamav.net/
GNU General Public License v2.0

clamav does not extract all images from PDF Files #1038

Open JAF84 opened 1 year ago

JAF84 commented 1 year ago

hello,

I often create signatures from pictures extracted from PDFs. I create them from the images that clamscan writes to the temp directory with a command like this:

clamscan -z -d ~custom-sigs/ --debug --leave-temps=yes --tempdir=tmp/ 1.pdf

This is very useful, because a lot of bad emails with PDF attachments reuse the same images.
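The workflow above could be scripted roughly as follows. This is only a minimal sketch: it assumes the files left under the temp directory should each become a hash-based signature line in the `MD5:size:name` layout used by `.hdb` databases. The helper name `hdb_lines` and the `Custom.PdfImage` name prefix are made up for illustration, and the script hashes every extracted file (text and fonts included), so filtering is left to the caller.

```python
import hashlib
import sys
from pathlib import Path


def hdb_lines(tempdir, prefix="Custom.PdfImage"):
    """Yield one 'MD5:size:name' line per regular file found under tempdir."""
    files = sorted(p for p in Path(tempdir).rglob("*") if p.is_file())
    for index, path in enumerate(files):
        data = path.read_bytes()
        digest = hashlib.md5(data).hexdigest()
        # The signature name is hypothetical; use whatever naming scheme you prefer.
        yield f"{digest}:{len(data)}:{prefix}.{index}"


if __name__ == "__main__":
    # Default to the tmp/ directory used in the clamscan command above.
    for line in hdb_lines(sys.argv[1] if len(sys.argv) > 1 else "tmp/"):
        print(line)
```

Running `sigtool --md5 <file>` on a single extracted file produces the same kind of line, so the script is mainly a convenience for handling a whole temp directory at once.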

But take a look at the sample I attached. Here ClamAV extracts text and font files, but no images.

when you use pdfimages from poppler-utils:

pdfimages -list pfizer\ -\ request\ for\ quotation.pdf

page  num  type     width  height  color  comp  bpc  enc    interp  object    ID  x-ppi  y-ppi  size   ratio
1     0    image    500    378     rgb    3     8    image  no      10        0   221    220    24.0K  4.3%
1     1    stencil  264    23      -      1     1    ccitt  no      [inline]      233    122    -      -
1     2    image    400    400     index  1     8    image  no      12        0   113    117    51.2K  33%

you also see these images and can extract them to PNG.
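To pick out the inline and stencil entries from that listing programmatically, something like the following might do. It is a rough sketch that assumes the whitespace-separated column layout shown above, where an inline image prints a single `[inline]` token in place of the object number and ID; `parse_row` and `COLUMNS` are illustrative names, not part of poppler.

```python
COLUMNS = ["page", "num", "type", "width", "height", "color", "comp", "bpc",
           "enc", "interp", "object", "ID", "x-ppi", "y-ppi", "size", "ratio"]


def parse_row(line):
    """Split one data row of `pdfimages -list` into a dict keyed by column name."""
    tokens = line.split()
    # Inline images print one "[inline]" token where object and ID would be,
    # so pad the ID column with "-" to keep the columns lined up.
    if "[inline]" in tokens:
        i = tokens.index("[inline]")
        tokens[i:i + 1] = ["[inline]", "-"]
    if len(tokens) != len(COLUMNS):
        raise ValueError(f"unexpected row: {line!r}")
    return dict(zip(COLUMNS, tokens))


# The three data rows from the listing above.
rows = [
    "1 0 image 500 378 rgb 3 8 image no 10 0 221 220 24.0K 4.3%",
    "1 1 stencil 264 23 - 1 1 ccitt no [inline] 233 122 - -",
    "1 2 image 400 400 index 1 8 image no 12 0 113 117 51.2K 33%",
]

for row in map(parse_row, rows):
    if row["object"] == "[inline]":
        where = "inline"
    else:
        where = f"object {row['object']} {row['ID']}"
    print(row["type"], row["width"], row["height"], row["enc"], where)
```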

I believe this may be a PDF-specific format, or the images sit in some further container?

Do you think you could modify ClamAV so that it also extracts and checks these images?

thanks

br johannes

pfizer - request for quotation.pdf

micahsnyder commented 1 year ago

It's funny you should ask about this. This specific issue is pretty much the next thing on our to-do.

I had done some early research into it to figure out what was going on and yeah I have no idea what the file format is. It seems PDF-specific. I'd love if someone else could tell me! 😆

JAF84 commented 1 year ago

Hello,

here is the PDF Reference: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf

I've attached the uncompressed pdf-file, so it's easier to read it...

Two image types are in the file:

Masked images: described on page 350, "4.8.5 Masked Images". In the attached file, see line 678 onward and line 837 onward.

Inline images: described on page 352, "4.8.6 Inline Images". In the attached file, see line 77 (ID 1...)

The format is PDF-specific; this is not a PNG, a JPEG, or anything similar.
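For illustration, inline images in an already-decoded content stream (as in the uncompressed file) can be located by their `BI` ... `ID` ... `EI` operators. The sketch below is a heuristic and not how ClamAV parses PDFs: it assumes flat dictionary values (no nested `<< >>`), and it finds the end of the data by looking for a whitespace-delimited `EI`, which can misfire if the binary data happens to contain that sequence. `inline_images` is a made-up helper name. An inline image with `/IM true` is an image mask, which is what pdfimages reports as "stencil".

```python
import re

# BI <key/value pairs> ID <one whitespace byte> <image data> <whitespace> EI
INLINE_IMAGE = re.compile(
    rb"(?<![^\s])BI\s(?P<dict>.*?)\sID\s(?P<data>.*?)\sEI(?![^\s])",
    re.DOTALL,
)

# A key is a name such as /W; a value is an array, a name, a number or a boolean.
KEY_VALUE = re.compile(rb"/(\w+)\s*(\[[^\]]*\]|/?[^\s/\[]+)")


def inline_images(content):
    """Yield (params, data) for each BI ... ID ... EI sequence in decoded content."""
    for match in INLINE_IMAGE.finditer(content):
        params = dict(KEY_VALUE.findall(match.group("dict")))
        yield params, match.group("data")


if __name__ == "__main__":
    # Hypothetical content stream; the image bytes are placeholders, not real CCITT data.
    sample = (b"q 264 0 0 23 0 0 cm\n"
              b"BI /W 264 /H 23 /BPC 1 /IM true /F /CCF\n"
              b"ID \x00\x01\x02\xff\nEI Q\n")
    for params, data in inline_images(sample):
        is_stencil = params.get(b"IM") == b"true"
        print(params.get(b"W"), params.get(b"H"), params.get(b"F"), is_stencil, len(data))
```

The abbreviated keys (`/W`, `/H`, `/BPC`, `/IM`, `/F`) and the `/CCF` filter name are the short forms that section 4.8.6 allows inside inline images.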

br johannes

uncompressed.pdf

micahsnyder commented 1 year ago

Thanks @JAF84 this looks like it will be very helpful. I'll add it to our internal Jira as well for reference when we resume work on it.

JAF84 commented 12 months ago

hello,

Look here, this online tool also describes the parts of the PDF: https://www.metadata2go.com/result#j=f53e9f31-d923-418b-973f-6bf1981a85f6

It reports 2 images: one of "type image" and one of "type stencil".

br johannes

micahsnyder commented 10 months ago

Work on this has been postponed to try to meet a different deadline. It's still planned, but is delayed. Sorry!