Cisco-Talos / clamav

ClamAV - Documentation is here: https://docs.clamav.net
https://www.clamav.net/
GNU General Public License v2.0

Text is not parsed in some PDFs #937

Open SecT0uch opened 1 year ago

SecT0uch commented 1 year ago

Describe the bug

I have a PDF sample from which ClamAV cannot extract the text, although pdftotext (https://poppler.freedesktop.org) and pdf2txt.py (https://pdfminersix.readthedocs.io/en/latest/) both can.

How to reproduce the problem

Run clamscan CG.pdf --leave-temps to keep the normalized files.

The document contains the following text: "This document contains protected files". However, grep -ri protected /tmp/20230601_094730-CG.pdf.7c648d30d0 returns nothing.

Output from the ClamAV command:

Config file: clamd.conf
-----------------------
LogFile = "/var/log/clamav/clamd.log"
LogTime = "yes"
PidFile = "/run/clamav/clamd.pid"
TemporaryDirectory = "/tmp"
LocalSocket = "/run/clamav/clamd.ctl"
User = "clamav"

Config file: freshclam.conf
---------------------------
PidFile = "/run/clamav/freshclam.pid"
UpdateLogFile = "/var/log/clamav/freshclam.log"
DatabaseMirror = "database.clamav.net"

Config file: clamav-milter.conf
-------------------------------
LogFile = "/var/log/clamav/clamav-milter.log"
LogTime = "yes"
PidFile = "/run/clamav/clamav-milter.pid"
TemporaryDirectory = "/tmp"
User = "clamav"

Software settings
-----------------
Version: 1.0.1
Optional features supported: MEMPOOL AUTOIT_EA06 BZIP2 LIBXML2 PCRE2 ICONV JSON RAR 

Database information
--------------------
Database directory: /var/lib/clamav
daily.cvd: version 26916, sigs: 2035172, built on Tue May 23 09:22:39 2023
main.cvd: version 62, sigs: 6647427, built on Thu Sep 16 14:32:42 2021
bytecode.cvd: version 334, sigs: 91, built on Wed Feb 22 22:33:21 2023
Total number of signatures: 8682690

Platform information
--------------------
uname: Linux 6.3.5-1-MANJARO #1 SMP PREEMPT_DYNAMIC Tue May 30 16:59:18 UTC 2023 x86_64
OS: Linux, ARCH: x86_64, CPU: x86_64
Full OS version: "Manjaro Linux"
zlib version: 1.2.13 (1.2.13), compile flags: a9
platform id: 0x0a21a1a108000000000c0201

Build information
-----------------
GNU C: 12.2.1 20230201 (12.2.1)
sizeof(void*) = 8
Engine flevel: 161, dconf: 161

Attachments

It is a phishing PDF, containing a link to a malicious website.

File can be found here: https://www.virustotal.com/gui/file/5faeb2ce23c9f86b085ec11733bc711ba1ca410d506e9df0be5f19c2db1730cc

Sanesecurity commented 1 year ago

Not sure if it's the same sort of issue, but here is part of a PDF section extracted with --leave-temps:

[<0044> <0065> <0061> <0072> <0009> <0063> <006C> <0069> <006E> <0074> <002C> <0056> <0066> <0079> <0064> <0073> <0075> <006F> <0070> <006D> <007A> <002E> <0054> <0068> <006A> <006B> <0067> <0076>

The text in the pdf is:

"Dear client, We are sorry to inform you that you account is currently frozen you can't (deposit, withdrawal, convert, transfer...) any of your funds until you confirm your account details."

but no text is made available to match on.

Example:

https://www.virustotal.com/gui/file/f5f3708dbca1f427834232cc6f1d5d755891547a683c1c7e18bbe6da527db6d4/details
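
To make the fragment above easier to read, here is a minimal Python sketch that decodes each <XXXX> token as a single UTF-16BE code unit. That interpretation is an assumption (it is how hex strings are commonly used with two-byte font encodings), and the helper name decode_hex_tokens is hypothetical, not part of ClamAV. The fragment is truncated in the report, so only the tokens shown are decoded.

```python
import re

# Hex tokens copied verbatim from the fragment quoted in this comment.
# The original paste is truncated, so this is not the complete array.
FRAGMENT = (
    "[<0044> <0065> <0061> <0072> <0009> <0063> <006C> <0069> <006E> <0074> "
    "<002C> <0056> <0066> <0079> <0064> <0073> <0075> <006F> <0070> <006D> "
    "<007A> <002E> <0054> <0068> <006A> <006B> <0067> <0076>"
)


def decode_hex_tokens(text):
    """Decode every <XXXX> token as one UTF-16BE code unit (assumed encoding)."""
    tokens = re.findall(r"<([0-9A-Fa-f]{4})>", text)
    return "".join(chr(int(token, 16)) for token in tokens)


decoded = decode_hex_tokens(FRAGMENT)
print(repr(decoded))  # prints 'Dear\tclint,Vfydsuopmz.Thjkgv'

# Every decoded character occurs exactly once in this fragment.
print(len(decoded) == len(set(decoded)))  # prints True
```

Because no character repeats, the fragment looks more like a table of the distinct characters used in the message (for example, part of a font's character mapping) than like the message text itself. That is only a guess from this one excerpt, but it would be consistent with no matchable text appearing in the temp files.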

SecT0uch commented 1 year ago

Unfortunately I don't have VT premium to test your file, but this sounds like the right direction. I don't see the same pattern in my sample, but I will try to look for the string as hex in other formats.
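
A search like the one described above could be sketched in Python as follows. This is only an illustration: the function names candidate_patterns and search_tree are hypothetical, the set of encodings tried (UTF-8, UTF-16BE, UTF-16LE, and the hex spelling of each) is an assumption about where the text might be hiding, and the directory would be whatever clamscan --leave-temps left behind.

```python
import os


def candidate_patterns(phrase):
    """Return byte patterns under which `phrase` might appear in a dumped file."""
    patterns = {}
    for encoding in ("utf-8", "utf-16-be", "utf-16-le"):
        raw = phrase.encode(encoding)
        patterns[encoding] = raw
        # Hex strings spell each byte as two hex digits, e.g. 0044 for "D".
        patterns[encoding + " hex"] = raw.hex().encode("ascii")
    return patterns


def search_tree(root, phrase):
    """Yield (path, pattern_name) for each file under `root` containing `phrase`."""
    patterns = candidate_patterns(phrase)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            with open(path, "rb") as handle:
                data = handle.read()
            lowered = data.lower()  # hex digits may be written in either case
            for name, pattern in patterns.items():
                haystack = lowered if name.endswith("hex") else data
                if pattern in haystack:
                    yield path, name


if __name__ == "__main__":
    # Directory name taken from the report; adjust to the actual temp directory.
    for path, name in search_tree("/tmp/20230601_094730-CG.pdf.7c648d30d0", "protected"):
        print(f"{path}: found as {name}")
```

Unlike grep -ri, the raw-byte comparisons here are case-sensitive, and hex strings that are split by whitespace or encoded with font-specific glyph IDs rather than Unicode values would not be found, so an empty result does not prove the text is absent.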