Cisco-Talos / clamav

ClamAV - Documentation is here: https://docs.clamav.net
https://www.clamav.net/
GNU General Public License v2.0

Scanning big PPTX file is very slow #1139

Open JustinasKO opened 8 months ago

JustinasKO commented 8 months ago

Describe the bug

ClamAV 1.2.1/27129/Wed Dec 20 11:38:37 2023

When running a scan on larger container files such as pptx, the scan takes ages. For example, a 75 MB pptx file on my MacBook Pro with an M2 chip usually takes ~70s. Whether multithreading is enabled or not, the scan always uses 99% of a single CPU, and clamdtop shows that it uses 1 thread. The scan is even slower on Intel CPU VMs on AWS EC2, with very little difference between instance types, even on the faster compute-oriented ones.

➜  ~ clamdscan ~/Downloads/test_presentation.pptx
/Users/me/Downloads/test_presentation.pptx: OK

----------- SCAN SUMMARY -----------
Infected files: 0
Time: 69.083 sec (1 m 9 s)
Start Date: 2023:12:22 15:25:49
End Date:   2023:12:22 15:26:58

If the file is unarchived first (the unzip utility takes 1-2s), creating the directory test_presentation, a single-threaded scan of that directory takes ~16s.

➜  ~ clamdscan ~/Downloads/test_presentation/   
/Users/me/Downloads/test_presentation: OK

----------- SCAN SUMMARY -----------
Infected files: 0
Time: 16.110 sec (0 m 16 s)
Start Date: 2023:12:22 16:09:34
End Date:   2023:12:22 16:09:50

And finally, the same directory with multithreading enabled (10 threads) takes ~3s, and clamdtop shows all 10 threads utilised.

➜  ~ clamdscan ~/Downloads/test_presentation/ -m
/Users/me/Downloads/test_presentation: OK

----------- SCAN SUMMARY -----------
Infected files: 0
Time: 3.168 sec (0 m 3 s)
Start Date: 2023:12:22 16:04:20
End Date:   2023:12:22 16:04:24
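
To put the three timings side by side, here is a small sketch that computes the speedup of each extracted-directory scan relative to scanning the pptx container directly. The figures are copied from the scan summaries above; the `speedup` helper is a name made up for this illustration, and the 1-2s spent in unzip is not included.

```python
# Scan times in seconds, copied from the three SCAN SUMMARY blocks above.
CONTAINER_SCAN = 69.083       # clamdscan on the .pptx itself
DIR_SINGLE_THREAD = 16.110    # clamdscan on the extracted directory
DIR_MULTISCAN = 3.168         # clamdscan -m on the extracted directory


def speedup(baseline: float, other: float) -> float:
    """Return how many times faster `other` is than `baseline` (hypothetical helper)."""
    return baseline / other


if __name__ == "__main__":
    print(f"extracted, single thread: {speedup(CONTAINER_SCAN, DIR_SINGLE_THREAD):.1f}x")
    print(f"extracted, -m:            {speedup(CONTAINER_SCAN, DIR_MULTISCAN):.1f}x")
```

So extraction alone accounts for roughly a 4x difference before any threading is involved, and the multithreaded directory scan is roughly 22x faster than the container scan.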

Why is ClamAV not utilising multithreading on container-type files? Can this be configured somehow? Are there any plans to improve scan performance for container-type files?

How to reproduce the problem

Can't share exact file here but please contact me to get it privately.

ragusaa commented 8 months ago

Hi,

Thank you for the report. I am creating a ticket internally to track this issue, and will let you know when it's scheduled. I am reaching out to ask for this file.

Thanks, Andy

micahsnyder commented 6 months ago

Checking in on this. @JustinasKO were you able to share the file with @ragusaa? You may try sharing it via direct message on Discord: https://docs.clamav.net/#chat

It's hard to say what is causing scan performance issues with this file. We'll need to test it. Without this, there isn't much to be done for your specific request.

On a related topic: There are some concerns over scan performance with images due to a few bytecode signatures that take a relatively long time to run. These are:

I'm looking into having these removed as the CVEs are ~7 years old now and no one should be affected anymore. I'm hopeful this will improve scan performance for files with a lot of JPEG attachments.

You asked about multithreading. ClamD supports scanning multiple files concurrently when it is given the path to a directory, and it uses multiple threads when multiple clients request scans simultaneously.
However, files extracted from archives are not multithreaded. It would require a large redesign of multiple internal components to support concurrency in scanning embedded/extracted content.
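
Until that changes, one possible workaround, based only on the timings in the original report, is to unpack the container yourself and point `clamdscan -m` at the resulting directory. The sketch below is a minimal illustration, not an official ClamAV recommendation. The function names are made up for this example, and it assumes that `clamdscan` is on the PATH and that the clamd daemon can read the temporary directory. Note that it only unpacks the outermost ZIP layer, and that extracting untrusted archives yourself bypasses the size and recursion limits ClamAV would normally apply.

```python
import subprocess
import sys
import tempfile
import zipfile
from pathlib import Path


def extract_container(container: Path, dest: Path) -> list[Path]:
    """Unpack a ZIP-based container (e.g. .pptx) into dest; return the extracted files."""
    with zipfile.ZipFile(container) as zf:
        zf.extractall(dest)
    return sorted(p for p in dest.rglob("*") if p.is_file())


def build_scan_command(target: Path) -> list[str]:
    """Build a clamdscan invocation that asks clamd to scan target in multiscan mode."""
    # --multiscan is the long form of the -m flag used in the report above.
    return ["clamdscan", "--multiscan", str(target)]


def scan_extracted(container: Path) -> int:
    """Extract the container to a temporary directory and scan that directory."""
    with tempfile.TemporaryDirectory() as tmp:
        dest = Path(tmp)
        extract_container(container, dest)
        # Assumption: clamd runs as a user that is allowed to read `dest`.
        return subprocess.run(build_scan_command(dest)).returncode


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: scan_extracted.py <container file>")
    sys.exit(scan_extracted(Path(sys.argv[1])))
```

The exit status is passed through from `clamdscan`, so a caller can treat it the same way as a direct scan of the original file.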