IV1T3 / django-middleware-fileuploadvalidation

A Django middleware to validate user file uploads and detect malicious content.
Apache License 2.0

Implement PDF malicious detection and sanitization support #1

Closed IV1T3 closed 2 years ago

IV1T3 commented 2 years ago

O-checker could provide a good starting point. In particular, pdfanalysis.py illustrates how to detect specific malware inside a PDF.

GitHub: https://github.com/yotsubo/o-checker

Presentation: https://www.blackhat.com/docs/us-16/materials/us-16-Otsubo-O-checker-Detection-of-Malicious-Documents-through-Deviation-from-File-Format-Specifications.pdf

Whitepaper: https://www.blackhat.com/docs/us-16/materials/us-16-Otsubo-O-checker-Detection-of-Malicious-Documents-through-Deviation-from-File-Format-Specifications-wp.pdf

IV1T3 commented 2 years ago

Interesting PDF analysis tool: https://blog.didierstevens.com/programs/pdf-tools/

Also used as a forensic package in Kali: https://tools.kali.org/forensics/pdfid

PyPI package usable as a library: https://github.com/mlodic/pdfid

Example Output of pdfid.py Release 1.0.4 (2021/03/25):

{  
    '/AA': 0,
    '/AcroForm': 0,
    '/Colors > 2^24': 0,
    '/EmbeddedFile': 0,
    '/Encrypt': 0,
    '/JBIG2Decode': 0,
    '/JS': 0,
    '/JavaScript': 0,
    '/Launch': 0,
    '/ObjStm': 0,
    '/OpenAction': 0,
    '/Page': 36,
    '/RichMedia': 0,
    '/XFA': 0,
    'endobj': 469,
    'endstream': 186,
    'filename': 'analyzing.pdf',
    'header': '%PDF-1.4',
    'obj': 469,
    'startxref': 1,
    'stream': 186,
    'trailer': 1,
    'version': '0.2.7',
    'xref': 1
}

The output is described as follows:

Almost every PDF document will contain the first 7 words (obj through startxref), and to a lesser extent stream and endstream. I’ve found a couple of PDF documents without xref or trailer, but these are rare (BTW, this is not an indication of a malicious PDF document).

/Page gives an indication of the number of pages in the PDF document. Most malicious PDF documents have only one page.

/Encrypt indicates that the PDF document has DRM or needs a password to be read.

/ObjStm counts the number of object streams. An object stream is a stream object that can contain other objects, and can therefore be used to obfuscate objects (by using different filters).

/JS and /JavaScript indicate that the PDF document contains JavaScript. Almost all malicious PDF documents that I’ve found in the wild contain JavaScript (to exploit a JavaScript vulnerability and/or to execute a heap spray). Of course, you can also find JavaScript in PDF documents without malicious intent.

/AA and /OpenAction indicate an automatic action to be performed when the page/document is viewed. All malicious PDF documents with JavaScript I’ve seen in the wild had an automatic action to launch the JavaScript without user interaction.

The combination of automatic action and JavaScript makes a PDF document very suspicious.
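As a rough illustration of that rule, the check below flags a document only when both an automatic action and JavaScript are present. It is a minimal sketch that assumes the counts arrive as a dictionary shaped like the example output above; the function name `is_very_suspicious` is made up for this example and is not part of pdfid.

```python
# Keys follow the pdfid-style dictionary shown earlier in this thread.
AUTO_ACTION_KEYS = ("/AA", "/OpenAction")
JAVASCRIPT_KEYS = ("/JS", "/JavaScript")


def is_very_suspicious(counts):
    """Return True when an automatic action and JavaScript are both present.

    `counts` is assumed to map keyword names to integer counters.
    This helper is hypothetical and only illustrates the rule above.
    """
    has_auto_action = any(counts.get(key, 0) > 0 for key in AUTO_ACTION_KEYS)
    has_javascript = any(counts.get(key, 0) > 0 for key in JAVASCRIPT_KEYS)
    return has_auto_action and has_javascript
```

A document with `/OpenAction` but no JavaScript is not flagged by this check on its own, which matches the note that neither indicator alone is conclusive.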

/JBIG2Decode indicates whether the PDF document uses JBIG2 compression. This is not necessarily an indication of a malicious PDF document, but requires further investigation.

/RichMedia is for embedded Flash.

/Launch counts launch actions.

/XFA is for XML Forms Architecture.

A number that appears between parentheses after the counter represents the number of obfuscated occurrences. For example, /JBIG2Decode 1(1) tells you that the PDF document contains the name /JBIG2Decode and that it was obfuscated (using hexcodes, e.g. /JBIG#32Decode).
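To make the hex-code obfuscation concrete: in a PDF name, `#` followed by two hex digits stands for the byte with that value, so `#32` is the character `2`. The sketch below decodes such names and reports whether any escape was used. The helper names are invented for this example and do not reflect how PDFiD implements its check.

```python
import re

# "#" followed by exactly two hex digits encodes one byte inside a PDF name.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_pdf_name(raw_name):
    """Replace every #xx escape in a raw PDF name with the byte it encodes."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw_name)


def is_obfuscated(raw_name):
    """Return True if the name changes when its hex escapes are decoded."""
    return decode_pdf_name(raw_name) != raw_name


print(decode_pdf_name(b"/JBIG#32Decode"))  # b'/JBIG2Decode'
```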

BTW, all the counters can be skewed if the PDF document is saved with incremental updates.

Because PDFiD is just a string scanner (supporting name obfuscation), it will also generate false positives. For example, a simple text file starting with %PDF-1.1 and containing words from the list will also be identified as a PDF document.
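The false-positive case is easy to reproduce with a toy scanner. The sketch below is not PDFiD's code; it is a simplified stand-in (the keyword list and `naive_scan` are assumptions for this example) that only checks the header and counts keyword occurrences, which is enough to show why a plain text file can look like a PDF.

```python
import re

# A small subset of the keywords discussed above, chosen for illustration.
KEYWORDS = (b"/JS", b"/JavaScript", b"/OpenAction", b"/AA")


def naive_scan(data):
    """Count keyword occurrences in anything that starts with a PDF header.

    Returns None when the %PDF- header is missing. No parsing of the
    actual PDF structure takes place, only byte-level matching.
    """
    if not data.startswith(b"%PDF-"):
        return None
    counts = {}
    for keyword in KEYWORDS:
        # The lookahead keeps a keyword from matching a longer name.
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        counts[keyword.decode("ascii")] = len(re.findall(pattern, data))
    return counts


text_file = b"%PDF-1.1\nThis note mentions /JavaScript and /OpenAction.\n"
print(naive_scan(text_file))
```

The text file above is not a valid PDF, yet it is reported with one `/JavaScript` and one `/OpenAction`, so the counters are better treated as hints than as a verdict.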

IV1T3 commented 2 years ago

Submitted PR to extend functionality for DMF: https://github.com/mlodic/pdfid/pull/3

IV1T3 commented 2 years ago

Add more to PDF M score:

/JS 0 # This indicates the presence of JavaScript
/JavaScript 0 # This indicates the presence of JavaScript
/AA 0 # This indicates the presence of automatic action on opening
/OpenAction 0 # This indicates the presence of automatic action on opening
/AcroForm 0 # This indicates the presence of AcroForm which could contain JavaScript
/JBIG2Decode 0 # This indicates the use of JBIG2 compression which could be used for obfuscating content
/RichMedia 0 # This indicates the presence of rich media within the PDF such as Flash
/Launch 0 # This counts the launch actions
/EmbeddedFile 0 # This indicates there are embedded files within the PDF
/XFA 0 # This indicates the presence of XML Forms within the PDF
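One possible way to fold these indicators into a single score is a weighted sum over the keywords that occur at least once. The sketch below is only an illustration: the weights are placeholders invented for this example, not values used by this middleware or by pdfid, and `malicious_score` is a hypothetical name.

```python
# Placeholder weights, made up for illustration; real values need tuning.
HYPOTHETICAL_WEIGHTS = {
    "/JS": 3,
    "/JavaScript": 3,
    "/AA": 2,
    "/OpenAction": 2,
    "/AcroForm": 1,
    "/JBIG2Decode": 1,
    "/RichMedia": 2,
    "/Launch": 3,
    "/EmbeddedFile": 2,
    "/XFA": 1,
}


def malicious_score(counts):
    """Sum the weight of every listed keyword that occurs at least once.

    `counts` is assumed to be a pdfid-style mapping of keyword to counter.
    Keys without a weight (for example /Page) are ignored.
    """
    return sum(
        weight
        for key, weight in HYPOTHETICAL_WEIGHTS.items()
        if counts.get(key, 0) > 0
    )
```

Scoring on presence rather than on the raw counter avoids inflating the result when incremental updates duplicate objects, at the cost of ignoring how often an indicator appears.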

IV1T3 commented 2 years ago

ToDO:

IV1T3 commented 2 years ago

Just created a new PR: https://github.com/mlodic/pdfid/pull/4.

This feature will allow PDFs to be sanitized in memory.

IV1T3 commented 2 years ago

PDF detection and sanitization now implemented as of commit 2a7fe33.