PUNCH-Cyber / stoq-plugins-public

stoQ Public Plugins
https://stoq.punchcyber.com
Apache License 2.0

Extracting files from ZIP archives based on MIME type #108

Closed (serializingme closed 4 years ago)

serializingme commented 4 years ago

I have various ZIP archives, each several GiB in size, which expand to several GiB more. These archives contain many different file types, but I'm only interested in a few specific ones. The decompress plugin won't quite cut it, as it doesn't support filtering which files are extracted based on MIME type (I'm looking at reducing disk space and IO usage, and potentially extraction time).

As a proof of concept I came up with the code below (based on your mimetype plugin). It opens the ZIP file, checks the MIME type of each archived file, and extracts the file if the MIME type matches. It works quite nicely. The questions are:

#!/usr/bin/env python3
import magic
import zipfile

# Two different libraries install a module named "magic": python-magic
# provides Magic.from_buffer(), while filemagic provides Magic.id_buffer().
USE_PYTHON_MAGIC = hasattr(magic.Magic, 'from_buffer')

with zipfile.ZipFile('archive-0001.zip') as targetzip:
    for member in targetzip.namelist():
        with targetzip.open(member) as compressedfile:
            # Read only the first 1000 bytes so the member is not fully decompressed.
            payload = compressedfile.read(1000)

            if USE_PYTHON_MAGIC:
                magic_scan = magic.Magic(mime=True)
                magic_result = magic_scan.from_buffer(payload)
            else:
                with magic.Magic(flags=magic.MAGIC_MIME_TYPE) as m:
                    magic_result = m.id_buffer(payload)

            # Some library versions return bytes instead of str.
            if hasattr(magic_result, 'decode'):
                magic_result = magic_result.decode('utf-8')

            # Skip every member that is not a DOS/Windows executable.
            if magic_result != 'application/x-dosexec':
                continue

        targetzip.extract(member, path='./extracted')

mlaferrera commented 4 years ago

Hi @serializingme -- This probably won't work well with the decompress plugin, since it leverages various decompression executables and therefore extracts every compressed file before stoQ touches the content. So a standalone plugin would probably be best.
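
To illustrate the difference from unpacking everything with an external executable, here is a minimal sketch of a standalone filter built only on Python's `zipfile` module. It reads just the first bytes of each member and extracts only the matches. The names `extract_matching`, `sniff_mime`, `HEAD_BYTES` and `demo` are hypothetical, and `sniff_mime` is a deliberately crude stand-in for libmagic (it only recognises the `MZ` header) so that the example runs without extra dependencies; a real plugin would presumably plug in the `magic` call from the proof of concept instead.

```python
import io
import os
import tempfile
import zipfile
from typing import Callable, List, Set

HEAD_BYTES = 1000  # same sniff size as the proof of concept (an assumption, not a tuned value)


def sniff_mime(head: bytes) -> str:
    """Hypothetical stand-in for libmagic: recognises only the 'MZ' header."""
    if head[:2] == b"MZ":
        return "application/x-dosexec"
    return "application/octet-stream"


def extract_matching(
    archive,
    dest: str,
    wanted: Set[str],
    detect: Callable[[bytes], str] = sniff_mime,
) -> List[str]:
    """Extract only the members whose leading bytes map to a wanted MIME type."""
    extracted = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no content to sniff
            with zf.open(info) as member:
                head = member.read(HEAD_BYTES)
            if detect(head) in wanted:
                zf.extract(info, path=dest)
                extracted.append(info.filename)
    return extracted


def demo():
    """Build a tiny archive in memory and extract only the fake executable."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("tool.exe", b"MZ\x90\x00" + bytes(64))
        zf.writestr("notes.txt", b"just some text")
    buf.seek(0)
    with tempfile.TemporaryDirectory() as dest:
        names = extract_matching(buf, dest, {"application/x-dosexec"})
        return names, sorted(os.listdir(dest))
```

Passing the detector in as a parameter keeps the archive handling separate from the MIME lookup, so either `magic` library could be swapped in without touching the extraction loop.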

As for your second question, it really depends on what type of plugin class you'd like this one to be. If you want it to be a provider plugin that is capable of only processing zip files, then extracting and adding them to the stoQ Queue object would work. If you want it to be used in a standard pipeline, then a worker plugin would be best. With that said, you could also write it as a multiclass plugin where it supports both plugin classes.
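
The two roles described above can be sketched roughly as follows. This is not stoQ's plugin API: the class `ZipMimeFilter`, its methods `ingest` and `scan`, the dictionary layout placed on the queue, and the `sniff_mime` stand-in for libmagic are all hypothetical and only meant to show how one filtering routine could back both a provider-style and a worker-style entry point.

```python
import io
import queue
import zipfile
from typing import Iterator, List, Tuple

WANTED = {"application/x-dosexec"}


def sniff_mime(head: bytes) -> str:
    """Hypothetical stand-in for libmagic: recognises only the 'MZ' header."""
    return "application/x-dosexec" if head[:2] == b"MZ" else "application/octet-stream"


def matching_members(zf: zipfile.ZipFile) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, content) for members whose first bytes map to a wanted type."""
    for info in zf.infolist():
        if info.is_dir():
            continue
        with zf.open(info) as member:
            head = member.read(1000)
            if sniff_mime(head) in WANTED:
                # Only matching members are read to the end.
                yield info.filename, head + member.read()


class ZipMimeFilter:
    """Hypothetical class; the method names are illustrative, not stoQ's API."""

    def ingest(self, path, work_queue: queue.Queue) -> int:
        """Provider-style: open an archive itself and queue each match."""
        count = 0
        with zipfile.ZipFile(path) as zf:
            for name, content in matching_members(zf):
                work_queue.put({"name": name, "content": content})
                count += 1
        return count

    def scan(self, payload: bytes) -> List[Tuple[str, bytes]]:
        """Worker-style: receive archive bytes from a pipeline and return matches."""
        with zipfile.ZipFile(io.BytesIO(payload)) as zf:
            return list(matching_members(zf))
```

One thing the sketch makes visible: the worker-style `scan` needs the whole archive in memory as `payload`, while the provider-style `ingest` reads from a path. If a real worker plugin also receives the archive as an in-memory payload, that could matter for multi-GiB archives.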

I hope this helps!

serializingme commented 4 years ago

Thanks for the clarifications :D I will focus on implementing it as a provider plugin since it fits my use case best. Once that is done I can potentially look into implementing it as a worker plugin.