PUNCH-Cyber / stoq

An open source framework for enterprise-level automated analysis.
https://stoq.punchcyber.com
Apache License 2.0

make md5 part of core #153

Closed: ytreister closed this issue 4 years ago

ytreister commented 4 years ago

I have at least two plugins that rely on the payload's MD5, and I plan on adding more dependencies on MD5 (see my latest comment on issue #122)...

I think it would make sense to include MD5 in the core so that all worker plugins can retrieve the MD5 from the Payload object that gets passed in.

mlaferrera commented 4 years ago

I don't have any plans to add hashing to the core framework; this is much better suited to a plugin.

In the plugin configuration for each plugin that requires the md5 hash, you can set the required_workers[1] option to hash. stoQ will ensure that the payload is hashed before it is sent to your plugins, and your plugin will then be able to access the results via the Payload object.

[1] https://stoq-framework.readthedocs.io/en/latest/dev/workers.html#required-workers
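(A minimal sketch of what that wiring might look like inside a worker, assuming the stoQ v3 WorkerPlugin API. The class name is hypothetical, and the exact way the hash worker's results are exposed on the Payload is an assumption; see the linked docs for the supported ways to declare required_workers.)

from typing import Optional

from stoq import Payload, Request, WorkerResponse
from stoq.plugins import WorkerPlugin


class Md5Consumer(WorkerPlugin):
    # Ask stoQ to run the 'hash' worker before this one; the docs linked
    # above also describe declaring this in the plugin's configuration file.
    required_workers = {'hash'}

    async def scan(self, payload: Payload, request: Request) -> Optional[WorkerResponse]:
        # By the time scan() runs, the 'hash' worker's results should already
        # be attached to the payload; this access pattern is an assumption.
        md5_hash = payload.results.workers.get('hash', {}).get('md5')
        ...
        return None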

ytreister commented 4 years ago

Yes, this is currently how I do it: I have a hash plugin and use required_workers. I have at least three plugins where, rather than sending the payload, I can send an MD5, because those plugins store previous results; this improves performance. But because I have to wait for the hash plugin to run, the plugins that need the MD5 are delayed by an additional round. This sometimes hurts performance when another long-running task shares the round in which the MD5 gets computed.
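(The lookup-by-hash pattern being described is roughly the following; every name here is hypothetical and stands in for a backend, such as metadefender, that stores previous results.)

from hashlib import md5

def scan_or_lookup(service, content: bytes) -> dict:
    # Query the service by md5 first, and only upload the full payload
    # when no previous result is stored for that hash.
    digest = md5(content).hexdigest()
    cached = service.lookup_by_md5(digest)  # hypothetical service API
    if cached is not None:
        return cached  # hash hit: reuse the stored result, skip the upload
    return service.scan(content)  # hash miss: fall back to a full scan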

Perhaps the core framework could add a generalized way to perform a simple task on every payload. I am thinking that when a user configures the Stoq instance, they could pass in a dictionary of lambda functions, something like:

from hashlib import md5

from stoq import Stoq

Stoq(
    additional_core_tasks={
        'md5': lambda payload: md5(payload.content).hexdigest()
    }
)

which would cause each payload to have the md5 in its results:

{
    "results": [
        {
            "payload_id": "00d2f069-d716-43ed-bc2f-b0bd295574d4",
            "size": 507904,
            "md5": "...",
            ...
        }
    ]
}
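(To make the proposal concrete: the framework-side behavior being suggested is roughly the loop below. This is purely illustrative; neither additional_core_tasks nor these names exist in stoQ.)

# For each payload, the core would run every configured task once and merge
# its output into that payload's results entry, alongside built-in fields
# such as size.
for name, task in additional_core_tasks.items():
    payload_result[name] = task(payload)  # e.g. payload_result['md5']
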
mlaferrera commented 4 years ago

I'd be curious to know how much of an impact waiting for the second round really has. If it's substantial, it may be something we focus on for the next major version of stoQ.

That's an interesting idea, additional_core_tasks. We will have to do a bit of experimentation and gather some additional use cases, but it is definitely an intriguing concept.

Thanks again for all of the great feedback and ideas!

ytreister commented 4 years ago

So I have a plugin named 'filetype' that always runs and dispatches based on the file type found; this runs in "round 1". I have another plugin named 'file_details', which computes hashes (md5, etc.) and requires the results from 'filetype', so it runs in "round 2". I then have a couple of plugins ('metadefender', etc.) that depend on the hashes computed in 'file_details'; these have to run in "round 3", since they require the md5 computed in "round 2".
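Laid out as a dispatch chain (plugin names and rounds as described above):

round 1: filetype (always runs; dispatches on the detected file type)
round 2: file_details (requires 'filetype'; computes md5 and other hashes)
round 3: metadefender, etc. (require the md5 produced by 'file_details')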

Obviously, I could create a simple plugin that only computes the md5 and set it to always run, but that would mean a fair amount of boilerplate for a simple one-liner (see the sketch below).
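(For reference, a minimal sketch of what that always-run md5 plugin might look like, assuming the stoQ v3 WorkerPlugin API with an async scan returning a WorkerResponse; the class name is hypothetical.)

from hashlib import md5
from typing import Optional

from stoq import Payload, Request, WorkerResponse
from stoq.plugins import WorkerPlugin


class Md5Only(WorkerPlugin):
    # Hypothetical worker whose entire job is computing a payload's md5.
    async def scan(self, payload: Payload, request: Request) -> Optional[WorkerResponse]:
        # The actual work is the one-liner; the rest is scaffolding, plus a
        # plugin configuration file and an always-run dispatch rule.
        return WorkerResponse({'md5': md5(payload.content).hexdigest()})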

I think the extra wait could be long depending on how big the file is, which plugins run in the round being waited on, etc. Also, when scaling up analysis, seconds could turn into minutes, and minutes into hours...

In the core it was already decided that some basics get computed for every payload (for example, the size in the results above). It would be nice to give users the same flexibility for other basics.

mlaferrera commented 4 years ago

I don't foresee supporting hash generation for every payload, especially considering the limited use case and the fact that there are simple methods already available (namely, a plugin that generates the hash and always runs).

I certainly understand the compounding effect of awaiting plugin completion, and it is something we take seriously into account as we develop the core framework as well as the plugins. Your concept of a simple prologue for each payload may be the best path forward, but I think we need to dig into how to implement it effectively before committing to it.