RFC: Controllable task properties

yankovs commented 2 years ago

Buckle up, it's a long one! I'd like to suggest the use of a "properties" payload (or a Karton Task's class member) with some common properties one can define for a task in karton.
This properties payload is currently used at Check Point Research in our production kartons and MWDB. We think it would be wise to use it across CERT.PL's existing kartons.

*Note that we will use Payload as an implementation example, but it can also be anything else.

The idea

Encourage users to add a payload called properties to their tasks (if this becomes part of karton itself, then a dictionary item in KartonTask would be better). These properties will instruct several kartons (developed by CERT.PL) how to treat the task or the payloads inside it. The properties we found useful are the following:

volatile: boolean, defaults to false
emulate: boolean, defaults to true
force_analyze: boolean, defaults to false
force_emulate: boolean, defaults to false
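
As a rough illustration of how a karton could apply these defaults, here is a minimal sketch. The names `DEFAULT_PROPERTIES` and `resolve_properties` are hypothetical and not part of karton-core; the function works on the plain dict that would be stored under the "properties" payload.

```python
from typing import Any, Dict, Optional

# Defaults as listed in the proposal above (hypothetical constant name).
DEFAULT_PROPERTIES: Dict[str, bool] = {
    "volatile": False,
    "emulate": True,
    "force_analyze": False,
    "force_emulate": False,
}


def resolve_properties(raw: Optional[Dict[str, Any]]) -> Dict[str, bool]:
    """Merge a task's "properties" dict with the proposed defaults.

    `raw` is whatever the "properties" payload contained; it may be None
    when the producer did not set the payload at all.
    """
    resolved = dict(DEFAULT_PROPERTIES)
    for key, value in (raw or {}).items():
        if key in resolved:
            # Coerce to bool so that downstream checks stay simple.
            resolved[key] = bool(value)
        # Unknown keys are ignored in this sketch.
    return resolved
```

With this, a task that carries no properties at all behaves exactly like one that spells out every default.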

Usage

I think the best way to show why it might be a good idea is from examples of actual, running kartons.

volatile

The main use of the volatile property is to instruct kartons to treat the sample of the task as volatile, meaning there's no need to report it to external databases (e.g. MWDB, MISP). An easy implementation in karton-mwdb-reporter that prevents the karton from reporting a file to MWDB looks like this:

def process(self, task):
    # Skip reporting when the task is marked as volatile.
    if task.has_payload("properties"):
        if task.get_payload("properties").get("volatile"):
            self.log.info("Task is volatile. Finishing here.")
            return
    ...

This way, samples won't be reported to MWDB when the volatile property is set to true, either because the sample is an auxiliary file made for testing or because it is not interesting enough to be uploaded to the database. Consequently, another simple yet nice use of the volatile property is to create internal testing/utility producers to be used by developers, analysts, researchers, etc.
For instance, let's take a look at tester-producer.py:

import os, sys
from karton.core import Producer, Resource, Task

producer = Producer()

filename = sys.argv[1]
with open(filename, "rb") as f:
    content = f.read()

resource = Resource(os.path.basename(filename), content)

task = Task({"type": "sample", "kind": "raw"})
task.add_payload("sample", resource)
task.add_payload("properties", {"volatile": True})

producer.send_task(task)

It can help test staging or even production environments locally without the need to store the files in a database (MWDB), while still exercising the whole flow of the system. It can also be used to quickly analyze a file without having it show up in MWDB. Skipping the sandbox as well is what the emulate property is for.

As a recent example of this approach, we sent millions of PDF files to our karton system, and we did not want them stored in MWDB unless we believed a PDF sample contained an exploit. So we produced these millions of tasks with the volatile flag turned on, which prevented them from flooding MWDB.

emulate

Another use is controlling the sandboxing and analysis of samples, in case you've set up some system to sandbox samples and report the results back into karton (which we all do).
Let's say you've set up feeders to get samples from different sources. The same samples may come from different sources again and again. In such a case we do not want to re-sandbox the sample, since emulating a file in a sandbox is costly and will probably pollute the database with redundant dumps and artifacts. A simple solution might be adding the following code to the kartons responsible for sending samples to be sandboxed:

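# The producer sets emulate to False after checking that the sample
# was already emulated.
# Default to an empty dict so tasks without "properties" don't crash here.
properties = task.get_payload("properties", {})
if not properties.get("emulate", True):
    self.log.info("Not going to emulate it.")
    ...
else:
    # logic that sends to a sandbox
    ...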
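
The "previous check" mentioned in the comment above could look roughly like the sketch below, on the feeder side. It assumes a hypothetical in-memory set of SHA-256 digests for samples that were already sandboxed; a real deployment would query whatever store records past sandbox runs. The function name `properties_for_sample` is an assumption made for illustration, and recording the hash at feed time (rather than after the sandbox run finishes) is a simplification.

```python
import hashlib
from typing import Dict, Set


def properties_for_sample(content: bytes, already_emulated: Set[str]) -> Dict[str, bool]:
    """Build the "properties" dict for a sample coming from a feed.

    `already_emulated` is a hypothetical set of SHA-256 hex digests of
    samples that were sent to the sandbox before.
    """
    sha256 = hashlib.sha256(content).hexdigest()
    if sha256 in already_emulated:
        # Seen before: ask downstream kartons to skip the sandbox.
        return {"emulate": False}
    # First sighting: remember it and keep the default behaviour.
    already_emulated.add(sha256)
    return {"emulate": True}
```

The returned dict would then be attached with task.add_payload("properties", ...), as in the tester producer shown earlier.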

force_analyze and force_emulate

These two, as the names suggest, force a sample to be analyzed/sandboxed. In the case of sandboxing, force_emulate overrides emulate, whatever its value may be. The main use of these two is for sample sources which heuristically produce "interesting" samples, for example the upload button or the "reanalyze" button in MWDB. It is reasonable to assume that if an analyst chose to upload a sample, or to re-analyze it, it is because they think it's an interesting one and would want it to be emulated.
Therefore, we suggest the following small change in mwdb-core/mwdb/core/karton.py:

task = Task(
    headers={"type": "sample", "kind": "raw", "quality": feed_quality},
    payload={
        "sample": Resource(file.file_name, path=path, sha256=file.sha256),
        "attributes": file.get_attributes(as_dict=True, check_permissions=False),
        # Proposed addition: ensures the sample is analyzed and emulated.
        "properties": {"force_analyze": True, "force_emulate": True},
    },
    priority=task_priority,
)

And thus, when implementing a karton/system to dispatch samples from karton into a sandbox, this simple logic will work:

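# Default to an empty dict so tasks without "properties" are handled too.
properties = task.get_payload("properties", {})
if properties.get("force_emulate"):
    # force_emulate overrides emulate: send to a sandbox
    ...
elif properties.get("emulate", True):
    # emulate defaults to true: send to a sandbox
    ...
else:
    # no emulation for you
    ...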

These properties can be mixed and matched for different scenarios. For example:

force-analyze a complete flow of an existing file (reanalyze on MWDB):

properties: {
    "force_analyze": True,
    "force_emulate": True,
}

new files to be analyzed by kartons without emulation and without saving:
```
properties: {
    "emulate": False,
    "volatile": True
}
```
normal file analysis (default values):
```
Could be without properties
```
normal file analysis without sandbox:
```
properties: {
    "emulate": False,
}
```
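
To make the combinations above easier to check, here is a small sketch that turns a properties dict into the two decisions discussed so far: whether to sandbox and whether to report. The function name `plan` and the result keys are hypothetical. force_analyze is deliberately left uninterpreted, since the text above does not spell out which kartons would consume it.

```python
from typing import Any, Dict, Optional


def plan(properties: Optional[Dict[str, Any]] = None) -> Dict[str, bool]:
    """Translate properties into sandbox/report decisions (illustrative only)."""
    props = properties or {}
    return {
        # force_emulate overrides emulate; emulate defaults to true.
        "sandbox": bool(props.get("force_emulate", False) or props.get("emulate", True)),
        # Volatile samples are not reported to external databases.
        "report": not props.get("volatile", False),
    }


# The four scenarios listed above.
scenarios = {
    "reanalyze on MWDB": {"force_analyze": True, "force_emulate": True},
    "no emulation, no saving": {"emulate": False, "volatile": True},
    "normal analysis": None,
    "normal analysis without sandbox": {"emulate": False},
}
for name, props in scenarios.items():
    print(name, plan(props))
```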

Additional points

backwards compatibility(?)

The only case in which backwards compatibility is broken is if someone has already implemented the same properties payload in their kartons and written logic around these exact properties. Then it might cause a conflict. However, I believe this is not a problem, because it is far too specific a change to the logic. Also, if someone actually made these changes, they are using a forked version of the official kartons (the reporter, for example). So there are two cases: either this addition helps them move to the upstream version and stop using forked repos, or they continue using a fork as they have up until now.

Must be a payload?

No, it doesn't have to be. It can also be something integrated directly into the Task class, like "priority". We used a payload because it was the easiest to implement without branching from karton-core and other kartons. Another alternative might be adding these properties to the headers of tasks. It may be a personal preference, but I think they should belong to a dictionary field (payload or not) called properties, because these fields are ultimately user-controllable, whereas things like kind or type are internal values inferred as the task moves through different kartons.

Ok, so what actually needs to be changed?

The two main changes were mentioned above, in karton-mwdb-reporter and in karton.py from mwdb-core. Another place where it makes sense is, for example, this example drakvuf producer. If there are other places where you think this might be useful, we'd be glad to discuss it and implement it if needed.
Another area that will probably need to change is the docs, to let people know this exists. This issue can be used as a basis for the docs. In any case, we have no problem handling documentation :)

Of course, these are just a limited number of examples and anyone can extend this idea to whatever they might need. Let us know what you think about it.

ITAYC0HEN commented 2 years ago

Hey dears! :) @psrok1, @nazywam - we'd love your feedback on the suggestions. What do you think? Do you find it as useful as we do? Do you see the need? Do you have other solutions, etc.?

psrok1 commented 2 years ago

Hi! Thanks for the broad explanation, we really appreciate it! Let me write out some random thoughts and questions on it:

1) It's really close to another idea, which is the standardization of some common header fields in Karton. For example: type, kind and stage are universal fields that must be followed by every Karton. They implement a specific contract which is enforced by services like karton-classifier or karton-config-extractor but was never documented.

One of the ideas was to prepare classes like TaskHeader(type="sample", kind="raw", **other_header_fields), where standardization comes with documentation and typing. Of course we should still be able to define a task using plain dicts, for backwards compatibility and support for non-standard tasks. In that model, we can provide a special place for properties as well.

2) volatile: True is a really nice addition, if I understand the meaning correctly: "Don't provide too much persistence for that artifact, in its current form it's not that important". emulate: False means not to perform heavy analysis (e.g. dynamic analysis in a sandbox). Maybe force_emulate is just emulate: True? These fields might be defined as optional, and the presence of a field may suggest that a specific type of processing is enforced by the task producer.

3) What's the difference between force_emulate and force_analyze?
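
To make points 1 and 2 a bit more concrete, here is a rough sketch of what such a class could look like. This is only an illustration: `TaskHeader` is not an existing karton-core class, and the `properties` argument and the `requested` method are hypothetical names. The idea shown is that an absent property means "no preference from the producer", while an explicit true or false means the producer enforces that behaviour.

```python
from typing import Any, Dict, Optional


class TaskHeader:
    """Hypothetical typed header; not an existing karton-core class."""

    def __init__(
        self,
        type: str,
        kind: Optional[str] = None,
        stage: Optional[str] = None,
        properties: Optional[Dict[str, bool]] = None,
        **other_header_fields: Any,
    ) -> None:
        self.type = type
        self.kind = kind
        self.stage = stage
        # One possible "special place" for the properties from this issue.
        self.properties: Dict[str, bool] = dict(properties or {})
        # Non-standard fields are kept so plain-dict style tasks still fit.
        self.other_header_fields = dict(other_header_fields)

    def to_dict(self) -> Dict[str, Any]:
        """Return the plain dict form, for backwards compatibility."""
        headers: Dict[str, Any] = {"type": self.type}
        if self.kind is not None:
            headers["kind"] = self.kind
        if self.stage is not None:
            headers["stage"] = self.stage
        headers.update(self.other_header_fields)
        return headers

    def requested(self, name: str) -> Optional[bool]:
        """Tri-state lookup: None means the producer expressed no preference."""
        value = self.properties.get(name)
        return None if value is None else bool(value)
```

With optional fields like this, a separate force_emulate flag would not be needed: `requested("emulate")` returning True would mean "enforced by the producer", and None would leave the decision to the karton.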

msm-cert commented 10 months ago

Hi! Welcome back. Thanks for making this detailed issue (I didn't notice it until today).

Since persistent headers are now implemented we can go back to thinking about this. I have to say this will be quite useful for us too.

The things I already know (I've asked people internally what they think about it):

I think we all agree that we don't like the separation of force_emulate and emulate. If that's not critical, we plan to only support a single flag (emulate, using this naming convention)
similarly, we don't understand what force_analyze is for. We don't plan to support it until we get a better understanding of it

So for now we're just interested in features provided by volatility and emulate.

Now some of my personal thoughts:

volatility is a more complex concept than it looks like IMO, because there's a lot of external services touched during the analysis in our pipeline. Of course we upload samples to mwdb, but there are many more moving parts: configs go to mtracker, iocs go to n6 and misp, samples themselves go to apkdetect and of course drakvuf. I think before jumping to implementation we should decide exactly what volatile means and maybe define "volatility levels"
emulate is a nice idea and solves our immediate problem. But I'm slightly concerned about duplicating concepts and I wonder if the same could be achieved with a hypothetical "ultralow" feed quality level?

benefits:

simpler implementation, less "special" headers
there's already a nice support in mwdb for quality, and it can even be set per user (exactly our use case)
it has clearer semantics. Our real intention is to avoid running heavy tasks on samples from "mass" feeds and avoid overwhelming our infrastructure. There may be more services not fit for mass analysis (we used to have one more cost-intensive karton service that did decompilation - it would also benefit from this)

downsides:

adding a new enum value may not be backward compatible, and we'll need to check every karton service we have
it's more work than just adding a persistent "emulate" header.
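
For comparison, the quality-based alternative could be sketched like this. Everything here is hypothetical: "ultralow" is the level floated above, the "low" and "high" values are assumed to match the existing feed quality values, and the names `QUALITY_RANK` and `allows_heavy_analysis` are made up for the example. The fallback for unknown values shows one way to soften the backward compatibility concern.

```python
from typing import Optional

# Ordered from least to most trusted. "low" and "high" are assumed to be the
# existing values; "ultralow" is the hypothetical new level.
QUALITY_RANK = {"ultralow": 0, "low": 1, "high": 2}


def allows_heavy_analysis(quality: Optional[str], minimum: str = "low") -> bool:
    """Return True when a cost-intensive service should process the task.

    Missing or unrecognized quality values are treated as "high", so tasks
    from producers that don't set quality keep today's behaviour.
    """
    rank = QUALITY_RANK.get(quality or "high", QUALITY_RANK["high"])
    return rank >= QUALITY_RANK[minimum]
```

A sandbox dispatcher would call this with the task's quality header and skip the task when it returns False; a more expensive service could pass minimum="high".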

I'm still not decided and would appreciate some perspective.

CERT-Polska / karton