Open yankovs opened 2 years ago
Hey dears! :) @psrok1, @nazywam - we'd love your feedback on the suggestions. What do you think? Do you find it as useful as we do? Do you see the need? Do you have other solutions, etc.?
Hi! Thanks for the broad explanation, we really appreciate it! Let me write out some random thoughts and questions on it:
1) It's really close to another idea, which is the standardization of some common header fields in Karton. For example, `type`, `kind` and `stage` are universal fields that must be followed by every Karton. They implement a specific contract which is enforced by services like `karton-classifier` or `karton-config-extractor`, but they were never documented.
One of the ideas was to prepare classes like `TaskHeader(type="sample", kind="raw", **other_header_fields)`, where standardization comes with documentation and typing. Of course, we should still be able to define a task using plain dicts for backwards compatibility and support for non-standard tasks. In that model, we can provide a special place for `properties` as well.
2) `volatile: True` is a really nice addition, if I correctly understand the meaning: "Don't provide too much persistence for that artifact, in its current form it's not that important". `emulate: False` means not to perform heavy analysis (e.g. dynamic analysis in a sandbox). Maybe `force_emulate` is just `emulate: True`? These fields might be defined as optional, and the presence of a field may suggest that a specific type of processing is enforced by the task producer.
3) What's the difference between `force_emulate` and `force_analyze`?
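To make points 1) and 2) a bit more concrete, here is a minimal sketch of what such a typed header could look like. `TaskHeader` does not exist in karton-core; the class, the `properties` argument and the `enforced` helper are assumptions made for illustration only. An absent property means "the producer did not say", while an explicit `True`/`False` means the producer enforces that behaviour.

```python
from typing import Any, Dict, Optional


class TaskHeader:
    """Hypothetical typed header; NOT part of karton-core."""

    def __init__(
        self,
        type: str,
        kind: Optional[str] = None,
        stage: Optional[str] = None,
        properties: Optional[Dict[str, bool]] = None,
        **other_header_fields: Any,
    ) -> None:
        self.type = type
        self.kind = kind
        self.stage = stage
        # Assumed "special place" for producer-controlled flags
        self.properties = dict(properties or {})
        # Non-standard fields are kept as-is for backwards compatibility
        self.other = dict(other_header_fields)

    def enforced(self, name: str) -> Optional[bool]:
        """Return True/False if the producer set the flag, None if it is absent."""
        return self.properties.get(name)

    def to_dict(self) -> Dict[str, Any]:
        """Flatten to a plain dict, emitting only fields that were set."""
        headers = dict(self.other)
        headers["type"] = self.type
        if self.kind is not None:
            headers["kind"] = self.kind
        if self.stage is not None:
            headers["stage"] = self.stage
        if self.properties:
            headers["properties"] = dict(self.properties)
        return headers


header = TaskHeader(type="sample", kind="raw", properties={"emulate": False}, origin="feed")
print(header.to_dict())
print(header.enforced("volatile"))  # absent, so the consumer falls back to its default
```

With this shape, `emulate: True` and an absent `emulate` are distinguishable, which is one way a separate `force_emulate` flag could become unnecessary.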
Hi! Welcome back. Thanks for making this detailed issue (I didn't notice it until today).
Since persistent headers are now implemented we can go back to thinking about this. I have to say this will be quite useful for us too.
The things I already know (I've asked people internally what they think about it):
- `force_emulate` and `emulate` separation. If that's not critical, we plan to only support a single flag (`emulate` using this naming convention)
- `force_analyze` is for. We don't plan to support it, until we get a better understanding of it

So for now we're just interested in features provided by `volatility` and `emulate`.
Now some of my personal thoughts:
- `volatility` is a more complex concept than it looks like IMO, because there's a lot of external services touched during the analysis in our pipeline. Of course we upload samples to mwdb, but there are many more moving parts: configs go to mtracker, iocs go to n6 and misp, samples themselves go to apkdetect and of course drakvuf. I think before jumping to implementation we should decide exactly what volatile means and maybe define "volatility levels"
- `emulate` is a nice idea and solves our immediate problem. But I'm slightly concerned about duplicating concepts and I wonder if the same could be achieved with a hypothetical "ultralow" feed quality level?

benefits:
- `quality`, and it can even be set per user (exactly our use case)

downsides:
I'm still not decided and would appreciate some perspective.
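For the "volatility levels" idea, one possible shape is an ordered enum plus a per-service threshold. Everything below is hypothetical: the level names, and especially which service tolerates which level, are made up purely to illustrate the mechanism and would need to be decided first.

```python
from enum import IntEnum


class Volatility(IntEnum):
    """Hypothetical levels; higher means less persistence."""

    PERSISTENT = 0  # report everywhere
    INTERNAL = 1    # keep in internal stores, skip sharing platforms
    EPHEMERAL = 2   # store nowhere


# Highest level at which a service still receives data (illustrative values only)
MAX_LEVEL = {
    "mwdb": Volatility.INTERNAL,
    "mtracker": Volatility.INTERNAL,
    "n6": Volatility.PERSISTENT,
    "misp": Volatility.PERSISTENT,
}


def may_report(service: str, level: Volatility) -> bool:
    """Unknown services are treated conservatively: persistent tasks only."""
    return level <= MAX_LEVEL.get(service, Volatility.PERSISTENT)


for service in ("mwdb", "misp"):
    print(service, may_report(service, Volatility.INTERNAL))
```

A plain boolean `volatile` would then correspond to picking one fixed level for all services.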
Buckle up, it's a long one! I'd like to suggest the use of a "properties" payload (or a Karton Task's class member) with some common properties one can define for a task in karton.
This properties payload is currently used in Checkpoint research in our production kartons and mwdb. We think it would be wise to use it across CERT.PL's existing kartons.
*Note that we will use Payload as an implementation example, but it can also be anything else.
The idea
Encourage users to add a payload (if part of karton itself, then a dictionary item in KartonTask would be better) called `properties` to their tasks. These properties will instruct several Kartons (developed by CERT.PL) how to treat the task or payloads inside it. The properties we found useful are the following:

- `volatile`: `boolean`, defaults to `false`
- `emulate`: `boolean`, defaults to `true`
- `force_analyze`: `boolean`, defaults to `false`
- `force_emulate`: `boolean`, defaults to `false`
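A minimal sketch of how a consumer could read these properties with the defaults listed above. It works on a plain dict so it stays independent of karton-core; `DEFAULT_PROPERTIES` and `resolve_properties` are hypothetical names, not existing API.

```python
from typing import Any, Dict, Mapping

# Defaults as listed above
DEFAULT_PROPERTIES: Dict[str, bool] = {
    "volatile": False,
    "emulate": True,
    "force_analyze": False,
    "force_emulate": False,
}


def resolve_properties(payload: Mapping[str, Any]) -> Dict[str, bool]:
    """Merge the optional 'properties' entry of a payload dict with the defaults."""
    raw = payload.get("properties") or {}
    resolved = dict(DEFAULT_PROPERTIES)
    for name in DEFAULT_PROPERTIES:
        if name in raw:
            resolved[name] = bool(raw[name])
    return resolved


props = resolve_properties({"sample": b"MZ...", "properties": {"volatile": True}})
if props["volatile"]:
    print("volatile task: skip reporting to external databases")
```

Tasks without a `properties` entry resolve to the defaults, so existing producers would keep their current behaviour.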
Usage
I think the best way to show why it might be a good idea is from examples of actual, running kartons.
volatile
The main usage of the `volatile` property is to instruct kartons to treat the sample of the task as volatile, meaning there's no need to report it to external databases (e.g. MWDB, MISP). An easy implementation in `karton-mwdb-reporter` that will prevent the karton from reporting a file to MWDB will look like this:

This way, samples won't be reported to MWDB in the presence of a positive `volatile` property; either because it is some auxiliary sample made for testing, or a sample not interesting enough to be uploaded to the database. Consequently, another simple yet nice use of the `volatile` property is to create some internal testing/utility producers to be used by developers/analysts/researchers etc.

For instance, let's take a look at `tester-producer.py`:

It can help locally test staging or even production environments without the need to store the files in a database (MWDB), and still get the whole flow of the system. It can also be used to quickly analyze a file without having it show up in MWDB and sandboxed -- that's what the `emulate` keyword is for.

An example of something we've done recently using this approach: we sent millions of PDF files to our karton system, and we don't want them to be stored in MWDB unless we believe a PDF sample contains an exploit. So we produced these millions of tasks with the `volatile` flag turned on, and prevented them from flooding MWDB.

emulate
Another useful application is controlling sandboxing and analysis of samples, in case you've set up some system to sandbox samples and report the results back into karton (which we all do).
Let's say you've set up feeders to get samples from different sources. It may happen that the same samples will come from different sources again and again. In such a case we do not want to re-sandbox the sample, since emulating a file in a sandbox is costly and will probably pollute the database with redundant dumps and artifacts. A simple solution might be adding the following code to the kartons responsible for sending samples to be sandboxed:
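A minimal sketch of such a check, assuming the properties arrive as a plain dict. The feeder-side part (`build_properties` with an in-memory set of hashes) is only an assumption about how `emulate` might get switched off for repeated samples; a real feeder would more likely look the hash up in its own database.

```python
import hashlib
from typing import Dict, Set


def build_properties(sample: bytes, seen_hashes: Set[str]) -> Dict[str, bool]:
    """Feeder side (hypothetical): only ask for emulation the first time a sample is seen."""
    digest = hashlib.sha256(sample).hexdigest()
    first_time = digest not in seen_hashes
    seen_hashes.add(digest)
    return {"emulate": first_time}


def should_submit_to_sandbox(properties: Dict[str, bool]) -> bool:
    """Sandbox-dispatching side: skip the task when emulate is explicitly false."""
    return bool(properties.get("emulate", True))


seen: Set[str] = set()
for source in ("feed-a", "feed-b"):
    props = build_properties(b"same sample bytes", seen)
    print(source, "sandbox" if should_submit_to_sandbox(props) else "skip")
```

Because `emulate` defaults to true, producers that never set the property are sandboxed as before.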
force_analyze and force_emulate
These two, as the names suggest, force a sample to be analyzed/sandboxed. In the case of sandboxing, `force_emulate` will override `emulate`, whatever its value may be. The main usage of these two is for sample sources which heuristically produce "interesting" samples. For example, the `upload` button in MWDB, or the "reanalyze" button in MWDB. It is reasonable to assume that if an analyst chose to upload a sample, or re-analyze it, it is because they think it's an interesting one and would want it to be emulated.

Therefore, we suggest the following small change in `mwdb-core/mwdb/core/karton.py`:

And thus, when implementing a karton/system to dispatch samples from karton into a sandbox, this simple logic will work:
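A minimal sketch of that dispatch decision, assuming the properties are available as a plain dict. The function name `should_emulate` is hypothetical, and the sketch deliberately covers only the `force_emulate`/`emulate` pair, since that is the override rule stated above.

```python
from typing import Mapping


def should_emulate(properties: Mapping[str, bool]) -> bool:
    """force_emulate wins over emulate; emulate defaults to True, force_emulate to False."""
    if properties.get("force_emulate", False):
        return True
    return bool(properties.get("emulate", True))


# A task from an analyst-driven source (e.g. an upload) versus a deduplicated feed task
print(should_emulate({"emulate": False, "force_emulate": True}))
print(should_emulate({"emulate": False}))
```

Keeping the rule in one small function means every sandbox-dispatching karton applies the same precedence.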
These properties can be mixed and matched for different scenarios. For example:
- force analyze a complete flow of an existing file (reanalyze on MWDB)
- new files to be analyzed by kartons without emulation and without saving
- normal file analysis (default values)
- normal file analysis without sandbox
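The four scenarios could map to property sets roughly as follows. The values are inferred from the descriptions and from the defaults listed earlier, so treat them as an assumption rather than the exact values used in production; the preset names and the `expand` helper are hypothetical.

```python
from typing import Dict

DEFAULTS: Dict[str, bool] = {
    "volatile": False,
    "emulate": True,
    "force_analyze": False,
    "force_emulate": False,
}

# Only the non-default values are spelled out for each scenario
PRESETS: Dict[str, Dict[str, bool]] = {
    "reanalyze": {"force_analyze": True, "force_emulate": True},
    "no_emulation_no_saving": {"volatile": True, "emulate": False},
    "normal": {},
    "normal_without_sandbox": {"emulate": False},
}


def expand(preset: str) -> Dict[str, bool]:
    """Return the full properties dict for a named scenario."""
    return {**DEFAULTS, **PRESETS[preset]}


for name in PRESETS:
    print(name, expand(name))
```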
Additional points
backwards compatibility(?)
The only case in which backwards compatibility is broken is if someone already implemented the same `properties` payload in their kartons, and wrote logic revolving around these exact properties that were mentioned. Then it might cause a conflict. However, I believe this is not a problem because it is way too specific of a change to the logic. Also, if someone actually had these changes, they are using a forked version of the official kartons (reporter, for example). So there are two cases: either this addition helps them move to the upstream version and stop using forked repos, or they continue using a fork as they did up until now.

Must be a payload?
No, it doesn't have to be. It can also be something integrated directly into the Task class, like "priority". We used a payload because it was the easiest to implement without branching from karton-core and other kartons. Another alternative might be adding these properties to the header of tasks. It may be a personal preference, but I think they should belong to a dictionary field (payload or not) called `properties`, because these fields are ultimately user controllable, whereas things like `kind` or `type` are internal values inferred as the task moves through different kartons.

Ok, so what actually needs to be changed?
The two main changes were mentioned above, in `karton-mwdb-reporter` and `karton.py` from `mwdb-core`. Another place where it makes sense is, for example, this example drakvuf producer. If there are other places where you think this might be useful, we'd be glad to discuss it and implement it if needed.

Another area that'll probably need to be changed is the docs, to let people know this exists. This issue can be used as a basis for the docs. In any case, we have no problem handling documentation :)
Of course, these are just a limited number of examples and anyone can extend this idea to whatever they might need. Let us know what you think about it.