Open robcza opened 8 years ago
You don't need to encode JSON in JSON, just save a dictionary.
My initial proposal. Data:
[
  {"ip": "127.0.0.1", "time": "324543262"},
  {"ip": "127.0.0.1", "time": "324543262"}
]
Configuration:
"parameters": {
  "transform": [
    {"ip": {"source.ip": null}, "time": {"source.time": "timestampToDatetime"}}
  ]
}
The point of the proposal is that the transform struct has the same structure as the data itself. Each data value is replaced by the target field name and the name of the transformation function.
I'm not sure how to mark the fields themselves so that the code can detect them easily.
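To make the idea concrete, here is a minimal sketch of how such a transform struct could be applied to the records. It assumes a hypothetical registry of transformation functions keyed by the names used in the configuration, and a hypothetical timestamp_to_datetime helper; neither is existing IntelMQ code.

```python
from datetime import datetime, timezone


def timestamp_to_datetime(value):
    """Hypothetical transformation: Unix timestamp string -> ISO 8601 UTC string."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc).isoformat()


# Hypothetical registry: function names used in the configuration -> callables.
FUNCTIONS = {"timestampToDatetime": timestamp_to_datetime}

# One transform entry, shaped like one data record: each source key maps to
# {target_field: function_name_or_None}.
TRANSFORM = {
    "ip": {"source.ip": None},
    "time": {"source.time": "timestampToDatetime"},
}


def apply_transform(record, transform):
    """Build an event dict from one record by walking the transform struct."""
    event = {}
    for key, rule in transform.items():
        if key not in record:
            continue  # assumption: missing source keys are skipped silently
        for target, function_name in rule.items():
            value = record[key]
            if function_name is not None:
                value = FUNCTIONS[function_name](value)
            event[target] = value
    return event


data = [
    {"ip": "127.0.0.1", "time": "324543262"},
    {"ip": "127.0.0.1", "time": "324543262"},
]
events = [apply_transform(record, TRANSFORM) for record in data]
```

Because the struct mirrors the data, nested records could be handled by recursing into nested dicts, but that is exactly where the open question of marking the leaf entries comes in.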
IMHO this is not very clear to read.
While I like JSON as a format, I think it is a bad structure for expressing transformation rules or context-free grammars. It is not very readable. We have the same issue with the modify.conf syntax.
Maybe there are other good libraries for this purpose that we can re-use?
I can imagine a syntax such as:
IP -> source.ip ()
Time -> source.time (timestamptodatetime)
...
I know this breaks JSON.
Or we start with the JSON format as described by Sebix but please let's move to something readable in version 2.
Looking at the two ideas, I think the one @sebix proposed would be quite clear to code, which is tempting. @aaronkaplan I understand you would prefer a more readable format, but I don't want to introduce a special format just for one particular bot. This should be agreed upon at a higher scope, for the modify bot and possibly other bots that use special configurations.
I suggest using the format @sebix described; I like that it is easily extensible.
I see that point, but to be honest, the modify.conf file is already quite unreadable.
Let's please think about the usability issues some more before implementing new features. In the long run, usability and, most importantly, simplicity are more relevant than the ability to implement a change quickly.
It's all about the users at the end of the day.
Ok, I agree with your points and I know you fight for this long term. I think we should decide on a common notation for all such bots and keep it out of runtime.conf. Should this be YAML? It is readable and has features similar to JSON.
Agreed. Do you have a proposal?
Ok, my proposal is as follows. Data:
[
  {"ip": "127.0.0.1", "time": "324543262", "type": "bad thing"},
  {"ip": "127.0.0.1", "time": "324543262", "type": "unknown"}
]
Configuration:
- translate:
    ip: source.ip
    time: time.source
    type: classification.type
- functions:
    time.source: timestamptodatetime
    classification.type: type_translation
- parameters:
    type_translation:
      bad thing: malware
      unknown: blacklist
Comments and improvement suggestions are welcome.
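As a rough illustration of how a parser might consume this three-section configuration, here is a sketch. CONFIG is written as the Python structure the YAML would load into (a list of single-key mappings), so no YAML library is needed here. The FUNCTIONS registry and the convention that every function receives the parameters section are assumptions made for this example only.

```python
from datetime import datetime, timezone

# The YAML configuration as it would look after loading.
CONFIG = [
    {"translate": {"ip": "source.ip", "time": "time.source",
                   "type": "classification.type"}},
    {"functions": {"time.source": "timestamptodatetime",
                   "classification.type": "type_translation"}},
    {"parameters": {"type_translation": {"bad thing": "malware",
                                         "unknown": "blacklist"}}},
]


def timestamptodatetime(value, parameters):
    """Unix timestamp string -> ISO 8601 UTC string (parameters unused)."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc).isoformat()


def type_translation(value, parameters):
    """Look the value up in the table configured under 'type_translation'."""
    return parameters["type_translation"][value]


# Hypothetical registry of the functions a parser would ship and document.
FUNCTIONS = {
    "timestamptodatetime": timestamptodatetime,
    "type_translation": type_translation,
}


def translate(record, config):
    """Rename the fields of one record and apply the configured functions."""
    sections = {}
    for section in config:
        sections.update(section)  # merge the single-key mappings
    functions = sections.get("functions", {})
    parameters = sections.get("parameters", {})
    event = {}
    for source_key, target in sections["translate"].items():
        value = record[source_key]
        function_name = functions.get(target)
        if function_name is not None:
            value = FUNCTIONS[function_name](value, parameters)
        event[target] = value
    return event


event = translate(
    {"ip": "127.0.0.1", "time": "324543262", "type": "bad thing"}, CONFIG
)
```

Note that the functions section is keyed by target field, so each target can have at most one function in this layout.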
I share the thought that parsers should have a common configuration logic. The new shadowserver parsers use something similar, but it is simpler there because shadowserver CSV does not contain nested objects. If you want to have a look at it: https://github.com/Intevation/intelmq/blob/universal-shadowserver-parser/intelmq/bots/parsers/shadowserver/config.py
@dmth thanks a lot for bringing this to this thread. In your notation the configuration would look like this, I presume:
config = {
    # (target field, source key[, conversion function])
    'required_fields': [
        ('time.source', 'time', timestamptodatetime),
        ('source.ip', 'ip'),
        ('classification.type', 'type', type_translation)
    ],
    'parameters': {
        'type_translation': {
            'bad thing': 'malware',
            'unknown': 'blacklist'
        }
    }
}
The notation is readable and seems simple to use. However, I see one major problem: Python code is subject to the AGPL license, and every single instance of IntelMQ would have to publish such a configuration to be compliant. For the shadowserver parser that seems OK; for a generic parser it could be a blocker.
Yes, that would match our idea. Please note: Functions for type-translation are only designed for one parameter right now. That will not be sufficient.
Your licensing concerns are valid if translation functions are created (because they are new code). If the file only contains configuration, it would not be subject to the AGPL. But I can only assume that from my point of view, and I'm not a lawyer.
Please note: Functions for type-translation are only designed for one parameter right now. That will not be sufficient.
And they only have one return value. That could be problematic too if we save one JSON-value in two of our fields.
So, ideally (let's free our minds for a moment from formats) a transformation system looks like this:
time.source = timestamptodatetime(time) # comment: transform via the function
source.ip = ip # no transformation, simple assignment
classification.type = type_translation(type, translation_table, default="unknown") # example with multiple parameters, where translation_table of course needs to be defined.
The first parameter of any translation function must be the source value to be translated. Any further parameters are optional extras needed for a proper translation; default= values are OK. The translation system must not contain loops or if-thens, so that every translation runs in a bounded amount of time (we can give an upper bound).
I believe this type of syntax would be the most readable, the most direct. And hey, actually it can be a subset of regular python code....
Another alternative (for JSON transformations) might be the jq syntax. But that is not a generic transformation syntax. It is only specific for JSON.
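To illustrate the "subset of regular Python" idea, here is a sketch that uses the standard ast module to accept only lines of the form target = name or target = func(name, ..., key=literal) and to reject everything else, including loops and conditionals. The rule tuple layout, the resolution order (record first, then tables), and the two sample functions are assumptions for this example. Text that is not valid Python at all would raise SyntaxError from ast.parse rather than ValueError.

```python
import ast
from datetime import datetime, timezone


def dotted_name(node):
    """Return 'a.b' for a Name/Attribute chain, or raise ValueError."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        return dotted_name(node.value) + "." + node.attr
    raise ValueError("expected a plain or dotted name")


def parse_rules(text):
    """Parse rule lines into (target, function_name, arg_names, kwargs) tuples."""
    rules = []
    for statement in ast.parse(text).body:
        if not isinstance(statement, ast.Assign) or len(statement.targets) != 1:
            raise ValueError("only single assignments are allowed")
        target = dotted_name(statement.targets[0])
        value = statement.value
        if isinstance(value, ast.Name):  # plain assignment, no function
            rules.append((target, None, [value.id], {}))
            continue
        if not (isinstance(value, ast.Call) and isinstance(value.func, ast.Name)):
            raise ValueError("right-hand side must be a name or a simple call")
        if not all(isinstance(arg, ast.Name) for arg in value.args):
            raise ValueError("positional arguments must be plain names")
        kwargs = {}
        for keyword in value.keywords:
            if keyword.arg is None or not isinstance(keyword.value, ast.Constant):
                raise ValueError("keyword arguments must be literal values")
            kwargs[keyword.arg] = keyword.value.value
        rules.append((target, value.func.id, [a.id for a in value.args], kwargs))
    return rules


def apply_rules(rules, record, functions, tables):
    """Evaluate parsed rules; names resolve to record fields first, then tables."""
    event = {}
    for target, function_name, arg_names, kwargs in rules:
        values = [record[n] if n in record else tables[n] for n in arg_names]
        if function_name is None:
            event[target] = values[0]
        else:
            event[target] = functions[function_name](*values, **kwargs)
    return event


RULES = """
time.source = timestamptodatetime(time)  # transform via the function
source.ip = ip                           # simple assignment
classification.type = type_translation(type, translation_table, default="unknown")
"""

FUNCTIONS = {
    "timestamptodatetime": lambda value: datetime.fromtimestamp(
        int(value), tz=timezone.utc).isoformat(),
    "type_translation": lambda value, table, default: table.get(value, default),
}
TABLES = {"translation_table": {"bad thing": "malware", "unknown": "blacklist"}}

rules = parse_rules(RULES)
event = apply_rules(
    rules, {"ip": "127.0.0.1", "time": "324543262", "type": "bad thing"},
    FUNCTIONS, TABLES,
)
```

The rules are never executed as Python; they are only parsed, so the configuration cannot run arbitrary code and each rule costs one function call.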
Ok, I think we can go with the Python notation. The config file will be loaded dynamically (its name configured as a runtime parameter). Functions could be defined in the config as well, but such configurations would have to be published to be AGPL compliant. I prefer functions to be defined in the parser itself and documented.
How do you want to do the configuration of a hierarchical structure as JSON? The example in https://github.com/certtools/intelmq/issues/553#issuecomment-230335760 is not hierarchical.
I've been playing with the Python native notation and tried to achieve something similar to what Aaron proposed in https://github.com/certtools/intelmq/issues/553#issuecomment-230456234. The result was quite disappointing, so I would like to try YAML with the function notation:
- translate:
    source.ip: ip
    time.source: timestamptodatetime(time)
    classification.type: type_translation(type, translation_table, default)
- parameters:
    translation_table:
      bad thing: malware
      unknown: blacklist
    default: blacklist
However, multiple function parameters could introduce some unwanted complexity. What do you mean by "hierarchical"? Could you post an example please?
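To show where that complexity comes from, here is a small sketch that resolves the string values of such a configuration. It assumes that every argument is a bare name which refers either to a field of the input record or to an entry under parameters, with record fields taking precedence. That precedence rule and the FUNCTIONS registry are my assumptions, not something decided in this thread.

```python
import re
from datetime import datetime, timezone

# Matches "name(arg1, arg2)"; any other value is treated as a source field name.
CALL = re.compile(r"^(\w+)\((.*)\)$")

TRANSLATE = {
    "source.ip": "ip",
    "time.source": "timestamptodatetime(time)",
    "classification.type": "type_translation(type, translation_table, default)",
}
PARAMETERS = {
    "translation_table": {"bad thing": "malware", "unknown": "blacklist"},
    "default": "blacklist",
}
FUNCTIONS = {
    "timestamptodatetime": lambda value: datetime.fromtimestamp(
        int(value), tz=timezone.utc).isoformat(),
    "type_translation": lambda value, table, default: table.get(value, default),
}


def resolve(expression, record, parameters, functions):
    """Evaluate one configuration value against one input record."""
    expression = expression.strip()
    match = CALL.match(expression)
    if match is None:
        return record[expression]  # plain field reference
    name, raw_args = match.groups()
    arg_names = [a.strip() for a in raw_args.split(",") if a.strip()]
    # Assumption: a record field shadows a parameter of the same name.
    args = [record[a] if a in record else parameters[a] for a in arg_names]
    return functions[name](*args)


record = {"ip": "127.0.0.1", "time": "324543262", "type": "something else"}
event = {
    target: resolve(expression, record, PARAMETERS, FUNCTIONS)
    for target, expression in TRANSLATE.items()
}
```

The ambiguity between field names and parameter names is the main cost of allowing more than one argument; a prefix or a separate section for parameters would remove it.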
We have an existing parser for fraunhofer dga. Its test data looks like this:
{
  "banjori_dga_andersensinaix.com_0x3c03": [
    "andersensinaix.com",
    "xjsrrsensinaix.com",
    "hlrfrsensinaix.com",
    "fnosrsensinaix.com",
    "128.238.197.33",
    "lbzorsensinaix.com",
    "sgjprsensinaix.com"
  ]
}
What would the configuration look like?
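For illustration only, this is roughly the work a generic parser would have to do before any flat field mapping could apply to such data: the outer key carries information of its own, and the list mixes domains and IP addresses. The sketch below is not the logic of the existing parser, and the record keys "group", "ip" and "domain" are hypothetical names.

```python
import ipaddress

# Shortened copy of the test data above.
REPORT = {
    "banjori_dga_andersensinaix.com_0x3c03": [
        "andersensinaix.com",
        "128.238.197.33",
        "sgjprsensinaix.com",
    ]
}


def is_ip(value):
    """Return True if the string parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
    except ValueError:
        return False
    return True


def flatten(report):
    """Yield one flat record per list entry, carrying the outer key along."""
    for group, entries in report.items():
        for entry in entries:
            field = "ip" if is_ip(entry) else "domain"
            yield {"group": group, field: entry}


records = list(flatten(REPORT))
```

A configuration notation would need some way to express both steps (iterate over the keys, then dispatch on the type of each value), which none of the flat examples so far can do.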
What other data sources using JSON do you have in mind to use the parser with?
I'd like to add another generic parser, this time for JSON reports. While the CSV parser was pretty straightforward, this one seems a bit more tricky. Transforming generic JSON into an IntelMQ event could require quite a complex notation. I struggle to make it generic enough while keeping it simple to configure. My current idea is to keep the runtime parameter as simple as this:
for example:
Comments, advice, and ideas are very welcome @sebix @aaronkaplan @SYNchroACK