generic json parser - Githubissues

robcza commented 8 years ago

I'd like to add another generic parser, this time for JSON reports. While the CSV parser was pretty straightforward, this one seems a bit more tricky. Transforming generic JSON into an IntelMQ event could require quite a complex notation. I struggle to make it generic enough while keeping it simple to configure. My most recent idea is to keep the runtime parameter as simple as this:

"transform":"{\"<source_attribute>\": \"<destination_attribute>\", ....}"

for example:

"transform":"{\"source_ip\": \"source.ip\"}"

- I don't like the JSON-in-JSON notation (the backslashes make it unreadable).
- This does not allow any advanced operations, such as joining several attributes into one, but I doubt this is needed.
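
A minimal sketch of how a bot could apply such a flat mapping, assuming the parameter arrives as a JSON-encoded string as in the example above. The function name `apply_transform` and the sample report are hypothetical and not part of IntelMQ.

```python
import json

def apply_transform(report, transform_param):
    """Rename report keys according to a JSON-encoded {source: destination} map."""
    mapping = json.loads(transform_param)  # decode the JSON-in-JSON string
    # Keep only the attributes that have a destination configured.
    return {dest: report[src] for src, dest in mapping.items() if src in report}

# The runtime parameter exactly as it would appear after the outer JSON is decoded.
transform = "{\"source_ip\": \"source.ip\"}"
event = apply_transform({"source_ip": "127.0.0.1", "other": "x"}, transform)
print(event)
```

Attributes without a configured destination are silently dropped here; whether that is the desired behaviour is an open design choice.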

Comments, advice, and ideas are very welcome @sebix @aaronkaplan @SYNchroACK

sebix commented 8 years ago

You don't need to encode JSON in JSON, just save a dictionary.

My initial proposal. Data:

[
{"ip": "127.0.0.1", "time": "324543262"},
{"ip": "127.0.0.1", "time": "324543262"}
]

Configuration:

"parameters": {
    "transform": [
        {"ip": {"source.ip": null}, "time": {"time.source": "timestampToDatetime"}}
    ]
}

The point of the proposal is that the transform struct has the same structure as the data itself. Each data value is replaced by the target field name and the name of a transformation function.

I'm not sure how to mark the fields themselves to make them easily detectable by the code.
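
A rough sketch of how such a mirrored transform struct could be walked, under one possible answer to the marking question: a leaf is assumed to be a one-entry dict whose value is `null` or a function name. The names `FUNCTIONS`, `is_leaf`, `transform` and `parse` are made up for illustration, and the sketch uses IntelMQ's `time.source` field name.

```python
from datetime import datetime, timezone

# Hypothetical registry of transformation functions, looked up by name.
FUNCTIONS = {
    "timestampToDatetime": lambda v: datetime.fromtimestamp(
        int(v), tz=timezone.utc).isoformat(),
}

def is_leaf(template):
    # Assumption: a leaf is a one-entry dict {field_name: function_name_or_None}.
    if not isinstance(template, dict) or len(template) != 1:
        return False
    value = next(iter(template.values()))
    return value is None or isinstance(value, str)

def transform(data, template, event):
    """Walk data and template in parallel, filling the flat event dict."""
    if is_leaf(template):
        (field, func_name), = template.items()
        event[field] = FUNCTIONS[func_name](data) if func_name else data
    elif isinstance(template, dict):
        for key, sub_template in template.items():
            if key in data:
                transform(data[key], sub_template, event)
    return event

def parse(data, template):
    # A one-element list in the template applies to every item of the data list.
    return [transform(item, template[0], {}) for item in data]

data = [{"ip": "127.0.0.1", "time": "324543262"}]
template = [{"ip": {"source.ip": None},
             "time": {"time.source": "timestampToDatetime"}}]
events = parse(data, template)
print(events)
```

The leaf rule is the fragile part: a data attribute whose only child is a plain string would be indistinguishable from a leaf, which is exactly why an explicit marker may be needed.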

aaronkaplan commented 8 years ago

IMHO this is not very clear to read.

While I like JSON as a format, I think it is a bad structure for expressing transformation rules or context-free grammars. It is not very readable. We have the same issue with the modify.conf syntax.

Maybe there are other good libraries for this purpose that we can re-use?

I can imagine a syntax such as:

IP -> source.ip ()
Time -> time.source (timestamptodatetime)
...

I know this breaks JSON.

Or we start with the JSON format as described by Sebix but please let's move to something readable in version 2.

robcza commented 8 years ago

Looking at the two ideas, I think the one @sebix proposed would be quite clear to code, which is tempting. @aaronkaplan I understand you would prefer a more readable format, but I don't want to introduce a special format just for one particular bot. This should be agreed upon at a higher scope: for the modify bot and possibly others using special configurations.

I suggest using the format @sebix described. I like that it is easily extensible.

aaronkaplan commented 8 years ago

I see that point but to be honest - already the modify.conf file is quite unreadable.

Let's please think about the usability issues some more before implementing new features. In the long run, usability and, most importantly, simplicity are more relevant than the possibility to quickly implement a change.

It's all about the users at the end of the day.

robcza commented 8 years ago

Ok, I agree with your points and I know you have been fighting for this for a long time. I think we should decide on a common notation for all such bots and keep it out of the runtime.conf. Should this be YAML? It is readable and has features similar to JSON.

aaronkaplan commented 8 years ago

Agreed. Do you have a proposal?

robcza commented 8 years ago

Ok, my proposal is as follows. Data:

[
{"ip": "127.0.0.1", "time": "324543262", "type": "bad thing"},
{"ip": "127.0.0.1", "time": "324543262", "type": "unknown"}
]

Configuration:

- translate:
    ip: source.ip
    time: time.source
    type: classification.type
- functions:
    time.source: timestamptodatetime
    classification.type: type_translation
- parameters:
    type_translation:
      bad thing: malware
      unknown: blacklist

Comments and improvement suggestions are welcome.
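
To illustrate how the three sections could work together, here is a minimal sketch. To stay free of third-party dependencies, the configuration is written as the Python structure a YAML loader would be expected to produce from the proposal above. The function signatures and the `FUNCTIONS` registry are assumptions, not existing IntelMQ code.

```python
from datetime import datetime, timezone

# The proposed YAML, written as the equivalent list of one-key dicts.
config = [
    {"translate": {"ip": "source.ip", "time": "time.source",
                   "type": "classification.type"}},
    {"functions": {"time.source": "timestamptodatetime",
                   "classification.type": "type_translation"}},
    {"parameters": {"type_translation": {"bad thing": "malware",
                                         "unknown": "blacklist"}}},
]

def timestamptodatetime(value, params):
    # params is unused; kept so that all functions share one signature.
    return datetime.fromtimestamp(int(value), tz=timezone.utc).isoformat()

def type_translation(value, params):
    # params is the translation table stored under the function's name.
    return params[value]

FUNCTIONS = {"timestamptodatetime": timestamptodatetime,
             "type_translation": type_translation}

def parse_record(record, config):
    sections = {}
    for entry in config:  # merge the one-key dicts into one lookup table
        sections.update(entry)
    event = {}
    for src, dest in sections["translate"].items():
        if src not in record:
            continue
        value = record[src]
        func_name = sections.get("functions", {}).get(dest)
        if func_name:
            params = sections.get("parameters", {}).get(func_name)
            value = FUNCTIONS[func_name](value, params)
        event[dest] = value
    return event

record = {"ip": "127.0.0.1", "time": "324543262", "type": "bad thing"}
event = parse_record(record, config)
print(event)
```

Note that this sketch ties each function to at most one parameter block, which is the limitation discussed further down in the thread.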

dmth commented 8 years ago

I share the thought that parsers should have a common configuration logic. The new shadowserver parsers use something similar, but it is simpler there because the shadowserver CSV does not contain nested objects. If you want to have a look at it: https://github.com/Intevation/intelmq/blob/universal-shadowserver-parser/intelmq/bots/parsers/shadowserver/config.py

robcza commented 8 years ago

@dmth thank you for bringing this into the thread. In your notation the configuration would look like this, I presume:

# timestamptodatetime and type_translation are conversion functions
# that would have to be defined elsewhere in this file.
config = {
    'required_fields': [
        ('time.source', 'time', timestamptodatetime),
        ('source.ip', 'ip'),
        ('classification.type', 'type', type_translation)
    ],
    'parameters': {
        'type_translation': {
            'bad thing': 'malware',
            'unknown': 'blacklist'
        }
    }
}

The notation is readable and seems simple to use. However, I see one major problem: Python code is subject to the AGPL license, and every single instance of IntelMQ would have to publish such a configuration to be compliant. For the shadowserver parser that seems OK; for a generic parser it could be a blocker.

dmth commented 8 years ago

Yes, that would match our idea. Please note: Functions for type-translation are only designed for one parameter right now. That will not be sufficient.

Your licensing concerns are valid if translation functions are created (because they are new code). If the file contains only configuration, it would not be subject to the AGPL. But I can only assume that from my point of view, and I'm not a lawyer.

sebix commented 8 years ago

> Please note: Functions for type-translation are only designed for one parameter right now. That will not be sufficient.

And they only have one return value. That could be problematic too if we want to store one JSON value in two of our fields.

aaronkaplan commented 8 years ago

So, ideally (let's free our minds for a moment from formats) a transformation system looks like this:

time.source = timestamptodatetime(time)        # comment: transform via the function
source.ip = ip                                 # no transformation, simple assignment
classification.type = type_translation(type, translation_table, default="unknown")   # example with multiple parameters, where translation_table of course needs to be defined.

The first parameter of any translation function must be the source value that gets translated. The remaining parameters are optional and carry whatever else is needed for a proper translation; default= values are OK. The translation system must not contain loops or if-thens, so that every translation can be handled in a bounded amount of time (we can give an upper bound).

I believe this type of syntax would be the most readable and the most direct. And hey, it can actually be a subset of regular Python code...
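
If the rules really are a subset of Python, the "no loops, no if-thens" restriction could be checked mechanically. The following is only a sketch using the standard `ast` module; the allow-list of node types and the name `check_rules` are assumptions about what such a subset might contain.

```python
import ast

# Assumed allow-list: assignments, dotted names, calls, constants, keyword args.
ALLOWED_NODES = (
    ast.Module, ast.Assign, ast.Attribute, ast.Name, ast.Call,
    ast.Constant, ast.keyword, ast.Load, ast.Store,
)

def check_rules(source):
    """Return the number of rules, or raise ValueError on a forbidden construct."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"construct not allowed: {type(node).__name__}")
    return len(tree.body)

RULES = '''
time.source = timestamptodatetime(time)   # transform via the function
source.ip = ip                            # simple assignment
classification.type = type_translation(type, translation_table, default="unknown")
'''

print(check_rules(RULES))
```

The check only validates the syntax; it does not evaluate the rules, and it says nothing about how long the called functions themselves run.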

Another alternative (for JSON transformations) might be the jq syntax. But that is not a generic transformation syntax; it is specific to JSON.

robcza commented 7 years ago

Ok, I think we can go with the Python notation. The config file will be loaded dynamically (its name configured as a runtime parameter). Functions could be defined in the config as well; however, such configurations have to be published to be AGPL compliant. I prefer the functions to be defined in the parser itself and documented.

sebix commented 7 years ago

How do you want to configure a hierarchical structure such as JSON? The example in https://github.com/certtools/intelmq/issues/553#issuecomment-230335760 is not hierarchical.

robcza commented 7 years ago

I've been playing with the native Python notation and tried to achieve something similar to what Aaron proposed in https://github.com/certtools/intelmq/issues/553#issuecomment-230456234. The result was quite disappointing, and I would like to try YAML with the function notation:

- translate:
    source.ip: ip
    time.source: timestamptodatetime(time)
    classification.type: type_translation(type, translation_table, default)
- parameters:
    translation_table:
      bad thing: malware
      unknown: blacklist
    default: blacklist

However, multiple function parameters could introduce some unwanted complexity. What do you mean by "hierarchical"? Could you post an example, please?
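
For illustration, a sketch of how the function notation in the YAML values might be evaluated. It assumes the first argument names the source attribute and every further argument is looked up in the parameters section. The two sections are merged into a single dict here for brevity, and the regular expression, `FUNCTIONS` registry and `evaluate` helper are all hypothetical.

```python
import re
from datetime import datetime, timezone

CALL = re.compile(r"^(\w+)\((.*)\)$")  # matches name(arg1, arg2, ...)

config = {
    "translate": {
        "source.ip": "ip",
        "time.source": "timestamptodatetime(time)",
        "classification.type":
            "type_translation(type, translation_table, default)",
    },
    "parameters": {
        "translation_table": {"bad thing": "malware", "unknown": "blacklist"},
        "default": "blacklist",
    },
}

def timestamptodatetime(value):
    return datetime.fromtimestamp(int(value), tz=timezone.utc).isoformat()

def type_translation(value, table, default):
    return table.get(value, default)

FUNCTIONS = {"timestamptodatetime": timestamptodatetime,
             "type_translation": type_translation}

def evaluate(expression, record, parameters):
    match = CALL.match(expression.strip())
    if match is None:
        return record[expression.strip()]  # plain assignment, no function
    name, raw_args = match.groups()
    args = [arg.strip() for arg in raw_args.split(",")]
    # First argument: source attribute. Remaining arguments: named parameters.
    extra = [parameters[arg] for arg in args[1:]]
    return FUNCTIONS[name](record[args[0]], *extra)

def parse_record(record, config):
    return {field: evaluate(expr, record, config["parameters"])
            for field, expr in config["translate"].items()}

record = {"ip": "127.0.0.1", "time": "324543262", "type": "something else"}
event = parse_record(record, config)
print(event)
```

Splitting the arguments on commas is deliberately naive and would break on nested calls or quoted strings, which hints at the complexity that multiple parameters bring.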

sebix commented 7 years ago

We have an existing parser for Fraunhofer DGA. Its test data looks like this:

{
  "banjori_dga_andersensinaix.com_0x3c03": [
    "andersensinaix.com",
    "xjsrrsensinaix.com",
    "hlrfrsensinaix.com",
    "fnosrsensinaix.com",
    "128.238.197.33",
    "lbzorsensinaix.com",
    "sgjprsensinaix.com"
  ]
}

What would the configuration look like?

What other data sources using JSON do you have in mind to use the parser with?
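
To make the difficulty concrete, this is roughly what hand-written code for such data has to do: iterate over a dynamic key, then over a list whose entries are either domains or IP addresses. It is only a sketch and not the existing parser's code; `flatten` and the `group` key are hypothetical, while `source.ip` and `source.fqdn` are the IntelMQ field names assumed as targets.

```python
import ipaddress

def flatten(report):
    """Turn {group: [value, ...]} into one flat dict per list entry."""
    events = []
    for group, values in report.items():  # the key itself is data
        for value in values:
            try:
                ipaddress.ip_address(value)
                field = "source.ip"
            except ValueError:  # not an IP address, treat it as a domain
                field = "source.fqdn"
            events.append({field: value, "group": group})  # "group" is hypothetical
    return events

report = {
    "banjori_dga_andersensinaix.com_0x3c03": [
        "andersensinaix.com",
        "128.238.197.33",
    ]
}
events = flatten(report)
print(events)
```

Neither the dynamic key nor the per-entry type decision maps onto a simple attribute-to-field table, which is the gap a generic configuration would have to cover.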

certtools / intelmq

generic json parser #553