elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

unpacking json to toplevel overrides @metadata #15323

Open Mekk opened 1 year ago

Mekk commented 1 year ago

A problem I faced in a rather simple setup:

It turned out that, in spite of using decorate_events => "basic", [@metadata][kafka] was not available, and it took me quite a lot of time to find out why. It looks like the json filter, while unpacking, removed the @metadata block.

Relevant parts of the config (my actual config is more complicated but other elements are irrelevant here):

input {
    kafka {
        # … bootstrap_servers, topics, group_id, etc. …

        decorate_events => "basic"   # adds [@metadata][kafka][topic], …
    }
}

filter {
    json {
        source => "message"
        # No
        #   target => …
        # so fields are unpacked to the top level. This is on purpose,
        # as it is what we want in this setup: restoring the filebeat
        # record after kafka.
    }
}

filter {
    mutate {
        copy => { "[@metadata][kafka][topic]" => "[kafka][topic]" }
    }
}

output {
    elasticsearch {
        # …
    }
}

I expected to see kafka.topic in the results, but this field was simply missing.

It turned out that replacing the first filter with

filter {
    mutate {
        rename => { "[@metadata][kafka]" => "[@metabackup][kafka]" }
    }
    json {
        source => "message"
    }
    mutate {
        rename => { "[@metabackup][kafka]" => "[@metadata][kafka]" }
        remove_field => [ "[@metabackup]" ]
    }
}

helped¹, so it looks like the json filter removed @metadata for some reason.

The reason is even more unclear, as it seems to me that no fields appeared under @metadata (so it is not even the case of "there was @metadata.something in the filebeat output, and that is why json replaced this block") - at least it seems so to me after some staring at the rubydebug output.
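
Worth noting when inspecting events this way: the rubydebug codec hides [@metadata] unless its metadata option is enabled, so a debug-only output along these lines is needed to actually see that tree:

output {
    stdout {
        # Debug-only: with metadata => true rubydebug also prints [@metadata].
        codec => rubydebug { metadata => true }
    }
}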

Once the problem is known, it is rather easy to work around (for example as above), but for the unaware it is very confusing.

¹ Of course, swapping the filter order would likely help too, but in my case the actual processing of the kafka metadata was different and needed both the kafka metadata and the unpacked fields.
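
For setups where that reordering is enough, it would look roughly like this (a sketch using the same field names as the config above):

filter {
    mutate {
        # Grab what is needed from the Kafka metadata before the json filter
        # gets a chance to drop [@metadata].
        copy => { "[@metadata][kafka][topic]" => "[kafka][topic]" }
    }
    json {
        source => "message"
    }
}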

Mekk commented 1 year ago

Preferable solution:

a) the json filter doesn't remove @metadata when there are no @metadata.something fields in the unpacked record

b) even if the unpacked record happens to contain @metadata.something, the json filter tries to merge those blocks (@metadata is special, it is more-or-less a set of local variables), or, if this is too difficult, the override is avoided in some other way (for example by renaming the unpacked @metadata to something else); a sketch of approximating such a merge with today's filters follows below

Minimal solution:

c) if the behaviour is to stay, add a clear warning to the json filter docs that using this filter in toplevel mode (without target) removes @metadata, which must be preserved separately if needed.
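
For reference, one way to approximate the merge from (b) with today's plugins is to unpack into a temporary target and copy its keys to the top level in a ruby filter, leaving [@metadata] untouched. This is only a sketch; the [_unpacked] name is made up, and it does not decide what to do if the payload itself contains an @metadata key:

filter {
    json {
        source => "message"
        target => "[_unpacked]"    # temporary field; the name is illustrative
    }
    ruby {
        # Copy the unpacked keys to the top level ourselves, leaving
        # [@metadata] untouched, then drop the temporary field.
        code => '
            unpacked = event.get("[_unpacked]")
            if unpacked.is_a?(Hash)
                unpacked.each { |k, v| event.set("[#{k}]", v) }
            end
            event.remove("[_unpacked]")
        '
    }
}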

neu5ron commented 11 months ago

Is it possible for the json filter to call the Ruby event API's to_json_with_metadata?

The @metadata issue causes a whole host of problems when using beats/agent to kafka, or to logstash for that matter. I found a workaround, but it will probably require rearranging your pipeline, as it did mine. I also love how ecs_compatibility duplicates message sizes because of message and event.original. The verbiage of everything is also unfortunate: message was one thing for years (and was the thing to deal with when data sources also used message), but now event.original, which by definition would seem to be the original/raw, is in fact not. It is usually a variation of the "payload" of the data, i.e. the thing without the added fields (agent, host, file, etc.). It then becomes really difficult to work out what is original/raw when something was already in JSON and has had fields added by agents and such.
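
For illustration, the Event API's to_hash_with_metadata (also mentioned in the table below) can already be used from a ruby filter to serialize an event including its [@metadata]. This is only a sketch of the idea, not the json filter doing it, and the [debug][with_metadata] field name is made up:

filter {
    ruby {
        # Sketch only: dump the whole event, [@metadata] included, into a
        # made-up field so it can be inspected or kept around.
        code => '
            require "json"
            event.set("[debug][with_metadata]", event.to_hash_with_metadata.to_json)
        '
    }
}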

Anyway, there are 6 options (assuming you're using logstash 8.x). I think there is only one solution for myself, which is #2, but you may be able to use #3 or #4. Hope the table below helps anybody else making the choice. All of them have cons, but given my need to reduce ram/heap at hundreds of thousands of EPS, to stay compatible with ECS ingest pipelines and the custom parsers I need, and to control the data, its integrity, and its flow appropriately, I am left with option #2.

| # | simple code (ecs_compatibility not specified; Logstash 8 defaults) | full code (ecs_compatibility specified; Logstash 8) | codec | ecs_compatibility | target | original/raw kept? (in which fields) | JSON of original/raw? (where it is placed) | how the original/raw is kept | pros | cons |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | `codec => plain` | `codec => plain { ecs_compatibility => "v8" }` | plain | v8 | n/a | message, event.original | n/a | raw in message, raw in event.original | ability to control @metadata merging<br>ability to control raw<br>ability to control json<br>compatibility with certain ingest pipelines, as some think event.original is the raw json and some think it is the message/log payload | duplication of the raw (message & event.original), heap overhead of the log<br>requires manually controlling json (the json_encode plugin is not installed automatically for logstash, or use ruby)<br>manual merge of @metadata<br>a pain to manage message<br>a pain to manage event.original |
| 2 | n/a | `codec => plain { ecs_compatibility => "disabled" }` | plain | disabled | n/a | message | n/a | raw in message | no duplication of the raw, lower heap overhead<br>ability to control @metadata merging<br>ability to control raw<br>ability to control json<br>compatibility with certain ingest pipelines, as some think event.original is the raw json and some think it is the message/log payload | requires manually controlling json, either via filters (however, the json_encode plugin is not installed automatically for logstash): json w/ target to temp_target, json_encode over temp_target, json w/ no target and remove temp_target (will finally merge to root), sketched below; or via ruby, using to_hash_with_metadata<br>manual merge of @metadata<br>a pain to manage a log with message contained within it |
| 3 | `codec => json` | `codec => json { ecs_compatibility => "v8" }` | json | v8 | n/a | event.original | root (merged to root) | raw in event.original | automatically merges @metadata<br>automatically controls json | difficult/impossible to unwind/control what was added to root if needed after the fact<br>if event.original changes, it can be impossible to figure out what was from root and what was not<br>unable to control @metadata merging<br>difficulty with compatibility with certain ingest pipelines, as some think event.original is the raw json and some think it is the message/log payload |
| 4 | n/a | `codec => json { ecs_compatibility => "v8" target => "nested_target" }` | json | v8 | nested_target | event.original | nested_target | raw in nested_target, raw in event.original | ability to control @metadata merging<br>ability to control raw<br>ability to control json<br>compatibility with certain ingest pipelines, as some think event.original is the raw json and some think it is the message/log payload | requires manually controlling json, via filters (the json_encode plugin is not installed automatically for logstash) or via ruby<br>manual merge of @metadata<br>if the nested target is not needed, as for some ingest pipelines, heap overhead of the log |
| 5 | n/a | `codec => json { ecs_compatibility => "disabled" target => "nested_target" }` | json | disabled | nested_target | n/a | nested_target | raw in nested_target | can't use | can't use |
| 6 | n/a | `codec => json { ecs_compatibility => "disabled" }` | json | disabled | n/a | n/a | root (merged to root) | n/a | can't use | can't use |
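
For option 2, the filter-based route from the cons column could look roughly like this. It is one possible reading of those steps, not a definitive recipe: temp_target is an illustrative name, and json_encode has to be installed separately since it is not bundled with Logstash.

filter {
    json {
        source => "message"
        target => "[temp_target]"     # unpack the payload into a temporary field
    }
    json_encode {
        source => "[temp_target]"
        target => "message"           # keep a re-encoded copy of just the payload as the raw
    }
    json {
        source => "message"           # no target: parsed fields merge to the top level
        remove_field => [ "[temp_target]" ]
    }
}
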
neu5ron commented 10 months ago

To add to this: ingest pipelines sometimes expect event.original and sometimes expect message; some copy message to event.original and remove message; some copy event.original to message but still use message.
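
So, when a given pipeline expects event.original but the event only carries message, the usual workaround is a copy before the output. A minimal sketch, not tied to any particular ingest pipeline:

filter {
    if ![event][original] {
        mutate {
            # Populate [event][original] from [message] for pipelines that expect it.
            copy => { "message" => "[event][original]" }
        }
    }
}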