fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Add option to JSON parser to set Parent_Key to avoid parsing onto the Root of the Data Structure #7677

Closed · meehanman closed this issue 10 months ago

meehanman commented 1 year ago

Is your feature request related to a problem? Please describe.

Currently, when using the JSON parser, if the original log source is a JSON map string, the parser takes its structure and converts it directly into the internal binary representation.

For example, the input

{"key1": 12345, "key2": "abc", "time": "2006-07-28T13:22:04Z"}

will be processed to:

[1154103724, {"key1"=>12345, "key2"=>"abc"}]

This same behaviour applies to all other parsers. For example, the regular expression

^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$

will parse

192.168.2.20 - - [29/Jul/2015:10:27:10 -0300] "GET /cgi-bin/try/ HTTP/1.0" 200 3395

to

[1154104030, {"host"=>"192.168.2.20",
              "user"=>"-",
              "method"=>"GET",
              "path"=>"/cgi-bin/try/",
              "code"=>"200",
              "size"=>"3395",
              "referer"=>"",
              "agent"=>""
              }
]
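For reference, that behaviour comes from a regex parser definition along these lines (the stock apache parser shipped with Fluent Bit is essentially this):

[PARSER]
    Name        apache
    Format      regex
    Regex       ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
    Time_Key    time
    Time_Format %d/%b/%Y:%H:%M:%S %z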

The difference between regular-expression parsing and JSON parsing is that with a regular expression we name the fields ourselves, giving us full control over how they are extracted.

Because we can set custom names for regular-expression fields, we can then use other filters such as Nest to nest specific fields by name, or to nest all fields that start with a prefix such as nest_ (see the sketch below).
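For example, assuming fields were extracted with a nest_ prefix, a Nest filter can group them under a single key and strip the prefix (all of these are existing Nest filter options):

[FILTER]
    Name          nest
    Match         *
    Operation     nest
    Wildcard      nest_*
    Nest_under    event
    Remove_prefix nest_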

With the JSON parser this isn't possible and the fields are just extracted onto the Root Data Structure.

Describe the solution you'd like

Specifically, the JSON parser should have the option either to parse into a named field within the data structure rather than directly onto the Root Data Structure, or to prefix the extracted root keys so they can be identified later in the Data Pipeline for processing.

For example:

Parent_Key

[PARSER]
    Name        docker
    Format      json
    Time_Key    time
    Parent_Key  event
    Time_Format %Y-%m-%dT%H:%M:%S %z

...
{"key1": 12345, "key2": "abc", "time": "2006-07-28T13:22:04Z"}

will be processed to:

[1154103724, {"event": {"key1"=>12345, "key2"=>"abc"}}]

Extracted_key_prefix

[PARSER]
    Name                 docker
    Format               json
    Time_Key             event_time
    Extracted_key_prefix event_
    Time_Format          %Y-%m-%dT%H:%M:%S %z

...
{"key1": 12345, "key2": "abc", "time": "2006-07-28T13:22:04Z"}

will be processed to:

[1154103724, {"event_key1"=>12345, "event_key2"=>"abc"}]

An alternative approach would be to implement this on the parser filter and inputs that use parsers, so it would take effect for the JSON parser and all other parsers alike. This would make JSON parsing predictable when processing logs from many or unknown sources. A hypothetical sketch is shown below.
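As a purely hypothetical sketch of that alternative (Parent_Key does not exist on the parser filter today; Name, Match, Key_Name and Parser are its real options):

[FILTER]
    Name       parser
    Match      *
    Key_Name   log
    Parser     docker
    Parent_Key event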

Describe alternatives you've considered

The only real workaround for this problem with JSON parsing is to:

  1. Before Parsing - Prefix all current keys with prefix_
  2. Parse - Parse prefix_log to JSON
  3. After Parsing - All keys with prefix_ can be identified as not being from the parsed JSON

We can then use a Lua filter to reshape the logs into the format we are looking for, without the risk of overwriting keys that may already be set on the Root Data Structure; a sketch follows. There are no other 'native' ways to do this without ending up with duplicate keys.
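A minimal sketch of such a Lua filter, assuming step 1 marked the original keys with prefix_ so that every unprefixed key must have come from the parsed JSON:

function reshape(tag, timestamp, record)
    local out = { event = {} }
    for k, v in pairs(record) do
        local original = string.match(k, "^prefix_(.*)")
        if original then
            out[original] = v   -- a marked key: restore its original name
        else
            out.event[k] = v    -- a parsed JSON key: nest it under "event"
        end
    end
    return 1, timestamp, out    -- 1 tells Fluent Bit the record was modified
end

Registered with the standard Lua filter:

[FILTER]
    Name   lua
    Match  *
    script reshape.lua
    call   reshape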

Additional context

The Splunk_HEC forwarder does a poor job of formatting logs correctly: metadata added by Fluent Bit should appear only under the fields metadata, but instead all fields, including host, source, sourcetype and index, end up inside the event. This causes additional log size and storage/processing costs.

To have better control over what is sent, we utilise the HTTP collector, which gives us control of exactly what we are sending. For additional context: the Splunk HEC event endpoint will only accept logs formatted in the following structure:

{
    "time": 1426279439, // epoch time
    "host": "localhost",
    "source": "random-data-generator",
    "sourcetype": "my_sample_data",
    "index": "main",
    "fields": {"billing_team": "custom_team", "internal_field": "internal_value"}
    "event":  "Hello world!" // Can be a string or object
}

(Note: most of these fields are optional.)

We are using Fluent Bit to set log context via the fields key; all log data should be processed so that it ends up under the event key.
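For reference, the kind of HTTP output configuration this implies (host and token are placeholders; header and URI are existing http output options):

[OUTPUT]
    Name   http
    Match  *
    Host   splunk.example.com
    Port   8088
    URI    /services/collector/event
    Format json
    header Authorization Splunk <your-hec-token>
    tls    On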

github-actions[bot] commented 11 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] commented 10 months ago

This issue was closed because it has been stalled for 5 days with no activity.

icy commented 8 months ago

I'm also interested in this feature. Thanks.

PS: I think this can be done with nest operations btw
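For example, something like the following nests every key under event, though with Wildcard * it also nests the metadata keys, which is exactly what the prefix workaround above tries to avoid:

[FILTER]
    Name       nest
    Match      *
    Operation  nest
    Wildcard   *
    Nest_under event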