elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

FileBeat: decode_json_fields processor max_depth option not working #19830

Open · ghost opened this issue 4 years ago

ghost commented 4 years ago

To prevent creating tons of document fields in an Elasticsearch log index, I want to control how deeply nested JSON is parsed.

Related discussion post: https://discuss.elastic.co/t/filebeat-decode-json-fields-processor-max-depth-option-not-working/240948

Filebeat version 7.8.0 (also tested on 6.8.10 with the same result)

/tmp/filebeat.conf:

filebeat.inputs:
- type: log
  paths:
    - /tmp/filebeat.input

processors:
  - decode_json_fields:
      fields: ["message"]
      max_depth: 1
      target: "parsed"

output.console:
  pretty: true

/tmp/filebeat.input:

{"top": "top_value", "top_obj": {"level_1": "level_1_value", "level_1_obj": {"level_2": "level_2_value", "level_2_obj": {"level_3": "level_3_value"}}}}

Command:

filebeat  -e -c /tmp/filebeat.conf

Result:

"parsed": {
  "top_obj": {
    "level_1_obj": {
      "level_2": "level_2_value",
      "level_2_obj": {
        "level_3": "level_3_value"
      }
    },
    "level_1": "level_1_value"
  },
  "top": "top_value"
}

Expected result:

"parsed": {
  "top_obj": {
    "level_1_obj": "{\"level_2\": \"level_2_value\", \"level_2_obj\": {\"level_3\": \"level_3_value\"}}",
    "level_1": "level_1_value"
  },
  "top": "top_value"
}
KhaledSakr commented 4 years ago

It works properly. It just doesn't do what you're expecting.

Your JSON input doesn't contain any nested JSON strings; it is a single, fully nested object. If you parse it in the browser with JSON.parse, you get the whole structure back in one pass.

To get your desired effect, the level_1_obj value itself would have to be stringified first:

"level_1_obj":"{\"level_2\":\"level_2_value\",\"level_2_obj\":{\"level_3\":\"level_3_value\"}}"

What max_depth does is recursively try to decode the underlying fields until max_depth is hit. So if you set it to 2, it will still be able to decode "level_1_obj":"{\"level_2\":\"level_2_value\",\"level_2_obj\":{\"level_3\":\"level_3_value\"}}"
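
In other words, to get the expected result above, /tmp/filebeat.input would have to contain level_1_obj already serialized as an escaped string, e.g.:

{"top": "top_value", "top_obj": {"level_1": "level_1_value", "level_1_obj": "{\"level_2\": \"level_2_value\", \"level_2_obj\": {\"level_3\": \"level_3_value\"}}"}}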

ghost commented 4 years ago

Anyway, the documentation is not clear enough for me, and I suppose not only for me but for many other users. The max_depth option behaves more like a limit to prevent stack overflows, not like a way to parse JSON down to N levels and leave everything below that as an unparsed string. I implemented the functionality with Logstash plus the ruby plugin and did all the necessary parsing logic in a Ruby script. Now I have only the first 2 levels as document fields in my Elasticsearch indexes; all deeper subfields are stored as string values of those fields.
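
A minimal Logstash sketch of that idea (the field names and the two-level cutoff here are illustrative, not the exact script):

filter {
  ruby {
    code => '
      require "json"
      parsed = JSON.parse(event.get("message"))
      parsed.each do |k1, v1|
        if v1.is_a?(Hash)
          # keep the second level as fields, but re-serialize anything deeper
          v1.each { |k2, v2| v1[k2] = v2.to_json if v2.is_a?(Hash) || v2.is_a?(Array) }
        end
        event.set("[parsed][#{k1}]", v1)
      end
    '
  }
}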

milesich commented 4 years ago

I understood it exactly the same way as @vitaliy-kravchenko did: that the max_depth option behaves as a limit to prevent mapping explosion.

caiobegotti commented 3 years ago

I have tons of respect for Filebeat and I use it in multiple projects as a collector, but I just spent 3 days trying to debug this until I found this issue, and I agree the documentation is not clear at all about this. While we're at it, I'm not sure that expecting the message to be stringified for this to work properly is reasonable; I've never seen logs like that. Right now I'm trying to fix the problem with some Elasticsearch ingest pipeline trickery, but it's depressing, as Filebeat is so much better than ES pipelines despite this issue... 😢

eedugon commented 3 years ago

@sayden: I think this issue is important for providing a reliable way to prevent mapping explosions.

I'm creating some configuration references for indexing our own Beats logs (running on Kubernetes) in Elasticsearch. With JSON logging support (logging.json: true) this is very straightforward, and the logs can be decoded just by using the decode_json_fields processor.

With max_depth: 1 the objective should (apparently) be to have only the first level of fields decoded (level, timestamp, logger, caller, message, monitoring, etc.), and if any of these are JSON objects they shouldn't be expanded into fields.

As you know, the monitoring part of our log messages (in Filebeat or Metricbeat, for example) is a big JSON object with a lot of sub-objects. Expanding them always makes it very difficult to index our own logs nicely in Elasticsearch: we definitely don't want to create all the monitoring sub-fields in a filebeat index, as that doesn't make any sense, but keeping the long monitoring strings for reference does.
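
For reference, the kind of configuration I have in mind is roughly this (a minimal sketch; the log path is just an example):

filebeat.inputs:
  - type: log
    paths:
      - /var/log/filebeat/filebeat*   # wherever the beat writes its JSON logs (logging.json: true)

processors:
  - decode_json_fields:
      fields: ["message"]
      max_depth: 1      # intent: decode only the first level; keep monitoring etc. as strings
      target: ""        # merge the first-level keys into the event root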

I don't know if this is considered a bug or not (@adriansr might have a different view), but just for your consideration!

dennybaa commented 3 years ago

same behavior:

filebeatConfig:
  filebeat.yml: |
    ## Hints based autodiscover
    filebeat.autodiscover:
      providers:
        - type: kubernetes
          node: ${NODE_NAME}
          hints.enabled: true

    processors:
      - if:
          equals:
            kubernetes.labels.app/logs_json: "true"
        then:
        - decode_json_fields:
            fields: ["message"]
            process_array: false
            max_depth: 1
            target: "foo"
            overwrite_keys: true
            add_error_key: true

Neither process_array nor max_depth has any effect on nesting or JSON parsing; the whole JSON object is always parsed :(

PS: INFO [beat] instance/beat.go:1023 Build info {"system_info": {"build": {"commit": "e127fc31fc6c00fdf8649808f9421d8f8c28b5db", "libbeat": "7.14.0", "time": "2021-07-29T20:56:59.000Z", "version": "7.14.0"}}}

rupppe commented 3 years ago

Same here. This is currently a show-stopper for us, since some fields contain complex data that should not be decoded.

Setting max_depth has no effect. The whole sub-structure is decoded into fields.

Kosmonafft commented 2 years ago

I experience the same with 7.16.2: max_depth has no effect on the parsing of the JSON logs.

ChuckNoxis commented 2 years ago

I experience the same with 7.16.2: max_depth has no effect on the parsing of the JSON logs.

It seems to be the same on 7.16.3

lebenitza commented 2 years ago

This is really important for us too, as going more than 1 or 2 levels deep will eventually break the ES index template mapping and logs will start being dropped, and we cannot be expected to always change how logs are escaped depending on the depth we want to achieve.

Even if Filebeat is able to fully parse a document from the start, we were expecting this setting to adjust the mappings accordingly and to save anything beyond the configured depth as a string.

JoeAshworth commented 2 years ago

We're also experiencing this problem. This is much-needed functionality, it would seem.

rcarpa commented 2 years ago

As a workaround: Isn't it straightforward to use the "script" processor to implement the desired functionality? Either by first applying the decode_json_fields processor, then re-encoding fields into json from javascript; or by doing everything in javascript?

JasonSwarm commented 2 years ago

As a workaround: Isn't it straightforward to use the "script" processor to implement the desired functionality? Either by first applying the decode_json_fields processor, then re-encoding fields into json from javascript; or by doing everything in javascript?

Good idea. I did exactly what you said, and it works well. Here is my code:

- script:
    lang: javascript
    source: >
      function process(event) {
          // Re-serialize any object-valued field under "data" back into a JSON
          // string, so that only the first level ends up as document fields.
          for (var p in event.Get("data")) {
            if (event.Get("data")[p] != null && typeof event.Get("data")[p] == 'object') {
              event.Put("data." + p, JSON.stringify(event.Get("data")[p]))
            }
          }
      }
Kosmonafft commented 1 year ago

Does anyone know if this problem persists in Filebeat or Elastic Agent 8?

usersina commented 9 months ago

I also have the same problem using ECK.