Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

Beats input strips JSON arrays containing JSON objects from Packetbeat messages. #18416

Open miwent opened 8 months ago

miwent commented 8 months ago

Packetbeat fields that contain a JSON array of JSON objects have the contained JSON objects removed, resulting in an empty field.

Expected Behavior

Packetbeat fields that contain a JSON array of JSON objects retain those values. The field should at least be stored as a string in order to allow parsing and processing with the JSON processing tools in the pipeline.

Current Behavior

Packetbeat DNS logs contain a field packetbeat_dns_answers that is a list of the records that are captured when a DNS query+response are logged. This field is logged in Graylog as an empty list:

image

Configuring packetbeat to output to the console shows that the field does have a value though:

"dns": {
    "additionals_count": 0,
    "answers": [
      {
        "class": "IN",
        "data": "2001:4998:124:1507::f001",
        "name": "yahoo.com",
        "ttl": "114",
        "type": "AAAA"
      },
      {
        "class": "IN",
        "data": "2001:4998:44:3507::8000",
        "name": "yahoo.com",
        "ttl": "114",
        "type": "AAAA"
      },
      {
        "class": "IN",
        "data": "2001:4998:44:3507::8001",
        "name": "yahoo.com",
        "ttl": "114",
        "type": "AAAA"
      },
      {
        "class": "IN",
        "data": "2001:4998:24:120d::1:0",
        "name": "yahoo.com",
        "ttl": "114",
        "type": "AAAA"
      },
      {
        "class": "IN",
        "data": "2001:4998:124:1507::f000",
        "name": "yahoo.com",
        "ttl": "114",
        "type": "AAAA"
      },
      {
        "class": "IN",
        "data": "2001:4998:24:120d::1:1",
        "name": "yahoo.com",
        "ttl": "114",
        "type": "AAAA"
      }
    ],

I tested this by adding a field in the Packetbeat config:

processors:
  - add_fields:
      target: _testjson
      fields:
        name: [{"testkey0": "value"},{"testkey1":"value1"}]

And the field was added but it was also added as an empty array:

image

Possible Solution

Steps to Reproduce (for bugs)

  1. Download, configure, and run the latest Packetbeat agent and send logs to the Graylog beats input
  2. Generate sample DNS traffic where the Packetbeat agent will see it
  3. Examine the logs sent to Graylog, specifically the packetbeat_dns_answers field

Context

The DNS answers are among the more meaningful bits of information. It is almost impossible to reconstruct this information reliably from the remaining fields that are indexed.

Your Environment

waab76 commented 7 months ago

We don't necessarily have to parse the JSON array; we can just store the value as a string so Content can make use of it.

ryan-carroll-graylog commented 7 months ago

@miwent digging into this, it looks like when a log contains nested JSON, this input breaks it out into separate message fields. For arrays, it appends an index to the field name.

So for the example you provided, the packetbeat_dns_answers field is empty but its contents are stored in packetbeat_dns_answers_0_class, packetbeat_dns_answers_0_data, packetbeat_dns_answers_1_class, etc. So the data is there, albeit maybe not where expected.

Just wanted to clarify whether you were tracking this and it's insufficient, and we additionally want the packetbeat_dns_answers field populated with a string representation of the JSON array?

Here's a full example of current behavior:

The incoming Packetbeat log:

{
  "@metadata": {
    "beat": "packetbeat"
  },
  "@timestamp": "2016-04-01T00:00:00.000Z",
  "beat": {
    "hostname": "example.local",
    "name": "example.local"
  },
  "bytes_in": 35,
  "bytes_out": 51,
  "client_ip": "192.168.0.10",
  "client_port": 57935,
  "client_proc": "",
  "client_server": "",
  "count": 1,
  "direction": "out",
  "dns": {
    "additionals_count": 0,
    "answers": [
      {
        "class": "IN",
        "data": "2001:4998:124:1507::f001",
        "name": "yahoo.com",
        "ttl": "114",
        "type": "AAAA"
      },
      {
        "class": "IN",
        "data": "2001:4998:44:3507::8000",
        "name": "yahoo.com",
        "ttl": "114",
        "type": "AAAA"
      },
      {
        "class": "IN",
        "data": "2001:4998:44:3507::8001",
        "name": "yahoo.com",
        "ttl": "114",
        "type": "AAAA"
      },
      {
        "class": "IN",
        "data": "2001:4998:24:120d::1:0",
        "name": "yahoo.com",
        "ttl": "114",
        "type": "AAAA"
      },
      {
        "class": "IN",
        "data": "2001:4998:124:1507::f000",
        "name": "yahoo.com",
        "ttl": "114",
        "type": "AAAA"
      },
      {
        "class": "IN",
        "data": "2001:4998:24:120d::1:1",
        "name": "yahoo.com",
        "ttl": "114",
        "type": "AAAA"
      }
    ],
    "answers_count": 1,
    "authorities_count": 0,
    "flags": {
      "authoritative": false,
      "recursion_allowed": true,
      "recursion_desired": true,
      "truncated_response": false
    },
    "id": 9819,
    "op_code": "QUERY",
    "question": {
      "class": "IN",
      "name": "www3.l.google.com",
      "type": "A"
    },
    "response_code": "NOERROR"
  },
  "ip": "192.168.0.1",
  "method": "QUERY",
  "port": 53,
  "proc": "",
  "query": "class IN, type A, www3.l.google.com",
  "resource": "www3.l.google.com",
  "responsetime": 15,
  "server": "",
  "status": "OK",
  "transport": "udp",
  "type": "dns"
}

is currently parsed into these message fields:

{
  "packetbeat_beat_name": "example.local",
  "packetbeat_bytes_in": 35,
  "packetbeat_dns_answers_3_ttl": "114",
  "packetbeat_method": "QUERY",
  "packetbeat_type": "dns",
  "packetbeat_query": "class IN, type A, www3.l.google.com",
  "packetbeat_dns_answers_1_name": "yahoo.com",
  "packetbeat_dns_answers_4_ttl": "114",
  "packetbeat_dns_answers_count": 1,
  "source": "example.local",
  "packetbeat_dns_answers_3_data": "2001:4998:24:120d::1:0",
  "packetbeat_dns_answers_4_type": "AAAA",
  "packetbeat_dns_answers_2_ttl": "114",
  "packetbeat_direction": "out",
  "packetbeat_dns_flags_truncated_response": false,
  "packetbeat_dns_answers_5_ttl": "114",
  "packetbeat_dns_answers_2_data": "2001:4998:44:3507::8001",
  "packetbeat_@metadata_beat": "packetbeat",
  "packetbeat_dns_flags_authoritative": false,
  "packetbeat_status": "OK",
  "packetbeat_dns_answers_5_class": "IN",
  "packetbeat_ip": "192.168.0.1",
  "packetbeat_dns_answers_1_ttl": "114",
  "packetbeat_dns_answers_2_type": "AAAA",
  "packetbeat_dns_answers_3_name": "yahoo.com",
  "packetbeat_dns_answers_0_name": "yahoo.com",
  "packetbeat_dns_flags_recursion_desired": true,
  "packetbeat_transport": "udp",
  "packetbeat_dns_authorities_count": 0,
  "packetbeat_resource": "www3.l.google.com",
  "packetbeat_@timestamp": "2016-04-01T00:00:00.000Z",
  "packetbeat_dns_answers_2_class": "IN",
  "packetbeat_dns_answers_5_type": "AAAA",
  "packetbeat_dns_answers_0_ttl": "114",
  "packetbeat_dns_question_type": "A",
  "packetbeat_dns_answers_1_data": "2001:4998:44:3507::8000",
  "packetbeat_dns_id": 9819,
  "_id": "e99207b0-f8d8-11ee-b05d-ba5db92f0eae",
  "packetbeat_dns_answers_4_data": "2001:4998:124:1507::f000",
  "packetbeat_dns_answers_0_data": "2001:4998:124:1507::f001",
  "packetbeat_responsetime": 15,
  "packetbeat_dns_answers_4_name": "yahoo.com",
  "packetbeat_dns_question_name": "www3.l.google.com",
  "packetbeat_dns_additionals_count": 0,
  "packetbeat_dns_answers_0_type": "AAAA",
  "beats_type": "packetbeat",
  "packetbeat_dns_answers_3_class": "IN",
  "packetbeat_dns_answers_5_data": "2001:4998:24:120d::1:1",
  "packetbeat_dns_response_code": "NOERROR",
  "packetbeat_client_ip": "192.168.0.10",
  "packetbeat_dns_flags_recursion_allowed": true,
  "packetbeat_dns_question_class": "IN",
  "packetbeat_dns_answers_0_class": "IN",
  "packetbeat_client_port": 57935,
  "timestamp": "2016-04-01T00:00:00.000Z",
  "packetbeat_dns_answers_1_type": "AAAA",
  "packetbeat_dns_answers_4_class": "IN",
  "packetbeat_dns_op_code": "QUERY",
  "packetbeat_bytes_out": 51,
  "packetbeat_beat_hostname": "example.local",
  "packetbeat_dns_answers": [],
  "message": "-",
  "packetbeat_dns_answers_1_class": "IN",
  "packetbeat_dns_answers_5_name": "yahoo.com",
  "packetbeat_count": 1,
  "packetbeat_dns_answers_3_type": "AAAA",
  "packetbeat_dns_answers_2_name": "yahoo.com",
  "packetbeat_port": 53
}
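
For readers who want to reason about the mapping above, here is a rough Python sketch of the flattening as described in this comment. It is not the actual Graylog input code: the function name `flatten` and the handling of scalar array items (left in the array field) are assumptions made for illustration.

```python
from typing import Any, Dict


def flatten(value: Any, prefix: str, out: Dict[str, Any]) -> None:
    """Flatten nested dicts and lists into underscore-joined field names."""
    if isinstance(value, dict):
        for key, child in value.items():
            flatten(child, f"{prefix}_{key}", out)
    elif isinstance(value, list):
        # Assumption: scalar items stay in the array field, while nested
        # objects are moved out under an index suffix. An array of objects
        # therefore leaves an empty list behind.
        out[prefix] = [item for item in value if not isinstance(item, (dict, list))]
        for index, item in enumerate(value):
            if isinstance(item, (dict, list)):
                flatten(item, f"{prefix}_{index}", out)
    else:
        out[prefix] = value


event = {
    "dns": {
        "answers": [
            {"type": "AAAA", "data": "2001:4998:124:1507::f001"},
            {"type": "AAAA", "data": "2001:4998:44:3507::8000"},
        ]
    }
}
fields: Dict[str, Any] = {}
flatten(event, "packetbeat", fields)
```

After this runs, `fields["packetbeat_dns_answers"]` is an empty list and the answer values live under names such as `packetbeat_dns_answers_0_data`, which matches the shape of the output shown above.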

miwent commented 7 months ago

@ryan-carroll-graylog by removing the source JSON array contents we can't do any additional processing - in this case, we would want to use a jsonpath() statement to put all of the answers as values in to one field, and possibly the same for the record types, etc. This would be pretty easy to do with jsonpath() using the original JSON but trying to rebuild that data based on the flattened fields would be imprecise and difficult.

If possible, it would be nice to always retain the source data but provide an option to flatten arrays containing JSON, but we at least need to have the original JSON values in the arrays.
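
To make the difference concrete, here is a small Python sketch (a stand-in for pipeline logic, not Graylog rule syntax). With the original array, collecting every answer is one expression, roughly what a JSONPath like `$.dns.answers[*].data` would select. With only flattened fields, the values have to be matched by field-name pattern and re-ordered by index. The helper name `answers_from_flattened` and the sample values are illustrative assumptions.

```python
import json
import re

original = json.loads(
    '{"dns": {"answers": ['
    '{"data": "2001:4998:124:1507::f001", "type": "AAAA"},'
    '{"data": "2001:4998:44:3507::8000", "type": "AAAA"}]}}'
)

# With the original array intact, one expression collects every answer.
answers_from_json = [answer["data"] for answer in original["dns"]["answers"]]

# With only the flattened fields, the answers are spread over many names.
flattened = {
    "packetbeat_dns_answers": [],
    "packetbeat_dns_answers_1_data": "2001:4998:44:3507::8000",
    "packetbeat_dns_answers_0_data": "2001:4998:124:1507::f001",
    "packetbeat_dns_answers_0_type": "AAAA",
    "packetbeat_dns_answers_1_type": "AAAA",
}


def answers_from_flattened(fields):
    """Rebuild the answer list by matching field names and sorting by index."""
    pattern = re.compile(r"^packetbeat_dns_answers_(\d+)_data$")
    indexed = []
    for name, value in fields.items():
        match = pattern.match(name)
        if match:
            indexed.append((int(match.group(1)), value))
    return [value for _, value in sorted(indexed)]
```

The second approach only works if the number of answers and the exact field naming are known in advance, which is the imprecision described above.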

ryan-carroll-graylog commented 7 months ago

Ah gotcha, totally makes sense @miwent. From looking at the input, it should be pretty straightforward to add the answers array intact.

I'll see about making the flattening behavior optional across the board in the input config too.

bernd commented 7 months ago

@ryan-carroll-graylog Please ensure that the changes we make don't break or change the behavior of the input for existing users. 🙂

ryan-carroll-graylog commented 7 months ago

For sure @bernd, good call out! Will keep a close eye on backwards compatibility.

Do you see any issue with populating the original array fields in addition to flattening them? This would be a change in behavior, but strictly additive.

Another option is a new "Flatten JSON" config option that defaults to the current behavior but that could be toggled off to store only top level fields with their raw json objects.

A third option that just occurred to me, and let me know if this would work for you @miwent, is just adding the standard "Store full message" option, defaulting to off. This seems like the least intrusive and most flexible option to me if it provides us what we need for Illuminate.
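
As a rough illustration of how the first two options would change the resulting fields, here is a Python sketch. The function `build_fields` and the flags `flatten_json` and `keep_arrays` are hypothetical names for this example only and do not exist in the input. The third option ("Store full message") is not modeled.

```python
import json
from typing import Any, Dict


def build_fields(event: Dict[str, Any], prefix: str,
                 flatten_json: bool = True,
                 keep_arrays: bool = False) -> Dict[str, Any]:
    out: Dict[str, Any] = {}

    def walk(value: Any, name: str) -> None:
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{name}_{key}")
        elif isinstance(value, list):
            scalars = [item for item in value if not isinstance(item, dict)]
            # keep_arrays models the additive option: the array field keeps
            # a JSON string copy while the indexed fields are still emitted.
            out[name] = json.dumps(value) if keep_arrays else scalars
            for index, item in enumerate(value):
                if isinstance(item, dict):
                    walk(item, f"{name}_{index}")
        else:
            out[name] = value

    if flatten_json:
        walk(event, prefix)
    else:
        # "Flatten JSON" off: top-level fields only, nested values stored
        # as raw JSON strings.
        for key, value in event.items():
            nested = isinstance(value, (dict, list))
            out[f"{prefix}_{key}"] = json.dumps(value) if nested else value
    return out


event = {"dns": {"answers": [{"type": "AAAA"}]}}
current = build_fields(event, "packetbeat")
additive = build_fields(event, "packetbeat", keep_arrays=True)
unflattened = build_fields(event, "packetbeat", flatten_json=False)
```

Only the `additive` variant keeps both the indexed fields and a parseable copy of the array, so existing users would see the same flattened fields as today.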

miwent commented 7 months ago

My concern with the option of storing a full message would be that the beats logs are already quite large, and users may not be happy with what would effectively double the message size. Having the option to store full messages would probably be nice in general, but not having the original fields would still be an issue on the Illuminate side, since I would expect many (if not most) customers to decide not to store the full message. Illuminate would still require the original fields/values to exist in that case.

bernd commented 7 months ago

What do you think about the following? @ryan-carroll-graylog @miwent

We add #setOriginalSource() and #getOriginalSource() methods to the Message class that set and return the original message as bytes or string. In addition to that, we add has_original_source and get_original_source (with an optional default return value) pipeline functions to check and retrieve the original source value.

In input codecs, we can add the original source to each message object. Not into the message fields map, but either as a separate field or the existing metadata map. That way the original source doesn't get indexed into OpenSearch.

The main drawback is that the memory consumption of the in-flight Message object increases. We can benchmark that to see how big of an impact that is. We can also add an option to inputs to disable the storage of the original source data in the in-memory message object. (default is enabled)
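
A minimal Python sketch of the idea, purely to show the shape of the proposal. The real Message class is Java and none of this exists yet; the class, the toy `decode` codec, and the `keep_original_source` flag are all hypothetical stand-ins.

```python
import json
from typing import Any, Dict, Optional


class Message:
    """Hypothetical stand-in for the Message class discussed here."""

    def __init__(self, fields: Dict[str, Any]) -> None:
        self.fields = fields  # the map that would be indexed
        # Kept outside the fields map so it never reaches the index.
        self._original_source: Optional[bytes] = None

    def set_original_source(self, source: bytes) -> None:
        self._original_source = source

    def get_original_source(self) -> Optional[bytes]:
        return self._original_source


def has_original_source(message: Message) -> bool:
    """Sketch of the proposed pipeline function of the same name."""
    return message.get_original_source() is not None


def get_original_source(message: Message,
                        default: Optional[str] = None) -> Optional[str]:
    """Sketch of the proposed pipeline function with an optional default."""
    source = message.get_original_source()
    return source.decode("utf-8") if source is not None else default


def decode(payload: bytes, keep_original_source: bool = True) -> Message:
    """Toy codec: attaches the raw payload unless the input option is off."""
    message = Message({"message": "-"})
    if keep_original_source:
        message.set_original_source(payload)
    return message


raw = b'{"dns": {"answers": [{"data": "2001:4998:124:1507::f001"}]}}'
msg = decode(raw)
answers = json.loads(get_original_source(msg, "{}"))["dns"]["answers"]
```

A pipeline rule could then parse the original array on demand, while the indexed fields stay exactly as they are today.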

Would that help with the packetbeat use case?

/cc @Graylog2/architecture

ryan-carroll-graylog commented 7 months ago

@bernd that's a really awesome idea! I think it would solve a lot of the issues we've run into with Illuminate where we've needed the full message and had to resort to using the full_message field, which as @miwent pointed out can be a pain point with users.

Definitely would be a different way of doing things on the Illuminate parsing side so curious on your thoughts @miwent.

/cc @kingzacko1 @danotorrey for y'all's thoughts too.

miwent commented 6 months ago

@bernd I second that, it would work for Packetbeat and I could see it being useful in other scenarios.

ryan-carroll-graylog commented 6 months ago

@miwent what is the timeline that a solution is needed for this?

Since the scope of @bernd's proposal is bigger than what we had originally allotted for this issue in this cycle (we should get @Graylog2/architecture buy-in, sync up on how this needs to work for Illuminate to use the pipeline rules, and plan extra dedicated testing), the team was thinking we need to push the original source changes to a later cycle.

Just want to make sure that doesn't interfere with any planned Illuminate releases. If it does, maybe explore some short term solutions.

/cc @rich-graylog @waab76

miwent commented 6 months ago

We already have a temporary workaround in place. It's not ideal, but it does allow us to be patient with this.

ryan-carroll-graylog commented 6 months ago

Created a new issue to track Bernd's originalSource idea. Added it to the TDIR cycle 3 board "Waiting Prioritization" @rich-graylog but feel free to move it.