Open miwent opened 8 months ago
Don't necessarily have to parse the JSON array, we can just store the value as a string so Content can make use of it.
@miwent digging into this, it looks like when this input receives a log containing nested JSON, it breaks that out into separate message fields. For arrays it appends an index to the field name.
So for the example you provided, the `packetbeat_dns_answers` field is empty but its contents are stored in `packetbeat_dns_answers_0_class`, `packetbeat_dns_answers_0_data`, `packetbeat_dns_answers_1_class`, etc. So the data is there, albeit maybe not where expected.
Just wanted to clarify whether you were tracking this and it's insufficient, and we additionally want the `packetbeat_dns_answers` field populated with a string representation of the JSON array?
Here's a full example of current behavior:
The incoming Packetbeat log:
{
"@metadata": {
"beat": "packetbeat"
},
"@timestamp": "2016-04-01T00:00:00.000Z",
"beat": {
"hostname": "example.local",
"name": "example.local"
},
"bytes_in": 35,
"bytes_out": 51,
"client_ip": "192.168.0.10",
"client_port": 57935,
"client_proc": "",
"client_server": "",
"count": 1,
"direction": "out",
"dns": {
"additionals_count": 0,
"answers": [
{
"class": "IN",
"data": "2001:4998:124:1507::f001",
"name": "yahoo.com",
"ttl": "114",
"type": "AAAA"
},
{
"class": "IN",
"data": "2001:4998:44:3507::8000",
"name": "yahoo.com",
"ttl": "114",
"type": "AAAA"
},
{
"class": "IN",
"data": "2001:4998:44:3507::8001",
"name": "yahoo.com",
"ttl": "114",
"type": "AAAA"
},
{
"class": "IN",
"data": "2001:4998:24:120d::1:0",
"name": "yahoo.com",
"ttl": "114",
"type": "AAAA"
},
{
"class": "IN",
"data": "2001:4998:124:1507::f000",
"name": "yahoo.com",
"ttl": "114",
"type": "AAAA"
},
{
"class": "IN",
"data": "2001:4998:24:120d::1:1",
"name": "yahoo.com",
"ttl": "114",
"type": "AAAA"
}
],
"answers_count": 1,
"authorities_count": 0,
"flags": {
"authoritative": false,
"recursion_allowed": true,
"recursion_desired": true,
"truncated_response": false
},
"id": 9819,
"op_code": "QUERY",
"question": {
"class": "IN",
"name": "www3.l.google.com",
"type": "A"
},
"response_code": "NOERROR"
},
"ip": "192.168.0.1",
"method": "QUERY",
"port": 53,
"proc": "",
"query": "class IN, type A, www3.l.google.com",
"resource": "www3.l.google.com",
"responsetime": 15,
"server": "",
"status": "OK",
"transport": "udp",
"type": "dns"
}
is currently parsed into these message fields:
{
"packetbeat_beat_name": "example.local",
"packetbeat_bytes_in": 35,
"packetbeat_dns_answers_3_ttl": "114",
"packetbeat_method": "QUERY",
"packetbeat_type": "dns",
"packetbeat_query": "class IN, type A, www3.l.google.com",
"packetbeat_dns_answers_1_name": "yahoo.com",
"packetbeat_dns_answers_4_ttl": "114",
"packetbeat_dns_answers_count": 1,
"source": "example.local",
"packetbeat_dns_answers_3_data": "2001:4998:24:120d::1:0",
"packetbeat_dns_answers_4_type": "AAAA",
"packetbeat_dns_answers_2_ttl": "114",
"packetbeat_direction": "out",
"packetbeat_dns_flags_truncated_response": false,
"packetbeat_dns_answers_5_ttl": "114",
"packetbeat_dns_answers_2_data": "2001:4998:44:3507::8001",
"packetbeat_@metadata_beat": "packetbeat",
"packetbeat_dns_flags_authoritative": false,
"packetbeat_status": "OK",
"packetbeat_dns_answers_5_class": "IN",
"packetbeat_ip": "192.168.0.1",
"packetbeat_dns_answers_1_ttl": "114",
"packetbeat_dns_answers_2_type": "AAAA",
"packetbeat_dns_answers_3_name": "yahoo.com",
"packetbeat_dns_answers_0_name": "yahoo.com",
"packetbeat_dns_flags_recursion_desired": true,
"packetbeat_transport": "udp",
"packetbeat_dns_authorities_count": 0,
"packetbeat_resource": "www3.l.google.com",
"packetbeat_@timestamp": "2016-04-01T00:00:00.000Z",
"packetbeat_dns_answers_2_class": "IN",
"packetbeat_dns_answers_5_type": "AAAA",
"packetbeat_dns_answers_0_ttl": "114",
"packetbeat_dns_question_type": "A",
"packetbeat_dns_answers_1_data": "2001:4998:44:3507::8000",
"packetbeat_dns_id": 9819,
"_id": "e99207b0-f8d8-11ee-b05d-ba5db92f0eae",
"packetbeat_dns_answers_4_data": "2001:4998:124:1507::f000",
"packetbeat_dns_answers_0_data": "2001:4998:124:1507::f001",
"packetbeat_responsetime": 15,
"packetbeat_dns_answers_4_name": "yahoo.com",
"packetbeat_dns_question_name": "www3.l.google.com",
"packetbeat_dns_additionals_count": 0,
"packetbeat_dns_answers_0_type": "AAAA",
"beats_type": "packetbeat",
"packetbeat_dns_answers_3_class": "IN",
"packetbeat_dns_answers_5_data": "2001:4998:24:120d::1:1",
"packetbeat_dns_response_code": "NOERROR",
"packetbeat_client_ip": "192.168.0.10",
"packetbeat_dns_flags_recursion_allowed": true,
"packetbeat_dns_question_class": "IN",
"packetbeat_dns_answers_0_class": "IN",
"packetbeat_client_port": 57935,
"timestamp": "2016-04-01T00:00:00.000Z",
"packetbeat_dns_answers_1_type": "AAAA",
"packetbeat_dns_answers_4_class": "IN",
"packetbeat_dns_op_code": "QUERY",
"packetbeat_bytes_out": 51,
"packetbeat_beat_hostname": "example.local",
"packetbeat_dns_answers": [],
"message": "-",
"packetbeat_dns_answers_1_class": "IN",
"packetbeat_dns_answers_5_name": "yahoo.com",
"packetbeat_count": 1,
"packetbeat_dns_answers_3_type": "AAAA",
"packetbeat_dns_answers_2_name": "yahoo.com",
"packetbeat_port": 53
}
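The flattening shown above can be sketched roughly as follows. This is a minimal Python illustration of the observed output, not the input's actual implementation; the `flatten` helper and the trimmed-down `event` are hypothetical, and the way the empty list is produced (keeping only scalar elements) is an assumption made to match the sample.

```python
def flatten(value, prefix, out):
    """Flatten nested dicts and lists into underscore-joined field names."""
    if isinstance(value, dict):
        for key, child in value.items():
            flatten(child, f"{prefix}_{key}", out)
    elif isinstance(value, list):
        # Assumption: only scalar elements stay in the array field itself,
        # which leaves an array of objects empty, as in the sample output.
        out[prefix] = [v for v in value if not isinstance(v, (dict, list))]
        # Nested elements are broken out with their index appended to the name.
        for index, child in enumerate(value):
            if isinstance(child, (dict, list)):
                flatten(child, f"{prefix}_{index}", out)
    else:
        out[prefix] = value
    return out


# Hypothetical, shortened version of the Packetbeat event above.
event = {
    "dns": {
        "answers": [
            {"class": "IN", "data": "2001:4998:124:1507::f001"},
            {"class": "IN", "data": "2001:4998:44:3507::8000"},
        ],
        "answers_count": 1,
    }
}

fields = flatten(event, "packetbeat", {})
for name in sorted(fields):
    print(name, "=", fields[name])
```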
@ryan-carroll-graylog by removing the source JSON array contents we can't do any additional processing. In this case, we would want to use a `jsonpath()` statement to put all of the answers as values into one field, and possibly the same for the record types, etc. This would be pretty easy to do with `jsonpath()` using the original JSON, but trying to rebuild that data based on the flattened fields would be imprecise and difficult.
If possible, it would be nice to always retain the source data but provide an option to flatten arrays containing JSON, but we at least need to have the original JSON values in the arrays.
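To illustrate the difference, here is a small Python sketch. Plain Python stands in for the `jsonpath()` step, and the `raw` and `flattened` values are shortened, hypothetical versions of the example data, so treat it as an illustration rather than actual pipeline code.

```python
import json
import re

# Hypothetical, shortened original event with the answers array intact.
raw = (
    '{"dns": {"answers": ['
    '{"data": "2001:4998:124:1507::f001", "type": "AAAA"}, '
    '{"data": "2001:4998:44:3507::8000", "type": "AAAA"}]}}'
)

# With the original array, collecting all answers is a single expression
# (roughly what a JSONPath like $.dns.answers[*].data would select).
answers = [answer["data"] for answer in json.loads(raw)["dns"]["answers"]]

# With only the flattened fields, the list has to be rebuilt by matching
# field names and sorting on the index embedded in each name.
flattened = {
    "packetbeat_dns_answers_1_data": "2001:4998:44:3507::8000",
    "packetbeat_dns_answers_0_type": "AAAA",
    "packetbeat_dns_answers_0_data": "2001:4998:124:1507::f001",
    "packetbeat_dns_answers_1_type": "AAAA",
    "packetbeat_dns_answers": [],
}
pattern = re.compile(r"^packetbeat_dns_answers_(\d+)_data$")
indexed = []
for name, value in flattened.items():
    match = pattern.match(name)
    if match:
        indexed.append((int(match.group(1)), value))
rebuilt = [value for _, value in sorted(indexed)]

print(answers)
print(rebuilt)
```

The second approach depends on knowing every field name pattern in advance and on the indices being complete, which is where it gets fragile.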
Ah gotcha, totally makes sense @miwent. From looking at the input, it should be pretty straightforward to add the answers array intact.
I'll see about making the flattening behavior optional across the board in the input config too.
@ryan-carroll-graylog Please ensure that the changes we make don't break or change the behavior of the input for existing users. :slightly_smiling_face:
For sure @bernd, good call out! Will keep a close eye on backwards compatibility.
Do you see any issue with populating the source array fields in addition to flattening them? This would be a change in behavior but strictly additive.
Another option is a new "Flatten JSON" config option that defaults to the current behavior but that could be toggled off to store only top-level fields with their raw JSON objects.
A third option that just occurred to me, and let me know if this would work for you @miwent, is just adding the standard "Store full message" option, defaulting to off. This seems like the least intrusive and most flexible option to me if it provides us what we need for Illuminate.
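A rough Python sketch of how the three options could differ. The flag names `keep_arrays`, `flatten_json`, and `store_full_message` are hypothetical stand-ins for the ideas above, not existing input settings, and the `flatten` helper is a simplified model of the input's behavior.

```python
import json


def flatten(value, prefix, out, keep_arrays=False):
    """Simplified model of the input's flattening (hypothetical helper)."""
    if isinstance(value, dict):
        for key, child in value.items():
            flatten(child, f"{prefix}_{key}", out, keep_arrays)
    elif isinstance(value, list):
        if keep_arrays:
            # Option 1: additionally keep the array as a JSON string.
            out[prefix] = json.dumps(value)
        for index, child in enumerate(value):
            flatten(child, f"{prefix}_{index}", out, keep_arrays)
    else:
        out[prefix] = value
    return out


def to_fields(event, flatten_json=True, keep_arrays=False, store_full_message=False):
    fields = {}
    if flatten_json:
        flatten(event, "packetbeat", fields, keep_arrays)
    else:
        # Option 2: only top-level fields, nested values kept as raw JSON.
        for key, value in event.items():
            nested = isinstance(value, (dict, list))
            fields[f"packetbeat_{key}"] = json.dumps(value) if nested else value
    if store_full_message:
        # Option 3: keep the whole event in a single extra field.
        fields["full_message"] = json.dumps(event)
    return fields


# Hypothetical, shortened event.
event = {"dns": {"answers": [{"data": "a"}], "id": 1}, "type": "dns"}

print(to_fields(event))
print(to_fields(event, keep_arrays=True))
print(to_fields(event, flatten_json=False))
print(to_fields(event, store_full_message=True))
```

With all flags left at their defaults the output matches the flattening-only behavior, which is the backwards-compatible case.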
My concern with the option of storing a full message would be that the beats logs are already quite large, and users may not be happy with what would effectively double the message size. Having the option to store full messages would probably be nice in general, but not having the original fields would still be an issue on the Illuminate side, since I would expect many (if not most) customers to decide not to store the full message. Illuminate would still require that the original fields/values exist in that case.
What do you think about the following? @ryan-carroll-graylog @miwent
We add `#setOriginalSource()` and `#getOriginalSource()` methods to the `Message` class that set and return the original message as bytes or string. In addition to that, we add `has_original_source` and `get_original_source` (with an optional default return value) pipeline functions to check and retrieve the original source value.
In input codecs, we can add the original source to each message object. Not into the message `fields` map, but either as a separate field or the existing metadata map. That way the original source doesn't get indexed into OpenSearch.
The main drawback is that the memory consumption of the in-flight `Message` object increases. We can benchmark that to see how big of an impact that is. We can also add an option to inputs to disable the storage of the original source data in the in-memory message object. (default is enabled)
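As a rough model of the proposal, here is a minimal Python sketch. It is not Graylog code: the `Message` class, the `decode` function, and the `keep_original_source` option are hypothetical stand-ins that only mirror the method and function names suggested above.

```python
import json
from typing import Any, Dict, Optional


class Message:
    """Hypothetical stand-in for the in-flight message object."""

    def __init__(self, fields: Dict[str, Any]) -> None:
        self.fields = dict(fields)  # these are the fields that get indexed
        self._original_source: Optional[bytes] = None  # kept outside `fields`

    def set_original_source(self, source: bytes) -> None:
        self._original_source = source

    def get_original_source(self) -> Optional[bytes]:
        return self._original_source


def has_original_source(message: Message) -> bool:
    """Models the proposed has_original_source pipeline function."""
    return message.get_original_source() is not None


def get_original_source(message: Message, default: Optional[bytes] = None) -> Optional[bytes]:
    """Models the proposed get_original_source pipeline function."""
    source = message.get_original_source()
    return default if source is None else source


def decode(payload: bytes, keep_original_source: bool = True) -> Message:
    """Hypothetical codec step; storing the source is enabled by default."""
    event = json.loads(payload)
    message = Message({"packetbeat_type": event.get("type")})
    if keep_original_source:
        message.set_original_source(payload)
    return message


payload = b'{"type": "dns", "dns": {"answers": [{"data": "::1"}]}}'
message = decode(payload)
if has_original_source(message):
    original = json.loads(get_original_source(message))
    print([answer["data"] for answer in original["dns"]["answers"]])
```

Because the source lives next to the `fields` map rather than inside it, a rule can still read the untouched array while nothing extra is indexed.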
Would that help with the packetbeat use case?
/cc @Graylog2/architecture
@bernd that's a really awesome idea! I think it would solve a lot of the issues we've run into with Illuminate where we've needed the full message and had to resort to using the `full_message` field, which as @miwent pointed out can be a pain point with users.
Definitely would be a different way of doing things on the Illuminate parsing side so curious on your thoughts @miwent.
/cc @kingzacko1 @danotorrey for y'all's thoughts too.
@bernd I second that, it would work for Packetbeat and I could see it being useful in other scenarios.
@miwent what is the timeline that a solution is needed for this?
Since the scope of @bernd's proposal is bigger than what we had allotted for this issue for this cycle (we should get @Graylog2/architecture buy-in, sync up on how this needs to work for Illuminate to use the pipeline rules, and plan extra dedicated testing), the team was thinking we need to push the `original source` changes to a later cycle.
Just want to make sure that doesn't interfere with any planned Illuminate releases. If it does, maybe explore some short term solutions.
/cc @rich-graylog @waab76
We already have a temporary workaround in place, it's not ideal but it does allow us to be patient with this.
Created a new issue to track Bernd's `originalSource` idea. Added it to the TDIR cycle 3 board "Waiting Prioritization" @rich-graylog but feel free to move it.
Packetbeat fields that contain a JSON array of JSON objects have the contained JSON objects removed, resulting in an empty field.
Expected Behavior
Packetbeat fields that contain a JSON array of JSON objects retain those values. The field should at least be stored as a string in order to allow parsing and processing with the JSON processing tools in the pipeline.
Current Behavior
Packetbeat DNS logs contain a field `packetbeat_dns_answers` that is a list of the records that are captured when a DNS query+response are logged. This field is logged in Graylog as an empty list:
Configuring Packetbeat to output to the console shows that the field does have a value though:
I tested this by adding a field in the Packetbeat config:
And the field was added but it was also added as an empty array:
Possible Solution
Steps to Reproduce (for bugs)
`packetbeat_dns_answers` field
Context
The DNS answers are among the more meaningful bits of information. It is almost impossible to reconstruct this information reliably from the remaining fields that are indexed.
Your Environment