aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image

Specifying multiple log_key for cloudwatch_logs #299

Open arunz87 opened 2 years ago

arunz87 commented 2 years ago

Describe the question/issue

@PettitWesley - I have a use case wherein my Python code expects a custom log format. The code worked fine until records in CloudWatch started hitting the hard limit of 256 KB per log event. That sent me on a quest to reduce the record size by trimming the fields that are unnecessarily large and not required for analysis. After some reading, I arrived at the neat little configuration below that does the trimming, except that the cloudwatch_logs plugin doesn't seem to offer a way to send the remaining fields in the expected format. Earlier, log_key took the default 'log' value, which worked well. Now that I trim the fields with a parser, I am unable to regenerate the original log format in CloudWatch. Is there a way to work around this?

Configuration

#########parser.conf############
[PARSER]
    Name   myparser
    Format regex
    Regex  ^(?<field1>[^ ]*) (?<field2>[^ ]*) (?<field3>[^ ]*) (?<field4>[^ ]*) (?<field5>(")) (?<field6>[^"]*) (?<field7>("))

PS: field5 and field7 correspond to quotes which are later removed in the filter

#########fluentbit.conf##########
[SERVICE]
    Flush        5
    Daemon       off
    Parsers_file parser.conf
    Log_Level    debug

[INPUT]
    Name            tail
    Tag             foo
    Path            /var/tmp/foo.log
    Path_Key        filename
    Skip_Long_Lines off

[FILTER]
    Name     parser
    Match    foo
    Key_Name log
    Parser   myparser

[FILTER]
    Name   modify
    Match  foo
    Remove field5
    Remove field7
    Set    field6 ""

[OUTPUT]
    Name            cloudwatch_logs
    Match           foo
    region          us-east-1
    log_group_name  my_log_group
    log_stream_name my_log_stream
    log_key         field1 field2 field3 field4 field6
    log_format      json/emf

Fluent Bit Log Output

The error observed is:

[2022/02/14 05:30:41] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Could not find log_key 'field1 field2 field3 field4 field5' in record

Fluent Bit Version Info

Which AWS for Fluent Bit Versions have you tried?

2.10.1

Steps to reproduce issue

  1. Use the configuration above to start the fluentbit v2.10.1 container on an EC2 instance.
  2. Custom log file looks like this:

     "EVENT_1" "123" 1304509890 "10000" "1 0 0 hostname1.domain.com hostname2.domain.com hostname3.domain.com hostname4.domain.com hostname5.domain.com hostname6.domain.com hostname7.domain.com "
     "EVENT_2" "12" 1304509900 "123000" "2 0 0 hostname3.domain.com hostname4.domain.com hostname5.domain.com hostname6.domain.com "
     "EVENT_3" "123355" 1304509919 "1400" "5 0 0 hostname1.domain.com hostname4.domain.com hostname5.domain.com hostname6.domain.com hostname7.domain.com "
  3. Desired output in the CloudWatch log stream looks like:

     "EVENT_1" "123" 1304509890 "10000" ""
     "EVENT_2" "12" 1304509900 "123000" ""
     "EVENT_3" "123355" 1304509919 "1400" ""
PettitWesley commented 2 years ago

So if I understand correctly, your input logs are lines that contain a series of space-delimited fields:

field1 field2 field3 field4 field5 field6 field7

Right now, you are parsing this log value, which means you get a JSON log like:

{
    "field1": val,
    "field2": val,
    ...
}

However, then you remove some of the keys. (Why? Why not just take all the data to CW?)

And you want the output in CW to not be JSON, but to just be the selected original fields in a line again?

field1 field2 field3 field4

This is what you want, right? And so you came to the log_key option, since it takes a JSON log and sends just the string value.

So in this case... we don't support what you want in the CW plugin... I also kind of feel like maybe this is a more generic FB use case. You want to 'un-parse' your logs and take them from JSON back to just a string, which FB doesn't support.

Oh wait... I just realized the Kinesis and Firehose Go plugins do have this feature; they call it data_keys. https://github.com/aws/amazon-kinesis-streams-for-fluent-bit

Which is implemented with this code:

  1. https://github.com/aws/amazon-kinesis-firehose-for-fluent-bit/blob/mainline/plugins/plugins.go#L178
  2. https://github.com/aws/amazon-kinesis-firehose-for-fluent-bit/blob/mainline/firehose/firehose.go#L254

So the easiest solution here would be to add the data_keys feature to the CloudWatch Go plugin: https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit
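For illustration, here is a minimal sketch of the kind of key filtering data_keys performs, assuming the map[interface{}]interface{} records that Fluent Bit hands to Go output plugins; the function name and details are hypothetical, not code taken from the linked plugins:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// filterDataKeys keeps only the listed keys in a decoded Fluent Bit
// record, mirroring data_keys behavior: the surviving key/value pairs
// are re-encoded (here as JSON) and sent to the destination.
func filterDataKeys(record map[interface{}]interface{}, dataKeys string) map[string]interface{} {
	keep := make(map[string]bool)
	for _, k := range strings.Split(dataKeys, ",") {
		keep[strings.TrimSpace(k)] = true
	}
	out := make(map[string]interface{})
	for k, v := range record {
		name := fmt.Sprintf("%v", k)
		if keep[name] {
			out[name] = v
		}
	}
	return out
}

func main() {
	record := map[interface{}]interface{}{
		"field1": "EVENT_1",
		"field2": "123",
		"field5": `"`,
	}
	b, _ := json.Marshal(filterDataKeys(record, "field1, field2"))
	fmt.Println(string(b)) // {"field1":"EVENT_1","field2":"123"}
}
```

Note that the key names survive in the output, which is exactly the gap pointed out in the follow-up below.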

This is kind of a niche request... given the other things I have on my plate, I can't prioritize it right now. However, it would be very easy for you to build this feature yourself in the Go plugin and send us a PR. I recommend doing that.

https://github.com/aws/aws-for-fluent-bit#developing-features-in-the-aws-plugins

arunz87 commented 2 years ago

Thanks @PettitWesley for your prompt response.

data_keys is close to what I wanted, but not quite what I was looking for. I don't want the keys and values sent to Kinesis (or CloudWatch in my use case), just the "values", so that I get space-delimited fields as output.

The reason I cannot send the whole record to CW is the sheer size of the log line, which exceeds the allowable limit (256 KB) in CW. The line gets truncated by CW, so I need to trim it before analyzing it further.
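To make the difference concrete, a values-only variant might look like the sketch below; the joinValues function, the ordered key list, and the space delimiter are illustrative assumptions, not an existing plugin option:

```go
package main

import (
	"fmt"
	"strings"
)

// joinValues rebuilds a plain-text log line from a decoded record by
// taking the values of the listed keys, in order, and joining them
// with spaces -- dropping the key names entirely.
func joinValues(record map[interface{}]interface{}, keys []string) string {
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		if v, ok := record[k]; ok {
			parts = append(parts, fmt.Sprintf("%v", v))
		}
	}
	return strings.Join(parts, " ")
}

func main() {
	record := map[interface{}]interface{}{
		"field1": `"EVENT_1"`, "field2": `"123"`, "field3": "1304509890",
		"field4": `"10000"`, "field6": `""`,
	}
	// Prints: "EVENT_1" "123" 1304509890 "10000" ""
	fmt.Println(joinValues(record, []string{"field1", "field2", "field3", "field4", "field6"}))
}
```

This reproduces the desired CloudWatch output shown in the reproduction steps above.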

PettitWesley commented 2 years ago

@arunz87 I see/makes sense.

So we have two options here:

  1. Keep this ticket open as a feature request. I can't prioritize time on it right now as we have many other bugs and features that need to be worked on first.
  2. Implementing this feature in the Go plugin still wouldn't be too hard, even if you have to build a new function to do this, so you're welcome to submit a PR.