tjbaker opened this issue 8 years ago
@tjbaker, I can understand your thought process here. I struggled with this at first too, because our existing log and event shipping framework is built with Apache Flume, where we get some automatic "wrappers" built into all the messages.
Fundamentally, the difference here is that the Amazon Kinesis Agent doesn't actually modify any of your data; it just sends it as-is. This means you need to change your event/log-generating tool to include the local hostname, and then you're golden.
For what it's worth, we switched our Syslog-NG config to write out its logs in JSON format (an easy change). Once we have it in JSON format in Kinesis, we run a Lambda that parses the data, wraps it in some fancy wrapper-stuff, and then ships it to Firehose and Elasticsearch.
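That syslog-ng change can be as small as a JSON-formatting template on the destination. A minimal sketch, assuming syslog-ng 3.x with the `$(format-json)` template function; the file path is illustrative, and `s_src` stands in for whatever source your config already defines:

```
# Write each message as one JSON object per line. The rfc5424 scope
# includes the HOST macro, so the sending hostname rides along in
# every record without touching the applications themselves.
destination d_json {
    file("/var/log/app/events.json"
         template("$(format-json --scope rfc5424)\n"));
};

log { source(s_src); destination(d_json); };
```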
(Send me a message if you want more info about that. We're still writing it, but it's running over a billion events a day right now and we plan to open source it.)
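The parse-and-wrap step described above can be sketched in a few lines of Python. This is only an illustration of the shape of such a Lambda, not the actual pipeline; the wrapper field names are assumptions, and using the Kinesis partition key as the source identifier is likewise an assumption:

```python
import base64
import json


def wrap_record(raw: bytes, source: str) -> dict:
    """Parse one JSON log line and wrap it with routing metadata."""
    return {"source": source, "event": json.loads(raw)}


def handler(event, context):
    """Entry point for a Kinesis-triggered Lambda.

    Kinesis delivers record data base64-encoded; decode it, parse the
    JSON the log shipper wrote, and attach metadata before forwarding.
    """
    wrapped = []
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        source = record["kinesis"].get("partitionKey", "unknown")
        wrapped.append(wrap_record(raw, source))
    # ...ship `wrapped` on to Firehose / Elasticsearch here...
    return {"processed": len(wrapped)}
```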
In cases where you own/control the app, augmenting the log events with the host seems reasonable. Unfortunately, there are cases where you do not have the ability to modify event logging, such as when the app is from a third party, isn't configurable, etc. Writing some process to touch every line in the log file to prepend host info is really hacky.
For the process in question, we have to use a combination of dataProcessingOptions/customFieldNames/matchPattern. This works just fine, though it does seem silly to have to specify one of the supported logFormat values even when you're overriding it with matchPattern, but that is another matter. It would be great if something like customFieldNames allowed you to inject a key/value pair into the event for the stream. I could see using {instance_id} here.
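For reference, a flow using that combination might look something like the following agent.json fragment. The file pattern, stream name, regex, and field names here are hypothetical placeholders, not a real config:

```json
{
  "flows": [
    {
      "filePattern": "/var/log/myapp/*.log",
      "deliveryStream": "my-delivery-stream",
      "dataProcessingOptions": [
        {
          "optionName": "LOGTOJSON",
          "logFormat": "COMMONAPACHELOG",
          "matchPattern": "^(\\S+) (\\S+) \\[([^\\]]+)\\] (\\d{3})$",
          "customFieldNames": ["client", "process", "timestamp", "status"]
        }
      ]
    }
  ]
}
```

Note that customFieldNames only names the groups captured by matchPattern from the log line itself, which is exactly the limitation discussed above: there is no way to inject a value, like the hostname, that isn't already in the line.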
For background, we are using Splunk's universal forwarder, which includes host info automatically. My first AWS agent foray was with the CWL Agent, and its ability to include the hostname in the log stream, which ends up as a field on the event in ES, is very similar to what you get in Splunk. I'm now kicking the tires on Kinesis Agent and Kinesis Analytics trying to replicate alerting, and am bumping into this issue of not knowing where the event originated...
Fair enough -- in the case where you're having the agent do string parsing on each line, I could see adding in host information. Just curious, what app are you using that writes out logs without the ability to add in a hostname or customize the log format in some way?
Any progress on this issue? We also want to know how to add the hostname and customize fields before sending the raw data to the stream.
Bumping this. I just submitted a PR to add arbitrary JSON metadata to the record. Let me know if this is something you guys would be interested in. It's not quite automatic, but much more extensible as you can put whatever JSON you want in the metadata field.
@zacharya Thanks for following up on this and submitting the PR! I'll take a look ASAP.
@tjbaker, how did you work around this to add a custom field using dataProcessingOptions/customFieldNames/matchPattern? What I actually need is to prepend each record with a number or a string to indicate which flow it comes from. Is there a way to do it in the agent? Either way would work to separate flows in the Lambda downstream...
The CloudWatch Logs Agent provides predefined variables to retain info about the host sending the log data.
http://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html
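For comparison, the CWL Agent lets you embed those predefined variables directly in awslogs.conf; the file path and log group name below are illustrative:

```ini
[/var/log/app.log]
file = /var/log/app.log
log_group_name = /myapp/logs
; {hostname}, {instance_id}, and {ip_address} are predefined variables
log_stream_name = {hostname}
```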
I see no way to retain sending-host information for data forwarded to Firehose via the Kinesis Agent. Our use case sends the Firehose output to Elasticsearch, and we have no way of knowing which host generated the monitored log. We want to know where the logs came from. Is there a mechanism similar to what the CWL Agent provides that I am overlooking?
Thanks, Trevor