fluent-plugins-nursery / fluent-plugin-cloudwatch-logs

CloudWatch Logs Plugin for Fluentd
MIT License

sequence_token=nil for some streams #201

Closed lonnix closed 4 years ago

lonnix commented 4 years ago

Problem

Only one of my apps is not posting logs to CloudWatch. Everything is running on Kubernetes with fluentd as a DaemonSet, and I see logs in CloudWatch for every app except this one.

Steps to replicate

I cannot provide detailed replication steps because it works for every app but one. I have killed/restarted the fluentd pod on the same node as the app multiple times. Each time, the fluentd logs show that the log file was picked up and is being tailed. The app is the nginx-ingress-controller installed with Helm (https://github.com/helm/charts/tree/master/stable/nginx-ingress). No matter what steps I take, this app's logs never reach CloudWatch. I have used the stdout filter to print events immediately before the cloudwatch plugin, and I can confirm that the events are there, properly formatted, and tagged.

output config

    <label @NORMAL>
      <match **>
        @type cloudwatch_logs
        @id out_cloudwatch_logs_containers
        region "#{ENV.fetch('REGION')}"
        log_group_name_key my_log_group
        log_stream_name_key stream_name
        remove_log_stream_name_key true
        auto_create_stream true
        <buffer>
          flush_interval 1
          chunk_limit_size 1m
          queued_chunks_limit_size 32
          retry_forever true
        </buffer>
      </match>
    </label>

from the debug logs

2020-08-07 16:48:00 +0000 [debug]: #0 [out_cloudwatch_logs_containers] [Aws::CloudWatchLogs::Client 200 0.021137 0 retries] create_log_group(log_group_name:"/eks/nonprod/default/nginx-ingress-controller",tags:nil)

2020-08-07 16:48:00 +0000 [debug]: #0 [out_cloudwatch_logs_containers] [Aws::CloudWatchLogs::Client 200 0.012457 0 retries] describe_log_streams(log_group_name:"/eks/nonprod/default/nginx-ingress-controller",log_stream_name_prefix:"nginx-ingress-controller-744c6b44-hqbsf_default_nginx-ingress-controller-12345")

2020-08-07 16:48:00 +0000 [debug]: #0 [out_cloudwatch_logs_containers] [Aws::CloudWatchLogs::Client 200 0.020277 0 retries] create_log_stream(log_group_name:"/eks/nonprod/default/nginx-ingress-controller",log_stream_name:"nginx-ingress-controller-744c6b44-hqbsf_default_nginx-ingress-controller-12345")

2020-08-07 16:48:00 +0000 [debug]: #0 [out_cloudwatch_logs_containers] Called PutLogEvents API group="/eks/nonprod/default/nginx-ingress-controller" stream="nginx-ingress-controller-744c6b44-hqbsf_default_nginx-ingress-controller-12345" events_count=5 events_bytesize=9589 sequence_token=nil thread=69888292818600 request_sec=0.009042728

Expected Behavior or What you need to ask

I expect to see logs in CloudWatch from all of my apps.

Using Fluentd and CloudWatchLogs plugin versions

I am using the fluent/fluentd-kubernetes-daemonset:v1.11-debian-cloudwatch-1 image from Docker Hub as-is, without any modifications.

lonnix commented 4 years ago

I was informed that the sequence token is not needed for the first call, so here is another attempt that also shows sequence_token=nil. This one occurred almost an hour and a half after the calls in the previous post, with no crashing, restarting, or manual intervention in between.

2020-08-07 18:04:26 +0000 [debug]: #0 [out_cloudwatch_logs_containers] Called PutLogEvents API group="/eks/nonprod/default/nginx-ingress-controller" stream="nginx-ingress-controller-744c6b44-hqbsf_default_nginx-ingress-controller-12345" events_count=1 events_bytesize=1887 sequence_token=nil thread=69888254281320 request_sec=0.011911046

lonnix commented 4 years ago

I figured (at least part of) it out. The first call never succeeded, so every call was the "first attempt", which never needs a sequence token. The reason the call failed, though, seems like a bug to me.

The log that I'm writing has a timestamp in it:

{"time":"2020-08-10T15:18:58+00:00", "host": .......}

If I change the name of that field to something else, like not_time, I start seeing events in CloudWatch. Why does a field that the app writes affect whether messages are sent?

Is this a bug in the plugin, where it is possibly using that time field somewhere and failing because the value is not in the format CloudWatch expects (epoch time)?

cosmo0920 commented 4 years ago

This issue is caused by a timestamp format mismatch:

{"time":"2020-08-10T15:18:58+00:00", "host": .......}

should be parsed with time_format %Y-%m-%dT%H:%M:%S%z, not the default time_format %Y-%m-%dT%H:%M:%S.%NZ.

For your timestamp format, this time_format line should be:

time_format %Y-%m-%dT%H:%M:%S%z

Otherwise, the format mismatch raises an exception:

irb> Time.strptime("2020-08-10T15:18:58+00:00","%Y-%m-%dT%H:%M:%S.%NZ")
Traceback (most recent call last):
        5: from /Users/cosmo/.rbenv/versions/2.6.3/bin/irb:23:in `<main>'
        4: from /Users/cosmo/.rbenv/versions/2.6.3/bin/irb:23:in `load'
        3: from /Users/cosmo/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
        2: from (irb):3
        1: from /Users/cosmo/.rbenv/versions/2.6.3/lib/ruby/2.6.0/time.rb:445:in `strptime'
ArgumentError (invalid date or strptime format - `2020-08-10T15:18:58+00:00' `%Y-%m-%dT%H:%M:%S.%NZ')

%Y-%m-%dT%H:%M:%S%z parses it correctly:

irb(main):005:0> Time.strptime("2020-08-10T15:18:58+00:00","%Y-%m-%dT%H:%M:%S%z")
=> 2020-08-10 15:18:58 +0000
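To make the failure mode concrete, here is a small standalone Ruby sketch (not plugin code; the helper name and constant are made up for illustration) that tries the default format first and falls back to the offset format, mirroring what a correct time_format setting achieves:

```ruby
require "time"

# Hedged sketch: try the plugin's default time format first, then fall
# back to the offset format the nginx log line actually uses.
FORMATS = [
  "%Y-%m-%dT%H:%M:%S.%NZ", # default: sub-second precision, literal "Z"
  "%Y-%m-%dT%H:%M:%S%z",   # nginx-style: numeric UTC offset like +00:00
].freeze

def parse_event_time(str)
  FORMATS.each do |fmt|
    return Time.strptime(str, fmt)
  rescue ArgumentError
    next # this format did not match; try the next one
  end
  nil # no format matched
end

puts parse_event_time("2020-08-10T15:18:58+00:00")
# => 2020-08-10 15:18:58 +0000
```

With only the default format configured, the ArgumentError above is what every parse attempt hits, so events never make it to the PutLogEvents call with a usable timestamp.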
lonnix commented 4 years ago

I will give this a try. Is time_format a setting for the cloudwatch plugin, or does it belong in the <parse> section of the <source> block?

Also, what causes this glitch?

lonnix commented 4 years ago

@cosmo0920 Is the glitch you mentioned consistent to any degree, or is it random?

cosmo0920 commented 4 years ago

@cosmo0920 Is the glitch you mentioned consistent to any degree, or is it random?

I'm not sure, but CloudWatch Logs requires a consistent timestamp format.

cosmo0920 commented 4 years ago

I will give this a try. Is time_format a field for the cloudwatch plugin or is that supposed to be on the <parse> of the <source> block?

Also, what causes this glitch?

Yes, the <parse> section with time_format belongs in the <source> block.
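For example, a minimal sketch of a matching source block (the path, tag, and @type json parser are assumptions based on a typical Kubernetes container-log tail setup, not your actual config):

```
<source>
  @type tail
  path /var/log/containers/*.log
  tag kubernetes.*
  <parse>
    @type json
    time_key time
    time_format %Y-%m-%dT%H:%M:%S%z
  </parse>
</source>
```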

lonnix commented 4 years ago

I switched to fluent-bit to use its cloudwatch plugin, so this issue can be closed.

cosmo0920 commented 4 years ago

OK, Closing.