awslabs / aws-fluent-plugin-kinesis

Amazon Kinesis output plugin for Fluentd
Apache License 2.0

Optimizing fluentd and the kinesis plugin to handle large amounts of log data #190

Closed chungath closed 5 years ago

chungath commented 5 years ago

Hello,

We have a fluentd aggregator running in Amazon ECS inside a Docker container. We also have microservices deployed to Amazon ECS which forward their logs to this fluentd aggregator using the fluentd agent installed on the EC2 machines (this is basically a configuration in the ECS tasks to send logs to the aggregator). The setup is similar to the one explained here.

The Fluentd docker image used is fluent/fluentd:v1.5-1.

The Fluentd aggregator uses the fluent-plugin-kinesis plugin to send the logs to a Kinesis stream. The Kinesis stream is configured with 4 shards (4 shards were enough when we had a CloudWatch integration instead of fluentd, so for now we are trying the same number with the fluentd setup). The configuration parameters for the kinesis plugin are roughly as follows:
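This is a representative sketch rather than the exact configuration; the tag pattern, stream name, region, and buffer values are placeholders:

    <match app.**>
      @type kinesis_streams
      stream_name my-stream             # placeholder
      region us-east-1                  # placeholder
      retries_on_batch_request 8        # plugin default
      <buffer>
        flush_interval 1s               # illustrative; the memory buffer is the default
      </buffer>
    </match>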

We have also configured an S3 output plugin to back up the data in S3.

Now, the problem is that when a large number of incoming log events per second reach the fluentd aggregator, we see a high number of the following error from the kinesis output plugin:

ProvisionedThroughputExceededException/Rate exceeded for shard shardId-00000000000x in stream xxxx

From the Kinesis documentation: a single Kinesis shard can ingest up to 1 MiB of data per second or 1,000 records per second for writes. When this limit is exceeded, Kinesis throws the ProvisionedThroughputExceededException above. With 4 shards, the stream can therefore absorb at most about 4 MiB/s or 4,000 records/s in aggregate, and only if writes are spread evenly across the shards.

We tried adding more retries (retries_on_batch_request) in the kinesis output plugin to work around this. By default the plugin retries 8 times; we increased it to 20. That then produced the following error when there was a surge in log data:

error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data"

It seems that because so much data was being retried, the fluentd worker processes were holding too much data in memory (queued buffer chunks), causing this error: buffer space has too many data.

Then we also tried using a file buffer instead of the memory buffer:

      <buffer>
        @type file
        path /var/log/fluent/kinesis
        flush_interval 1s
        chunk_limit_size 1m              # each buffer chunk up to 1 MiB
        flush_thread_interval 0.1
        flush_thread_burst_interval 0.01
        flush_thread_count 15            # 15 concurrent flush threads per worker
      </buffer>

This one gave the following errors:

#3 unexpected error on reading data host="x.x.x.x" port=xxx error_class=Fluent::Plugin::Buffer::BufferOverflowError error="can't create buffer metadata for /var/log/fluent/kinesis/worker3/buffer.*.log. Stop creating buffer files: error = No file descriptors available @ rb_sysopen - /var/log/fluent/kinesis/worker3/buffer.b59782a133f87668c9afec32218555f0c.log.meta"

#3 unexpected error while checking flushed chunks. ignored. error_class=RuntimeError error="can't enqueue buffer file and failed to restore. This may causes inconsistent state: path = /var/log/fluent/s3/worker3/buffer.b597829d1765431f51f5889c565a1bd15.log, error = 'No file descriptors available @ rb_sysopen - /var/log/fluent/s3/worker3/buffer.q597829d1765431f51f5889c565a1bd15.log', retry error = 'No file descriptors available @ rb_sysopen - /var/log/fluent/s3/worker3/buffer.b597829d1765431f51f5889c565a1bd15.log'"

Here the errors appeared not just for the kinesis file buffering; we also got errors for the s3 buffers (there were no such errors in s3 before we switched the kinesis buffer to a file buffer). Given below is the s3 buffer configuration:

    <buffer serviceName,time>
      @type file
      path /var/log/fluent/s3
      timekey 60 # 1 minute partition
      timekey_wait 5s
      timekey_use_utc true # use utc
    </buffer>

Question 1: It seems we are doing something wrong with the file buffer configuration here. Please let us know how to sort this out.

Question 2: What is the right configuration to make the kinesis output plugin support such a large number of log events, or do we have to increase the number of shards in the Kinesis stream?

Please let us know.

simukappu commented 5 years ago

Hi, Thank you for your inquiry.

A1: The Fluentd buffer configuration, including the file buffer, is not configuration of this plugin; it is Fluentd configuration. Please see Config: Buffer Section in the Fluentd documentation.
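For example, a minimal buffer section along the lines of that document looks roughly like the following; the values are only illustrative, and total_limit_size and overflow_action are the parameters that govern the "buffer space has too many data" condition:

    <buffer>
      @type file
      path /var/log/fluent/kinesis
      chunk_limit_size 8m          # size of each buffer chunk
      total_limit_size 512m        # total buffer space before BufferOverflowError
      overflow_action block        # or throw_exception / drop_oldest_chunk
      flush_interval 1s
      flush_thread_count 4
    </buffer>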

A2: I think you should try increasing the shard count. 4 is not a large shard count for a high volume of log events.

chungath commented 5 years ago

Thanks for the response. We will increase the shard count.

In this case, should we be using kinesis_streams_aggregated instead of kinesis_streams to handle the high load?

simukappu commented 5 years ago

It depends on the use case. The KPL aggregated format used by kinesis_streams_aggregated may be more cost effective, but you have to de-aggregate the records in your consumer. Please see the following documents:
https://docs.aws.amazon.com/streams/latest/dev/kinesis-producer-adv-aggregation.html
https://docs.aws.amazon.com/streams/latest/dev/kinesis-kpl-consumer-deaggregation.html
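For reference, switching on the producer side is roughly just a change of output type; the tag pattern, stream name, and region below are placeholders:

    <match app.**>
      @type kinesis_streams_aggregated
      stream_name my-stream    # placeholder
      region us-east-1         # placeholder
    </match>

Consumers would then need the KCL or one of the deaggregation libraries described in the documents above to unpack the aggregated records.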

simukappu commented 5 years ago

Closing this issue for now. Please reopen if required.