[kinesis go connector] Record throughput less than expected going to kinesis

zackwine commented 4 years ago

Bug Report

Describe the bug

When fluentbit is configured to tail and process Kubernetes logs, then forward to kinesis, we have found that it performs about the same as fluentd with a similar setup.

Fluentd will start to drop logs at about 1833 log lines per second.
Fluentbit will start to drop logs at about 2000 log lines per second.

NOTE: Our application log lines average about 1.1K per log line with 20-30 fields per log line.

We transitioned our Kubernetes log forwarder from fluentd to fluentbit, based on published benchmarks comparing fluentd to fluentbit we expected much better performance/throughput. What we are finding is that the memory requirements are much lower for fluentbit, but the CPU requirements are only slightly lower. Admittedly we have heavy lua parsing of each log, but we run the same/similar parsing in fluentd.

To Reproduce

Steps to reproduce the problem:

Configure a logging pipeline similar to the following:

tail (fstat)-> docker parser -> lua parsing -> kinesis

Then send 2000 log lines per second of 1.1K size to a log file. In our test we send 500 log lines per second to four separate files to more closely simulate our use case.

Expected behavior

Forward logs without data loss at rates higher than 2000 log lines per second.

Screenshots

In an effort to understand where the bottleneck was I ran perf on fluentbit with our load test. See the results below:

Screen Shot 2020-05-05 at 5 31 55 PM

This shows a significant amount of time spent in the kinesis output, lua parsers, and input. That all makes sense, but I've been trying to understand why the mpack_discard function is so high up on the list.

Your Environment

Version used: 1.4.1 Configuration: See below Environment name and version: Kubernetes 1.7 Operating System and version: Distroless container image on top of Centos 7.6.1810 kernel Filters and plugins: See below Our use case is a fluenbit-daemonset that forwards logs from many varying components:

pod -> docker-json-driver -> fluenbit-daemonset -> kinesis This is an snippet of our config:

[SERVICE]
  # Flush
  # =====
  # Set an interval of seconds before to flush records to a destination
  Flush        1

  # Daemon
  # ======
  # Instruct Fluent Bit to run in foreground or background mode.
  Daemon       Off

  # Log_Level
  # =========
  # Set the verbosity level of the service, values can be:
  #
  # - error
  # - warning
  # - info
  # - debug
  # - trace
  #
  # By default 'info' is set, that means it includes 'error' and 'warning'.
  Log_Level    info

  # Parsers_File
  # ============
  # Specify an optional 'Parsers' configuration file
  Parsers_File parsers.conf
  Parsers_File parsers-custom.conf
  Plugins_File plugins.conf

  # HTTP Server
  # ===========
  # Enable/Disable the built-in HTTP Server for metrics
  HTTP_Server  On
  HTTP_Listen  0.0.0.0
  HTTP_PORT    24231

  # Define stream processors
  #Streams_File stream_processor.conf

[INPUT]
  Name              tail
  Tag               kube.*
  Path              /var/log/containers/*.log
  Parser            docker
  DB                /var/log/containers/flb_kube.db
  Buffer_Chunk_Size 512K
  Buffer_Max_Size   5120K
  Mem_Buf_Limit     51200K
  Skip_Long_Lines   On
  Refresh_Interval  2
  Rotate_Wait       10
  Exclude_Path      /var/log/containers/lmm-fluentd-daemonset-kinesis*,/var/log/containers/lmm-fluentbit-daemonset*

[FILTER]
  Name                 kubernetes
  Match                kube.*
  Kube_URL             https://kubernetes.default.svc:443
  Kube_CA_File         /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  Kube_Token_File      /var/run/secrets/kubernetes.io/serviceaccount/token
  Kube_Tag_Prefix      kube.var.log.containers.

# Drop logs missing the annotation logForwardStdOut_<container_name>: true
[FILTER]
  Name      lua
  Match     kube.*
  script    /lua/forward-opt-in.lua
  call      forward_opt_in
  Alias     lua.forward_opt_in

# Decorate logs from containers
[FILTER]
  Name   modify
  Match  kube.*
  Add    dc sturgill
  Add    logsource container-log
  Alias  modify.decorate

# Parse Logs
[FILTER]
  Name      lua
  Match     kube.*
  script    /lua/logging-standard.lua
  call      parse_logging_standard
  Alias     lua.parse_logging_standard

# Prune fields
[FILTER]
  Name record_modifier
  Match kube.*
  Remove_key kubernetes
  Remove_key stream
  Remove_key docker_timestamp
  Remove_key logtime
  Alias   record_modifier.prune

[FILTER]
  Name                     rewrite_tag
  Match                    kube.*
  Rule  $loglane           ^.*$ $loglane.$TAG false
  emitter_mem_buf_limit    50M

[OUTPUT]
  Name            kinesis
  Match           infra.*
  region          us-east-1
  stream          sturgill-infra
  append_newline  true
  Alias           kinesis.infra
#... Other kinesis outputs

Additional context

Our platform supports forwarding logs from many different applications running in the same Kubernetes cluster. We are currently scale testing fluentbit as a Daemonset to understand the limitations of how many records/second we can sustain. In this case we would like to be able to optimize fluentbit for the most throughput.

We are looking for guidance in optimizing performance of the daemonset. This has brought up a few questions:

Are there plans to add multithread support to fluentbit? Any suggestions to improve the performance?

fwerner commented 4 years ago

mpack_discard seems to be so high up the list because the incoming msgpack records are once fully consumed to generate the in_records metric. Disabling metrics should reduce CPU usage by at least 18% in your scenario.

I'm new to the whole fluent eco system – what struck me is that there generally seems to be no framing applied to the msgpack streams, is this correct? This performance penalty is an immediate result of this (and the more sensible choice to keep the filter callbacks simple and not force them to count the in/out records).

zackwine commented 4 years ago

@edsiper @fujimotos Any input on this?

edsiper commented 4 years ago

hi, thanks for the detailed report.

The filtering routing is forcing count records multiple times when multiple filters exists, I think there is a way to optimize that .

work in process.

edsiper commented 4 years ago

There are a couple of questions in the issue:

1. what is mpack_discard() doing ?

that function is used inside flb_mp_count() to count the total number of records in a Chunk. every record is serialized in a chunk for performance reasons (note that chunks records can be altered).

2. missing records

If I am not wrong you are using the Golang connector for Kinesis. When Fluent Bit pass control to the Golang code, Fluent Bit is out of context and will not work until the Golang function returns, there is a downside with this: if the golang code takes too much time you might miss a file rotation or certain events.

Right now the Amazon team is migrating their Golang plugins to C plugins to avoid this situation.

3. Threading

We will implement threaded input plugins but there is no ETA at the moment. Note that this will only guarantee to be able to consume/compute data faster before ingestion only.

edsiper commented 4 years ago

Performance troubleshooting

First of all THANK YOU for the detailed perf report :)

I was able to reproduce similar scenario locally, I tested with a log file of 10 million records and 4 chained filters (2 modify, record_modifier and lua).

After a little modification I was able to improve the filtering cycle by 2.3%, this was done since we can avoid one call to flb_mp_count() in the filtering cycle, see the attached changes:

perf

The first pane (up) it's the current code of GIT master. the function flb_filter_do() takes 71.46% of the whole process; with the little modification in the second pane (down) the same functions takes 69.16%.

It's not a big change but an improvement. I will continue digging on how to reduce it even more.

zackwine commented 4 years ago

@edsiper Thanks for the detailed responses and continued support.

edsiper commented 4 years ago

Now we are at 5.90% total improvement:

71.46% -> 69.16% -> %65.56

new

edsiper commented 4 years ago

I've submitted some changes on GIT master, if you can build it an test its performance would be awesome.

The changes in place are about the reduction of the need to invoke flb_mp_count() every time.

zackwine commented 4 years ago

I'll pull the tip. Thanks for the quick turn around.

Right now the Amazon team is migrating their Golang plugins to C plugins to avoid this situation.

Where can I find that work? Is the C plugin publicly available?

zackwine commented 4 years ago

@edsiper Thanks for these changes. I'm now seeing throughput over 2200 ll/sec before records are occasionally dropped.

That's about a 10% increase in throughput.

edsiper commented 4 years ago

@zackwine thanks for the info.

are records being dropped due to retry exhaustion or similar?

zackwine commented 4 years ago

@edsiper I suspect logs are dropped during file rotations of Kubernetes. I see the test sustain higher throughputs for ~5 mins until the buffers fill in fluentbit and the input starts falling behind, then I start seeing dropped records. I see the CPU usage bouncing off 1 vCPU throughout the test. Given the input is falling behind constantly since fluentbit has exhausted the single CPU core, file rotations seem to be the likely cause of dropped records.

I suspect the kinesis output plugin is providing backpressure due to the retry logic as well: https://github.com/aws/amazon-kinesis-firehose-for-fluent-bit/issues/23

This may be contributing to the periods when the CPU usage drops off full usage of a single vCPU.

edsiper commented 4 years ago

thanks for the explanation, makes sense.

cc: @PettitWesley

PettitWesley commented 4 years ago

@zackwine @edsiper

We're working on rewriting the Go plugins in C to improve performance and resource utilization. Right now, the plan is:

Amazon ElasticSearch Auth support in the existing ES plugin [in-progress]
CloudWatch Logs plugin in C [in-progress]
New plugin for Amazon S3 in C
Re-write Firehose plugin in C
Re-write Kinesis plugin in C

1 and 2 have PRs up already, and will be released in Fluent Bit 1.5. I can't say exactly how long it will take for the rest- I'm hoping they'll all be done by September, but no guarantees.

In terms of the specific issue here; I have some questions:

I suspect the kinesis output plugin is providing backpressure due to the retry logic as well: aws/amazon-kinesis-firehose-for-fluent-bit#23

I need to fix that.. but it will only be triggered if you hit errors. 2000 records per second is the expected max throughput for a 2 shard stream IIRC- are you seeing any throttling errors or any API errors? How many shards do you have?

Please read the "My experience using FireLens: reliability and recommendations" section here: https://aws.amazon.com/blogs/containers/under-the-hood-firelens-for-amazon-ecs-tasks/

I tested the CloudWatch Logs and Firehose Go plugins in that post. The scenario was different than you have here, but they were able to keep up with a few thousand log records per second.

I have seen issue with Fluent Bit getting behind tailing a log file using Go Plugins- but not until around 10,000 records per second. How many log files are you reading?

As Eduardo mentioned, the Go plugins screw up Fluent Bit's concurrency model, which might cause you to miss file rotations, you can read about its concurrency model here if you are curious: https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md#concurrency

zackwine commented 4 years ago

@PettitWesley Thanks for the update on the C plugin. I'm interested in contributing to that project once it is public.

I need to fix that.. but it will only be triggered if you hit errors. 2000 records per second is the expected max throughput for a 2 shard stream IIRC- are you seeing any throttling errors or any API errors? How many shards do you have?

The load test sends sample logs from one of our applications. These logs average about 1.1K in size with about 20-30 fields (each varies slightly). These logs are sent at the rates mentioned above 2200 records in this case to a kinesis stream with 4 shards. I think fluentbit gets occasional back-pressure from kinesis, because I see these messages intermittently in the fluentbit logs:

time="2020-05-11T15:42:56Z" level=warning msg="[kinesis 2] 178 records failed to be delivered. Will retry.\n"
time="2020-05-11T15:42:56Z" level=warning msg="[kinesis 2] 186 records failed to be delivered. Will retry.\n"
time="2020-05-11T15:42:57Z" level=warning msg="[kinesis 2] 33 records failed to be delivered. Will retry.\n"
time="2020-05-11T15:43:00Z" level=warning msg="[kinesis 2] 64 records failed to be delivered. Will retry.\n"
time="2020-05-11T15:43:02Z" level=warning msg="[kinesis 2] 309 records failed to be delivered. Will retry.\n"
time="2020-05-11T15:43:02Z" level=warning msg="[kinesis 2] 336 records failed to be delivered. Will retry.\n"
time="2020-05-11T15:43:02Z" level=warning msg="[kinesis 2] 157 records failed to be delivered. Will retry.\n"
time="2020-05-11T15:43:03Z" level=warning msg="[kinesis 2] 261 records failed to be delivered. Will retry.\n"
time="2020-05-11T15:43:03Z" level=warning msg="[kinesis 2] 136 records failed to be delivered. Will retry.\n"

These messages are usually followed by a dip in the CPU usage of the fluentbit-daemonset.

I have seen issue with Fluent Bit getting behind tailing a log file using Go Plugins- but not until around 10,000 records per second. How many log files are you reading?

The test sends logs from 4 Kubernetes containers (so 4 separate log files).

For comparison to your numbers in the "under the hood" link we are sending 1126 * 2200 * 60 ~= 150MB per minute in aggregate to all 4 log files. Further our fluentbit configuration includes some parsing/routing logic in lua that is consuming around 10-20% of the CPU cycles (depending on log content). Both the larger average log size, and additional parsing accounts for some of the difference in throughput limit, but maybe not all?

PettitWesley commented 4 years ago

@zackwine Set the env var FLB_LOG_LEVEL to the value debug to enable debug logging in the Go plugin. Then it will print the actual error message for each failed record: https://github.com/aws/amazon-kinesis-streams-for-fluent-bit/blob/master/kinesis/kinesis.go#L298

PettitWesley commented 4 years ago

I'm going to work on removing the bad exponential backoff code from the Go plugins this week or next..

PettitWesley commented 4 years ago

@zackwine I'm very interesting in the debug log output for Fluent Bit.

I do suspect that the stupid exponential backoff code I wrote is contributing to this; I've put up pull requests to remove it from all the AWS go plugins. We will hopefully do a patch release of AWS for Fluent Bit in the next week to push out this fix. Apologies!

zackwine commented 4 years ago

I removed the backoff from the plugin, and I've been running scale tests today. I do see quite an improvement in throughput. I'm not running with the recent changes from @edsiper, but I'm pushing 2300 log lines per second without drops. Still running tests to find the limit. I'll run the next test with debug enabled.

zackwine commented 4 years ago

After enabling debug, I no longer see the messages records failed to be delivered. Also it appears fluentbit is dropping far more logs.

I assume the intermittent periods where we were hitting the back-pressure no longer occur since more resources are spent outputting debug messages?

PettitWesley commented 4 years ago

@zackwine

I assume the intermittent periods where we were hitting the back-pressure no longer occur since more resources are spent outputting debug messages?

Interesting... it is possible... in native Fluent Bit plugins, logging is handled in a separate thread, in the Go plugins, its in the same thread... may be it slows it down. I do not know enough about these sorts of things to know whether this is a reasonable hypothesis or not.

I have had an idea though which I think I will try out this weekend. I could introduce a "high performance mode" in the plugins where instead of synchronously processing and sending records, everything happens in a goroutine. This means that when data is flushed, the plugin would return execution to Fluent Bit immediately. The downside is that the plugin would not be able to accurately report errors and retries; the built-in metrics endpoint and configurable retry mechanism wouldn't work.

If I build that I will send you the code so you can test it out..

PettitWesley commented 4 years ago

@zackwine Please run your test-suite on the code here (which uses goroutines): https://github.com/aws/amazon-kinesis-streams-for-fluent-bit/pull/28

zackwine commented 4 years ago

@PettitWesley Ok, I should have results by this evening

zackwine commented 4 years ago

I do see the fluentd pod using more than one vCPU, but the pod is continously crashing with this error:

panic: reflect: call of reflect.Value.Index on int64 Value

goroutine 1494 [running]:
reflect.Value.Index(0x7fef10a0f2c0, 0xc001ad4008, 0x86, 0x0, 0x0, 0x7fef0ff3bee9, 0xc00121e000)
    /usr/local/go/src/reflect/value.go:966 +0x1c8
github.com/fluent/fluent-bit-go/output.GetRecord(0xc001bc16c0, 0x0, 0x1f4, 0xc001220000, 0xc000074f18)
    /go/pkg/mod/github.com/fluent/fluent-bit-go@v0.0.0-20190925192703-ea13c021720c/output/decoder.go:80 +0x106
main.pluginConcurrentFlush(0x2, 0x7feeeb804c00, 0x14c8f68, 0x7fef11496080, 0x0)
    /kinesis-build/amazon-kinesis-streams-for-fluent-bit/fluent-bit-kinesis.go:152 +0x227
main.flushWithRetries(0x2, 0x7feeeb804c00, 0x7fef014c8f68, 0x7fef11496080, 0x2)
    /kinesis-build/amazon-kinesis-streams-for-fluent-bit/fluent-bit-kinesis.go:126 +0x65
created by main.FLBPluginFlushCtx
    /kinesis-build/amazon-kinesis-streams-for-fluent-bit/fluent-bit-kinesis.go:120 +0x66

@PettitWesley The last test run in the graph above is running with your goroutines branch, and it was sending ~2800 records per second in between pod crashes. I will look into a fix for that issue with reflection. This looks promising.

PettitWesley commented 4 years ago

@zackwine Sorry... I tested the code but I guess I didn't test it enough.

I have not been able to reproduce the crash, but I have a guess as to what the problem is. The conversion of the C data structure for the records to go data structures happens in the goroutine. That probably isn't safe; once the Flush returns Fluent Bit probably frees that memory. I need to convert everything into Go-land structures first, then spin off the goroutine.

PettitWesley commented 4 years ago

I updated the code in my pull request; that issue should be fixed. I'm dealing with a different bug now; some of the records are turning up in Kinesis as null for some reason. Will keep working on this tomorrow..

Edit/update: Found the cause of the new bug... should have this working soon.

Dhiraj240 commented 4 years ago

@edsiper Can I take up the issue to do some work?

edsiper commented 4 years ago

@Dhiraj240 sorry, I did not understand, do you mean do a fix or something ?

Dhiraj240 commented 4 years ago

@edsiper Yes, I have installed fluent-bit on my local system and had applied for community bridge for this project, hence I asked to work on so that I can have more understanding before mentees are announced by you.

edsiper commented 4 years ago

@Dhiraj240 for any comments about community bridge please use #2147

PettitWesley commented 4 years ago

@zackwine Sorry for the delay; check out the updated code in the pull request: https://github.com/aws/amazon-kinesis-streams-for-fluent-bit/pull/28

I tested it fairly thoroughly this time. The only outstanding issue is that since the goroutines handle retries, I need to make them exponentially backoff. But for the purposes of testing your use case, this should be good enough. Looking forward to seeing the results.

zackwine commented 4 years ago

@PettitWesley Thanks for the update. This update produced much better results. I can now send about ~2650 log lines per second with the goroutines branch. If I push beyond that rate I see quite a few dropped logs.

A bit about our test before the results For each test run there is unique sessionId per container, and there are 4 containers that send 400K log lines. Below is dashboard showing a run at ~2500 ll/s, ~2800 ll/s, then 2650 ll/s.

Screen Shot 2020-05-22 at 12 03 13 PM

As you can see in Elastic we have dropped quite a few logs in the ~2800 ll/s run. I'm trying to understand where the drops are occurring. I've been adding debug to the plugin to see if the records are dropped before they reach the output, so I'll update you if I find any more details there.

Another issue: One thing I have also noticed if we get backpressure from kinesis (exceed provisioned write throughput), then this output can duplicate records. This duplication behavior isn't limited to your goroutines branch. Also even if the kinesis shard count is scaled quite high I do see the occasional response indicating we have exceeded the provisioned write throughput, but they are quite rare.

PettitWesley commented 4 years ago

this output can duplicate records

This unfortunately is known behavior, with no perfect solution at the moment.

In the non-goroutine case, Fluent Bit might send the plugin a chunk of say, 1000 log events in one invocation, which requires two 500 record Put requests to Kinesis. One might succeed and one might fail. The plugin will then tell Fluent Bit to retry the whole batch of data, and when it succeeds you'll have duplicates.

In the goroutine case I can actually fix this behavior. The Go plugin manages retries itself, so it could more intelligently discard records once they succeed and only retry the failed batches. I'm unfortunately juggling multiple projects right now, but if this goroutine option works for your use case, I'll make sure fixing it up, add exponential backoff, and fix the retries and making it fully prod-ready is prioritized.

PettitWesley commented 4 years ago

4 containers that send 400K log lines

@zackwine Wait- did I read that right- the log lines themselves are 400 kilobytes? If I'm understanding correctly than this issue makes a lot more sense... I have thoroughly tested the AWS Go plugins up to ~8,000 log lines per second, but the log lines were very very small (like 100 bytes).

~2500 ll/s, ~2800 ll/s, then 2650 ll/s.

Just to be clear- this is total throughput right? (As opposed to 2500 lines/s per container times 4 containers)

zackwine commented 4 years ago

@PettitWesley sorry that was confusing.

We sent 400k log lines per container with an average log size of 1k. I send a specific number of log lines to easily detect drops and duplicates.

The rate of logs is for the aggregate of all 4 containers. So 2650 ll/s is the sum of all 4 containers.

I appreciate the support. I’ll help you make this feature production ready.

PettitWesley commented 4 years ago

I just had one idea that might either help with the problem or might help you determine the cause. If your Kubernetes cluster is using Docker, I believe that means it is using the JSON File Log driver. You can set the log file size and the number of files that will be written before rotation. If toggling those settings reduces log loss, then that would prove Fluent Bit is missing log rotations. If it doesn't, then something else must be at play..

I’ll help you make this feature production ready.

Awesome! 💪

zackwine commented 4 years ago

I updated max-size from 100m to 600m and restarted docker. It doesn't seem to make a difference in the number of logs dropped. I still see logs start dropping at the same rate as before.

zackwine commented 4 years ago

@PettitWesley I added a metric/debug to count the records the output plugin receives. I was able to verify all records are reaching the output plugin, but some are dropped at higher rates. I'll continue to debug.

zackwine commented 4 years ago

@edsiper @PettitWesley Thanks for the support. I'm going to close this now that the concurrency feature has been merged: https://github.com/aws/amazon-kinesis-streams-for-fluent-bit/pull/33

fluent / fluent-bit

[kinesis go connector] Record throughput less than expected going to kinesis #2159

Bug Report