Closed zackwine closed 4 years ago
mpack_discard
seems to be so high up the list because the incoming msgpack records are once fully consumed to generate the in_records
metric. Disabling metrics should reduce CPU usage by at least 18% in your scenario.
I'm new to the whole fluent eco system – what struck me is that there generally seems to be no framing applied to the msgpack streams, is this correct? This performance penalty is an immediate result of this (and the more sensible choice to keep the filter callbacks simple and not force them to count the in/out records).
@edsiper @fujimotos Any input on this?
hi, thanks for the detailed report.
The filtering routing is forcing count records multiple times when multiple filters exists, I think there is a way to optimize that .
work in process.
There are a couple of questions in the issue:
1. what is mpack_discard() doing ?
that function is used inside flb_mp_count() to count the total number of records in a Chunk. every record is serialized in a chunk for performance reasons (note that chunks records can be altered).
2. missing records
If I am not wrong you are using the Golang connector for Kinesis. When Fluent Bit pass control to the Golang code, Fluent Bit is out of context and will not work until the Golang function returns, there is a downside with this: if the golang code takes too much time you might miss a file rotation or certain events.
Right now the Amazon team is migrating their Golang plugins to C plugins to avoid this situation.
3. Threading
We will implement threaded input plugins but there is no ETA at the moment. Note that this will only guarantee to be able to consume/compute data faster before ingestion only.
Performance troubleshooting
First of all THANK YOU for the detailed perf
report :)
I was able to reproduce similar scenario locally, I tested with a log file of 10 million records and 4 chained filters (2 modify, record_modifier and lua).
After a little modification I was able to improve the filtering cycle by 2.3%, this was done since we can avoid one call to flb_mp_count() in the filtering cycle, see the attached changes:
The first pane (up) it's the current code of GIT master. the function flb_filter_do() takes 71.46% of the whole process; with the little modification in the second pane (down) the same functions takes 69.16%.
It's not a big change but an improvement. I will continue digging on how to reduce it even more.
@edsiper Thanks for the detailed responses and continued support.
Now we are at 5.90% total improvement:
71.46% -> 69.16% -> %65.56
I've submitted some changes on GIT master, if you can build it an test its performance would be awesome.
The changes in place are about the reduction of the need to invoke flb_mp_count() every time.
I'll pull the tip. Thanks for the quick turn around.
Right now the Amazon team is migrating their Golang plugins to C plugins to avoid this situation.
Where can I find that work? Is the C plugin publicly available?
@edsiper Thanks for these changes. I'm now seeing throughput over 2200 ll/sec
before records are occasionally dropped.
That's about a 10% increase in throughput.
@zackwine thanks for the info.
are records being dropped due to retry exhaustion or similar?
@edsiper I suspect logs are dropped during file rotations of Kubernetes. I see the test sustain higher throughputs for ~5 mins until the buffers fill in fluentbit and the input starts falling behind, then I start seeing dropped records. I see the CPU usage bouncing off 1 vCPU throughout the test. Given the input is falling behind constantly since fluentbit has exhausted the single CPU core, file rotations seem to be the likely cause of dropped records.
I suspect the kinesis output plugin is providing backpressure due to the retry logic as well: https://github.com/aws/amazon-kinesis-firehose-for-fluent-bit/issues/23
This may be contributing to the periods when the CPU usage drops off full usage of a single vCPU.
thanks for the explanation, makes sense.
cc: @PettitWesley
@zackwine @edsiper
We're working on rewriting the Go plugins in C to improve performance and resource utilization. Right now, the plan is:
1 and 2 have PRs up already, and will be released in Fluent Bit 1.5. I can't say exactly how long it will take for the rest- I'm hoping they'll all be done by September, but no guarantees.
In terms of the specific issue here; I have some questions:
I suspect the kinesis output plugin is providing backpressure due to the retry logic as well: aws/amazon-kinesis-firehose-for-fluent-bit#23
I need to fix that.. but it will only be triggered if you hit errors. 2000 records per second is the expected max throughput for a 2 shard stream IIRC- are you seeing any throttling errors or any API errors? How many shards do you have?
Please read the "My experience using FireLens: reliability and recommendations" section here: https://aws.amazon.com/blogs/containers/under-the-hood-firelens-for-amazon-ecs-tasks/
I tested the CloudWatch Logs and Firehose Go plugins in that post. The scenario was different than you have here, but they were able to keep up with a few thousand log records per second.
I have seen issue with Fluent Bit getting behind tailing a log file using Go Plugins- but not until around 10,000 records per second. How many log files are you reading?
As Eduardo mentioned, the Go plugins screw up Fluent Bit's concurrency model, which might cause you to miss file rotations, you can read about its concurrency model here if you are curious: https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md#concurrency
@PettitWesley Thanks for the update on the C plugin. I'm interested in contributing to that project once it is public.
I need to fix that.. but it will only be triggered if you hit errors. 2000 records per second is the expected max throughput for a 2 shard stream IIRC- are you seeing any throttling errors or any API errors? How many shards do you have?
The load test sends sample logs from one of our applications. These logs average about 1.1K in size with about 20-30 fields (each varies slightly). These logs are sent at the rates mentioned above 2200 records in this case to a kinesis stream with 4 shards. I think fluentbit gets occasional back-pressure from kinesis, because I see these messages intermittently in the fluentbit logs:
time="2020-05-11T15:42:56Z" level=warning msg="[kinesis 2] 178 records failed to be delivered. Will retry.\n"
time="2020-05-11T15:42:56Z" level=warning msg="[kinesis 2] 186 records failed to be delivered. Will retry.\n"
time="2020-05-11T15:42:57Z" level=warning msg="[kinesis 2] 33 records failed to be delivered. Will retry.\n"
time="2020-05-11T15:43:00Z" level=warning msg="[kinesis 2] 64 records failed to be delivered. Will retry.\n"
time="2020-05-11T15:43:02Z" level=warning msg="[kinesis 2] 309 records failed to be delivered. Will retry.\n"
time="2020-05-11T15:43:02Z" level=warning msg="[kinesis 2] 336 records failed to be delivered. Will retry.\n"
time="2020-05-11T15:43:02Z" level=warning msg="[kinesis 2] 157 records failed to be delivered. Will retry.\n"
time="2020-05-11T15:43:03Z" level=warning msg="[kinesis 2] 261 records failed to be delivered. Will retry.\n"
time="2020-05-11T15:43:03Z" level=warning msg="[kinesis 2] 136 records failed to be delivered. Will retry.\n"
These messages are usually followed by a dip in the CPU usage of the fluentbit-daemonset.
I have seen issue with Fluent Bit getting behind tailing a log file using Go Plugins- but not until around 10,000 records per second. How many log files are you reading?
The test sends logs from 4 Kubernetes containers (so 4 separate log files).
For comparison to your numbers in the "under the hood" link we are sending 1126 * 2200 * 60
~= 150MB per minute in aggregate to all 4 log files. Further our fluentbit configuration includes some parsing/routing logic in lua that is consuming around 10-20% of the CPU cycles (depending on log content). Both the larger average log size, and additional parsing accounts for some of the difference in throughput limit, but maybe not all?
@zackwine Set the env var FLB_LOG_LEVEL
to the value debug
to enable debug logging in the Go plugin. Then it will print the actual error message for each failed record: https://github.com/aws/amazon-kinesis-streams-for-fluent-bit/blob/master/kinesis/kinesis.go#L298
I'm going to work on removing the bad exponential backoff code from the Go plugins this week or next..
@zackwine I'm very interesting in the debug log output for Fluent Bit.
I do suspect that the stupid exponential backoff code I wrote is contributing to this; I've put up pull requests to remove it from all the AWS go plugins. We will hopefully do a patch release of AWS for Fluent Bit in the next week to push out this fix. Apologies!
I removed the backoff from the plugin, and I've been running scale tests today. I do see quite an improvement in throughput. I'm not running with the recent changes from @edsiper, but I'm pushing 2300 log lines per second without drops. Still running tests to find the limit. I'll run the next test with debug enabled.
After enabling debug, I no longer see the messages records failed to be delivered
. Also it appears fluentbit is dropping far more logs.
I assume the intermittent periods where we were hitting the back-pressure no longer occur since more resources are spent outputting debug messages?
@zackwine
I assume the intermittent periods where we were hitting the back-pressure no longer occur since more resources are spent outputting debug messages?
Interesting... it is possible... in native Fluent Bit plugins, logging is handled in a separate thread, in the Go plugins, its in the same thread... may be it slows it down. I do not know enough about these sorts of things to know whether this is a reasonable hypothesis or not.
I have had an idea though which I think I will try out this weekend. I could introduce a "high performance mode" in the plugins where instead of synchronously processing and sending records, everything happens in a goroutine. This means that when data is flushed, the plugin would return execution to Fluent Bit immediately. The downside is that the plugin would not be able to accurately report errors and retries; the built-in metrics endpoint and configurable retry mechanism wouldn't work.
If I build that I will send you the code so you can test it out..
@zackwine Please run your test-suite on the code here (which uses goroutines): https://github.com/aws/amazon-kinesis-streams-for-fluent-bit/pull/28
@PettitWesley Ok, I should have results by this evening
I do see the fluentd pod using more than one vCPU, but the pod is continously crashing with this error:
panic: reflect: call of reflect.Value.Index on int64 Value
goroutine 1494 [running]:
reflect.Value.Index(0x7fef10a0f2c0, 0xc001ad4008, 0x86, 0x0, 0x0, 0x7fef0ff3bee9, 0xc00121e000)
/usr/local/go/src/reflect/value.go:966 +0x1c8
github.com/fluent/fluent-bit-go/output.GetRecord(0xc001bc16c0, 0x0, 0x1f4, 0xc001220000, 0xc000074f18)
/go/pkg/mod/github.com/fluent/fluent-bit-go@v0.0.0-20190925192703-ea13c021720c/output/decoder.go:80 +0x106
main.pluginConcurrentFlush(0x2, 0x7feeeb804c00, 0x14c8f68, 0x7fef11496080, 0x0)
/kinesis-build/amazon-kinesis-streams-for-fluent-bit/fluent-bit-kinesis.go:152 +0x227
main.flushWithRetries(0x2, 0x7feeeb804c00, 0x7fef014c8f68, 0x7fef11496080, 0x2)
/kinesis-build/amazon-kinesis-streams-for-fluent-bit/fluent-bit-kinesis.go:126 +0x65
created by main.FLBPluginFlushCtx
/kinesis-build/amazon-kinesis-streams-for-fluent-bit/fluent-bit-kinesis.go:120 +0x66
@PettitWesley The last test run in the graph above is running with your goroutines branch, and it was sending ~2800 records per second in between pod crashes. I will look into a fix for that issue with reflection. This looks promising.
@zackwine Sorry... I tested the code but I guess I didn't test it enough.
I have not been able to reproduce the crash, but I have a guess as to what the problem is. The conversion of the C data structure for the records to go data structures happens in the goroutine. That probably isn't safe; once the Flush returns Fluent Bit probably frees that memory. I need to convert everything into Go-land structures first, then spin off the goroutine.
I updated the code in my pull request; that issue should be fixed. I'm dealing with a different bug now; some of the records are turning up in Kinesis as null for some reason. Will keep working on this tomorrow..
Edit/update: Found the cause of the new bug... should have this working soon.
@edsiper Can I take up the issue to do some work?
@Dhiraj240 sorry, I did not understand, do you mean do a fix or something ?
@edsiper Yes, I have installed fluent-bit
on my local system and had applied for community bridge for this project, hence I asked to work on so that I can have more understanding before mentees are announced by you.
@Dhiraj240 for any comments about community bridge please use #2147
@zackwine Sorry for the delay; check out the updated code in the pull request: https://github.com/aws/amazon-kinesis-streams-for-fluent-bit/pull/28
I tested it fairly thoroughly this time. The only outstanding issue is that since the goroutines handle retries, I need to make them exponentially backoff. But for the purposes of testing your use case, this should be good enough. Looking forward to seeing the results.
@PettitWesley Thanks for the update. This update produced much better results. I can now send about ~2650 log lines per second with the goroutines branch. If I push beyond that rate I see quite a few dropped logs.
A bit about our test before the results For each test run there is unique sessionId per container, and there are 4 containers that send 400K log lines. Below is dashboard showing a run at ~2500 ll/s, ~2800 ll/s, then 2650 ll/s.
As you can see in Elastic we have dropped quite a few logs in the ~2800 ll/s run. I'm trying to understand where the drops are occurring. I've been adding debug to the plugin to see if the records are dropped before they reach the output, so I'll update you if I find any more details there.
Another issue: One thing I have also noticed if we get backpressure from kinesis (exceed provisioned write throughput), then this output can duplicate records. This duplication behavior isn't limited to your goroutines branch. Also even if the kinesis shard count is scaled quite high I do see the occasional response indicating we have exceeded the provisioned write throughput, but they are quite rare.
this output can duplicate records
This unfortunately is known behavior, with no perfect solution at the moment.
In the non-goroutine case, Fluent Bit might send the plugin a chunk of say, 1000 log events in one invocation, which requires two 500 record Put requests to Kinesis. One might succeed and one might fail. The plugin will then tell Fluent Bit to retry the whole batch of data, and when it succeeds you'll have duplicates.
In the goroutine case I can actually fix this behavior. The Go plugin manages retries itself, so it could more intelligently discard records once they succeed and only retry the failed batches. I'm unfortunately juggling multiple projects right now, but if this goroutine option works for your use case, I'll make sure fixing it up, add exponential backoff, and fix the retries and making it fully prod-ready is prioritized.
4 containers that send 400K log lines
@zackwine Wait- did I read that right- the log lines themselves are 400 kilobytes? If I'm understanding correctly than this issue makes a lot more sense... I have thoroughly tested the AWS Go plugins up to ~8,000 log lines per second, but the log lines were very very small (like 100 bytes).
~2500 ll/s, ~2800 ll/s, then 2650 ll/s.
Just to be clear- this is total throughput right? (As opposed to 2500 lines/s per container times 4 containers)
@PettitWesley sorry that was confusing.
We sent 400k log lines per container with an average log size of 1k. I send a specific number of log lines to easily detect drops and duplicates.
The rate of logs is for the aggregate of all 4 containers. So 2650 ll/s is the sum of all 4 containers.
I appreciate the support. I’ll help you make this feature production ready.
I just had one idea that might either help with the problem or might help you determine the cause. If your Kubernetes cluster is using Docker, I believe that means it is using the JSON File Log driver. You can set the log file size and the number of files that will be written before rotation. If toggling those settings reduces log loss, then that would prove Fluent Bit is missing log rotations. If it doesn't, then something else must be at play..
I’ll help you make this feature production ready.
Awesome! 💪
I updated max-size from 100m to 600m and restarted docker. It doesn't seem to make a difference in the number of logs dropped. I still see logs start dropping at the same rate as before.
@PettitWesley I added a metric/debug to count the records the output plugin receives. I was able to verify all records are reaching the output plugin, but some are dropped at higher rates. I'll continue to debug.
@edsiper @PettitWesley Thanks for the support. I'm going to close this now that the concurrency
feature has been merged: https://github.com/aws/amazon-kinesis-streams-for-fluent-bit/pull/33
Bug Report
Describe the bug
When fluentbit is configured to tail and process Kubernetes logs, then forward to kinesis, we have found that it performs about the same as fluentd with a similar setup.
1833 log lines per second
.2000 log lines per second
.NOTE: Our application log lines average about 1.1K per log line with 20-30 fields per log line.
We transitioned our Kubernetes log forwarder from fluentd to fluentbit, based on published benchmarks comparing fluentd to fluentbit we expected much better performance/throughput. What we are finding is that the memory requirements are much lower for fluentbit, but the CPU requirements are only slightly lower. Admittedly we have heavy lua parsing of each log, but we run the same/similar parsing in fluentd.
To Reproduce
Configure a logging pipeline similar to the following:
Then send 2000 log lines per second of 1.1K size to a log file. In our test we send 500 log lines per second to four separate files to more closely simulate our use case.
Expected behavior
Forward logs without data loss at rates higher than 2000 log lines per second.
Screenshots
In an effort to understand where the bottleneck was I ran perf on fluentbit with our load test. See the results below:
This shows a significant amount of time spent in the kinesis output, lua parsers, and input. That all makes sense, but I've been trying to understand why the
mpack_discard
function is so high up on the list.Your Environment
Version used: 1.4.1 Configuration: See below Environment name and version: Kubernetes 1.7 Operating System and version: Distroless container image on top of Centos 7.6.1810 kernel Filters and plugins: See below Our use case is a fluenbit-daemonset that forwards logs from many varying components:
pod -> docker-json-driver -> fluenbit-daemonset -> kinesis This is an snippet of our config:
Additional context
Our platform supports forwarding logs from many different applications running in the same Kubernetes cluster. We are currently scale testing fluentbit as a Daemonset to understand the limitations of how many records/second we can sustain. In this case we would like to be able to optimize fluentbit for the most throughput.
We are looking for guidance in optimizing performance of the daemonset. This has brought up a few questions:
Are there plans to add multithread support to fluentbit? Any suggestions to improve the performance?