samcday opened this issue 8 years ago (status: Open)
Agreed on the lack of back off. It actually retries so often that it takes over the CPU and ends up reducing throughput to Kinesis.
No, I think this is still a bug. Retrying a remote request with zero delay and no exponential back-off is a definite no-no.
Exponential back off is something we're looking at. Unfortunately it doesn't mesh well with the time based retry system that the KPL uses. Exponential back off, when combined with jitter, will cause records to expire in unexpected ways.
As a sort of informal poll, which do people prefer:
I think an easy way to mitigate your concerns with record TTL is to simply allow users to configure the maximum exponential backoff. That way, if I've configured my record TTL to be something like 5 minutes and my max backoff to be 1 minute, I can be sure that the record will get retried a reasonable number of times before it expires.
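The capped-backoff-with-jitter idea from the comment above could be sketched roughly like this. All names and constants here are hypothetical illustrations, not KPL code; the point is that a cap keeps the worst-case delay small relative to the record TTL, so a 5-minute TTL still yields many attempts:

```java
import java.util.concurrent.ThreadLocalRandom;

/** Sketch of capped exponential backoff with full jitter (hypothetical, not KPL code). */
public class BackoffSketch {
    static final long BASE_MS = 100;            // initial backoff
    static final long MAX_BACKOFF_MS = 60_000;  // user-configured cap, e.g. 1 minute

    /** Upper bound on the delay before the given retry attempt. */
    static long capForAttempt(int attempt) {
        // Clamp the shift so the multiplication cannot overflow.
        return Math.min(MAX_BACKOFF_MS, BASE_MS * (1L << Math.min(attempt, 30)));
    }

    /** Actual delay: "full jitter" picks uniformly in [0, cap]. */
    static long delayForAttempt(int attempt) {
        return ThreadLocalRandom.current().nextLong(capForAttempt(attempt) + 1);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 12; i++) {
            System.out.println("attempt " + i + " -> up to " + capForAttempt(i) + " ms");
        }
    }
}
```

With these numbers the cap is reached around attempt 10, so a record with a 5-minute TTL would still see on the order of a dozen or more attempts before expiring, rather than a handful of very long waits.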
@samcday I like the idea of using the TTL of the record to calculate exponential back off.
We're still investigating this, and prioritizing this with other customer requests.
For others who might be impacted could you please add a reaction or response to assist us in prioritizing the work.
Thanks
@pfifer was this bug/potential improvement ever done?
BTW, there is a conflict in the documentation. The FailIfThrottled comment here says there is backoff: https://github.com/awslabs/amazon-kinesis-producer/blob/master/java/amazon-kinesis-producer-sample/default_config.properties. But this page says there is no backoff algorithm: https://docs.aws.amazon.com/streams/latest/dev/kinesis-producer-adv-retries-rate-limiting.html.
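For reference, the two settings in question look roughly like this in the sample default_config.properties (the values shown are my recollection of the defaults; verify against the linked file for your KPL version):

```properties
# If true, puts that would be throttled fail immediately instead of being retried.
FailIfThrottled = false

# Maximum time in milliseconds a record may spend buffered/retrying
# before it is dropped with an expiration error.
RecordTtl = 30000
```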
As to preference, I'd sure love at least a config setting that adds a sleep after receiving a ThroughputExceeded error. Right now, KPL goes to 100% CPU and starts starving my app of cycles, leading to app errors. This is really hard to account for (allocate half a CPU for KPL in the rare case I exceed my shard allocation?), and it makes the app unstable. It is also hard to adjust the rate limiter, because our app is elastic: I never know how many instances are currently sharing a shard, so I can't specify a rate limit percentage. Some kind of dynamic backoff is required. Maybe you could halve the rate limiter for that shard each time there is a throughput error (down to some limit), and reset it to the configured value each time a put succeeds.
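The halve-on-throttle idea proposed above is essentially multiplicative decrease. A minimal sketch of that state machine, with hypothetical names (this is not a KPL API), might look like:

```java
/** Sketch of an adaptive per-shard rate limit: halve on throttling,
 *  reset on success (hypothetical, not part of KPL). */
public class AdaptiveRateLimit {
    private final long configuredLimit; // records/sec the user configured
    private final long floor;           // lower bound so puts never stop entirely
    private long currentLimit;

    public AdaptiveRateLimit(long configuredLimit, long floor) {
        this.configuredLimit = configuredLimit;
        this.floor = floor;
        this.currentLimit = configuredLimit;
    }

    /** Called when a put hits a throughput-exceeded error. */
    public void onThrottled() {
        currentLimit = Math.max(floor, currentLimit / 2);
    }

    /** Called on a successful put; snap back to the configured value. */
    public void onSuccess() {
        currentLimit = configuredLimit;
    }

    public long limit() {
        return currentLimit;
    }
}
```

One design note: resetting straight to the configured limit on the first success (rather than growing additively) matches the suggestion in the comment above, but it can oscillate when many elastic instances share one shard; a slower additive recovery would be gentler at the cost of throughput.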
We were observing some pretty insane amounts of network throughput hitting the reverse proxy we have in front of kinesis.us-west-1.amazonaws.com. It turns out that our stream was saturated. As soon as we resharded, things settled down dramatically. I can only take this to mean that when KPL runs into ProvisionedThroughputExceededExceptions it simply starts retrying in a tight loop? Shouldn't there be some kind of backoff period (ideally configurable)?