awslabs / amazon-kinesis-producer

Amazon Kinesis Producer Library
Apache License 2.0

No exponential backoff? #35

Open samcday opened 8 years ago

samcday commented 8 years ago

We were observing some pretty insane amounts of network throughput hitting the reverse proxy we have in front of kinesis.us-west-1.amazonaws.com:

[image: graph of network throughput at the reverse proxy]

It turns out that our stream was saturated. As soon as we resharded, things settled down dramatically. I can only take this to mean that when the KPL runs into a ProvisionedThroughputExceededException it simply starts retrying in a tight loop? Shouldn't there be some kind of backoff period (ideally configurable)?

rdifalco commented 8 years ago

Agreed on the lack of backoff. It actually retries so often that it takes over the CPU and ends up reducing throughput to Kinesis.

samcday commented 7 years ago

No, I think this is still a bug. Retrying a remote request with zero retry delay and no exponential backoff is a definite no-no.

pfifer commented 7 years ago

Exponential backoff is something we're looking at. Unfortunately it doesn't mesh well with the time-based retry system that the KPL uses. Exponential backoff, when combined with jitter, will cause records to expire in unexpected ways.

As a sort of informal poll, which do people prefer:

samcday commented 7 years ago

I think an easy way to mitigate your concerns about record TTL is to simply allow users to configure a maximum exponential backoff. That way, if I've configured my record TTL to be something like 5 minutes and my maximum backoff to be 1 minute, I can be sure that the record will get retried a reasonable number of times before it expires.
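
To make that arithmetic concrete, here is a minimal Python sketch of capped exponential backoff with full jitter. It is not KPL code: the function names, the 1-second base delay, and the retry model (request latency ignored) are all assumptions for illustration. Under these assumptions jitter only shortens each wait below its ceiling, so the no-jitter schedule gives a lower bound on how many retries fit inside the record TTL.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    Hypothetical helper: doubles the ceiling on each attempt, clamps it
    to `cap`, then applies full jitter (uniform in [0, ceiling)).
    """
    # Clamp the exponent so very large attempt counts cannot overflow.
    ceiling = min(cap, base * 2 ** min(attempt, 32))
    return rng() * ceiling


def min_retries_within_ttl(ttl, base=1.0, cap=60.0):
    """Lower bound on retries that fit in `ttl` seconds.

    Uses the worst case, where every wait equals its ceiling (no jitter
    shrinkage). Request latency is ignored, which is an assumption.
    """
    if base <= 0 or cap <= 0:
        raise ValueError("base and cap must be positive")
    elapsed = 0.0
    retries = 0
    delay = min(cap, base)
    while elapsed + delay <= ttl:
        elapsed += delay
        retries += 1
        delay = min(cap, delay * 2)
    return retries


# The example from the comment: 5 minute TTL, 1 minute maximum backoff.
print(min_retries_within_ttl(300, base=1.0, cap=60.0))  # 9
```

With a 1-second base the worst-case waits are 1, 2, 4, 8, 16, 32, 60, 60, 60 seconds (243 seconds in total), so at least nine retries happen before a 5-minute TTL runs out.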

pfifer commented 7 years ago

@samcday I like the idea of using the TTL of the record to calculate exponential back off.

We're still investigating this, and prioritizing this with other customer requests.

For others who might be impacted, could you please add a reaction or response to help us prioritize the work?

Thanks

jinbuilds commented 6 years ago

@pfifer was this bug/potential improvement ever done?

ObviousDWest commented 5 years ago

BTW, there is a conflict in the documentation. The FailIfThrottled comment here says there is backoff: https://github.com/awslabs/amazon-kinesis-producer/blob/master/java/amazon-kinesis-producer-sample/default_config.properties. But this page says there is no backoff algorithm: https://docs.aws.amazon.com/streams/latest/dev/kinesis-producer-adv-retries-rate-limiting.html.

As to preference, I'd sure love at least a config setting that adds a sleep after receiving a ThroughputExceeded. Right now, KPL goes to 100% CPU and starts starving my app of cycles, leading to app errors. This is really hard to account for (allocate half a CPU for KPL in the rare case I exceed my shard allocation?), and it makes the app unstable. It is also hard to adjust the rate limiter: our app is elastic, so I never know how many instances are currently sharing a shard, and I can't specify a rate limit percentage. Some kind of dynamic backoff is required. Maybe you could halve the rate limiter for that shard each time there is a throughput error (up to some limit), and reset it to the configured value each time it succeeds.
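
A rough Python sketch of that last idea, purely to illustrate the proposed behavior. Nothing here is part of the KPL: the class name, the callbacks, and the 100% and 10% values are hypothetical.

```python
class ThrottleAwareRateLimit:
    """Per-shard rate limit that halves on throttling and resets on success.

    Hypothetical sketch of the proposal above, not a KPL API.
    """

    def __init__(self, configured_percent=100.0, floor_percent=10.0):
        if not 0 < floor_percent <= configured_percent:
            raise ValueError("need 0 < floor_percent <= configured_percent")
        self.configured_percent = configured_percent
        self.floor_percent = floor_percent
        self.current_percent = configured_percent

    def on_throttled(self):
        """Halve the limit after a throughput error, never below the floor."""
        self.current_percent = max(self.floor_percent, self.current_percent / 2)
        return self.current_percent

    def on_success(self):
        """Restore the configured limit after a successful put."""
        self.current_percent = self.configured_percent
        return self.current_percent


limit = ThrottleAwareRateLimit(configured_percent=100.0, floor_percent=10.0)
for _ in range(4):
    limit.on_throttled()      # 50.0, 25.0, 12.5, then clamped to 10.0
print(limit.current_percent)  # 10.0
limit.on_success()
print(limit.current_percent)  # 100.0
```

Resetting straight to the configured value on the first success follows the suggestion as written. A gentler recovery, such as doubling back up step by step, would be a different design choice and might avoid immediately re-triggering throttling.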