awslabs / amazon-kinesis-client

Client library for Amazon Kinesis

It is hard to coordinate zero downtime restarts of KCL 2 consumer #474

Closed bbeaudreault closed 6 months ago

bbeaudreault commented 5 years ago

I'm using the 2.0.5 version of the KCL with enhanced fan-out on a stream with 8 shards. Typically I run 4 processes pulling from these 8 shards, so 2 shards per consumer. When we deploy new code, we have a process which does the following (N is currently 4, but may scale over time):

1. Start N new consumer processes alongside the existing ones.
2. Wait for the new processes to come up and start taking leases.
3. Gracefully shut down the N old processes.

This only partially works right now. The first two steps work fine: when the new processes start up, they begin pulling shards so that the distribution stays even. At this point we have 8 consumers, each with 1 shard.

However, when the graceful shutdown step starts, we enter a window of downtime. The graceful shutdown itself finishes within a few seconds, which I believe includes releasing the leases. But the new consumers take up to an additional ~20 seconds to pick up those released leases. This is problematic because I am operating a real-time processing system in which 20+ seconds of lag on every deploy is unacceptable.
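For context, the shutdown step boils down to the stock KCL graceful shutdown. A minimal sketch, assuming a recent 2.x `Scheduler` and that `scheduler` is the app's running instance:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import software.amazon.kinesis.coordinator.Scheduler;

// Hypothetical deploy hook. startGracefulShutdown() checkpoints the record
// processors and releases this worker's leases.
static void shutdownForDeploy(Scheduler scheduler) throws Exception {
    CompletableFuture<Boolean> done = scheduler.startGracefulShutdown();
    done.get(30, TimeUnit.SECONDS);
    // The leases are now free, but the surviving workers won't attempt to
    // take them until their next scheduled takeLeases pass, up to ~20 s later.
}
```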

I think this is possibly related to the following logline:

```
With failover time 10000 ms and epsilon 25 ms, LeaseCoordinator will renew leases every 3308 ms, takeleases every 20050 ms, process maximum of 2147483647 leases and steal 1 lease(s) at a time.
```

Based on this, I suppose I can lower the failover time to bring the takeLeases interval down with it (the logged values fit takeLeases interval = 2 × (failover time + epsilon) = 20050 ms and renewal interval = failover time / 3 − epsilon = 3308 ms). But that probably has performance implications over the lifetime of a process, when it really only matters during startup and failover. Lowering it too far also risks spurious lease loss during long garbage collection pauses, since a worker that can't renew within the failover time will have its leases stolen.
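For anyone else hitting this, here is a minimal sketch of turning that knob via the 2.x config builder. The value is illustrative, not a recommendation, and `streamName`, `applicationName`, the async SDK v2 clients, `workerId`, and `MyRecordProcessorFactory` are all assumed to exist in your app already:

```java
import software.amazon.kinesis.common.ConfigsBuilder;
import software.amazon.kinesis.leases.LeaseManagementConfig;

// Sketch only; all identifiers below except the KCL classes are hypothetical
// placeholders for the app's usual KCL 2.x setup.
ConfigsBuilder configs = new ConfigsBuilder(
        streamName, applicationName,
        kinesisClient, dynamoClient, cloudWatchClient,
        workerId, new MyRecordProcessorFactory());

// The default failoverTimeMillis is 10000, which produces the ~20 s takeLeases
// interval in the log line above (2 x (10000 + 25) = 20050 ms). Dropping it to
// 3000 ms would cut lease pickup to ~6 s, at the price of more DynamoDB traffic
// and less headroom for GC pauses before other workers steal the leases.
LeaseManagementConfig leaseConfig = configs.leaseManagementConfig()
        .failoverTimeMillis(3000L);
```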

Alternatively, if leases were stored in ZooKeeper or etcd, you could use a watcher to be notified of the change in near real time. From a brief search, it looks like DynamoDB Streams could be used in a similar fashion.
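To make that concrete, here's a rough sketch (mine, not anything in the KCL) of watching the lease table's stream for releases. It assumes the lease table has a stream with NEW_AND_OLD_IMAGES, watches only a single stream shard for brevity, and `onLeaseReleased` is a hypothetical hook, since the KCL currently exposes no way to trigger an immediate lease take:

```java
import java.util.Map;
import java.util.function.Consumer;

import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.DescribeStreamRequest;
import software.amazon.awssdk.services.dynamodb.model.GetRecordsRequest;
import software.amazon.awssdk.services.dynamodb.model.GetRecordsResponse;
import software.amazon.awssdk.services.dynamodb.model.GetShardIteratorRequest;
import software.amazon.awssdk.services.dynamodb.model.Record;
import software.amazon.awssdk.services.dynamodb.model.ShardIteratorType;
import software.amazon.awssdk.services.dynamodb.streams.DynamoDbStreamsClient;

// Rough sketch: poll the lease table's DynamoDB Stream and invoke a callback
// the moment a lease loses its owner, instead of waiting for the next
// scheduled takeLeases pass. Single-shard only; a real implementation would
// track all stream shards and handle shard rotation.
public class LeaseReleaseWatcher implements Runnable {
    private final DynamoDbStreamsClient streams;
    private final String leaseTableStreamArn;
    private final Consumer<String> onLeaseReleased; // hypothetical hook

    public LeaseReleaseWatcher(DynamoDbStreamsClient streams,
                               String leaseTableStreamArn,
                               Consumer<String> onLeaseReleased) {
        this.streams = streams;
        this.leaseTableStreamArn = leaseTableStreamArn;
        this.onLeaseReleased = onLeaseReleased;
    }

    @Override
    public void run() {
        String shardId = streams.describeStream(DescribeStreamRequest.builder()
                        .streamArn(leaseTableStreamArn).build())
                .streamDescription().shards().get(0).shardId();

        String iterator = streams.getShardIterator(GetShardIteratorRequest.builder()
                        .streamArn(leaseTableStreamArn)
                        .shardId(shardId)
                        .shardIteratorType(ShardIteratorType.LATEST)
                        .build())
                .shardIterator();

        while (iterator != null && !Thread.currentThread().isInterrupted()) {
            GetRecordsResponse resp = streams.getRecords(
                    GetRecordsRequest.builder().shardIterator(iterator).build());
            for (Record record : resp.records()) {
                Map<String, AttributeValue> oldImage = record.dynamodb().oldImage();
                Map<String, AttributeValue> newImage = record.dynamodb().newImage();
                // The lease table drops "leaseOwner" when a worker gives a
                // lease up: owned before the update, unowned after it.
                if (oldImage.containsKey("leaseOwner")
                        && !newImage.containsKey("leaseOwner")
                        && newImage.containsKey("leaseKey")) {
                    onLeaseReleased.accept(newImage.get("leaseKey").s());
                }
            }
            iterator = resp.nextShardIterator();
            try {
                Thread.sleep(250); // modest pause between GetRecords polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

Triggering an immediate takeLeases from that callback would need a hook inside LeaseCoordinator, which is exactly the mechanism this issue is asking for.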

So this request is to provide some mechanism for zero-downtime restarts. That could take the form of a better API for changing the lease-polling frequency during startup, or (more ideally) using DynamoDB Streams or similar to alert all workers in near real time that there are leases ready for the taking. The latter would also improve general failover times.

bbeaudreault commented 5 years ago

Note: to speed up our real-time ingestion while waiting on this issue, I took three actions internally:

This (along with https://github.com/awslabs/amazon-kinesis-client/issues/475) has reduced my 99th percentile lag times from nearly 1 minute to under 8 seconds. Ideally I can get this lower at some point, but for now it suffices.

Building a more robust implementation using DynamoDB Streams would be a great way to improve this lag for all users doing real-time processing.