awslabs / amazon-kinesis-client

Client library for Amazon Kinesis

It is hard to coordinate zero downtime restarts of KCL 2 consumer #474

Closed bbeaudreault closed 6 months ago

bbeaudreault commented 5 years ago

I'm using the 2.0.5 version of the KCL with enhanced fan-out on a stream with 8 shards. Typically I run 4 processes pulling from these 8 shards, so 2 shards per consumer. When we deploy new code, we have a process which does the following (N is currently 4, but may scale over time):

1. Start N new consumer processes alongside the existing ones.
2. Wait for the new processes to come up and start taking leases.
3. Gracefully shut down the N old processes.

This only partially works right now. The first two steps work fine: when the new processes start up, they begin pulling shards so that the distribution stays even. At this point we have 8 consumers, each with 1 shard.

However, when the graceful shutdown step starts, we enter a window of downtime. The graceful shutdown itself finishes within a few seconds, which I believe includes releasing the leases. But the new consumers take up to an additional ~20 seconds to pick up those released leases. This is problematic because I am operating a real-time processing system in which 20+ seconds of lag on every deploy is unacceptable.
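For context, the shutdown step boils down to the stock KCL graceful shutdown. A minimal sketch, assuming a recent 2.x `Scheduler` and that `scheduler` is the app's running instance:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import software.amazon.kinesis.coordinator.Scheduler;

// Hypothetical deploy hook. startGracefulShutdown() checkpoints the record
// processors and releases this worker's leases.
static void shutdownForDeploy(Scheduler scheduler) throws Exception {
    CompletableFuture<Boolean> done = scheduler.startGracefulShutdown();
    done.get(30, TimeUnit.SECONDS);
    // The leases are now free, but the surviving workers won't attempt to
    // take them until their next scheduled takeLeases pass, up to ~20 s later.
}
```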

I think this is possibly related to the following logline:

```
With failover time 10000 ms and epsilon 25 ms, LeaseCoordinator will renew leases every 3308 ms, takeleases every 20050 ms, process maximum of 2147483647 leases and steal 1 lease(s) at a time.
```

Based on this, I suppose I can lower the failover time to bring the takeLeases interval down with it (the logged values fit takeLeases interval = 2 × (failover time + epsilon) = 20050 ms and renewal interval = failover time / 3 − epsilon = 3308 ms). But that probably has performance implications over the lifetime of a process, when it really only matters during startup and failover. Lowering it too far also risks spurious lease loss during long garbage collection pauses, since a worker that can't renew within the failover time will have its leases stolen.
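For anyone else hitting this, here is a minimal sketch of turning that knob via the 2.x config builder. The value is illustrative, not a recommendation, and `streamName`, `applicationName`, the async SDK v2 clients, `workerId`, and `MyRecordProcessorFactory` are all assumed to exist in your app already:

```java
import software.amazon.kinesis.common.ConfigsBuilder;
import software.amazon.kinesis.leases.LeaseManagementConfig;

// Sketch only; all identifiers below except the KCL classes are hypothetical
// placeholders for the app's usual KCL 2.x setup.
ConfigsBuilder configs = new ConfigsBuilder(
        streamName, applicationName,
        kinesisClient, dynamoClient, cloudWatchClient,
        workerId, new MyRecordProcessorFactory());

// The default failoverTimeMillis is 10000, which produces the ~20 s takeLeases
// interval in the log line above (2 x (10000 + 25) = 20050 ms). Dropping it to
// 3000 ms would cut lease pickup to ~6 s, at the price of more DynamoDB traffic
// and less headroom for GC pauses before other workers steal the leases.
LeaseManagementConfig leaseConfig = configs.leaseManagementConfig()
        .failoverTimeMillis(3000L);
```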

Alternatively, if leases were stored in ZooKeeper or etcd, you could use a watcher to be notified of the change in near real time. From a brief search, it looks like DynamoDB Streams could be used in a similar fashion.
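To make that concrete, here's a rough sketch (mine, not anything in the KCL) of watching the lease table's stream for releases. It assumes the lease table has a stream with NEW_AND_OLD_IMAGES, watches only a single stream shard for brevity, and `onLeaseReleased` is a hypothetical hook, since the KCL currently exposes no way to trigger an immediate lease take:

```java
import java.util.Map;
import java.util.function.Consumer;

import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.DescribeStreamRequest;
import software.amazon.awssdk.services.dynamodb.model.GetRecordsRequest;
import software.amazon.awssdk.services.dynamodb.model.GetRecordsResponse;
import software.amazon.awssdk.services.dynamodb.model.GetShardIteratorRequest;
import software.amazon.awssdk.services.dynamodb.model.Record;
import software.amazon.awssdk.services.dynamodb.model.ShardIteratorType;
import software.amazon.awssdk.services.dynamodb.streams.DynamoDbStreamsClient;

// Rough sketch: poll the lease table's DynamoDB Stream and invoke a callback
// the moment a lease loses its owner, instead of waiting for the next
// scheduled takeLeases pass. Single-shard only; a real implementation would
// track all stream shards and handle shard rotation.
public class LeaseReleaseWatcher implements Runnable {
    private final DynamoDbStreamsClient streams;
    private final String leaseTableStreamArn;
    private final Consumer<String> onLeaseReleased; // hypothetical hook

    public LeaseReleaseWatcher(DynamoDbStreamsClient streams,
                               String leaseTableStreamArn,
                               Consumer<String> onLeaseReleased) {
        this.streams = streams;
        this.leaseTableStreamArn = leaseTableStreamArn;
        this.onLeaseReleased = onLeaseReleased;
    }

    @Override
    public void run() {
        String shardId = streams.describeStream(DescribeStreamRequest.builder()
                        .streamArn(leaseTableStreamArn).build())
                .streamDescription().shards().get(0).shardId();

        String iterator = streams.getShardIterator(GetShardIteratorRequest.builder()
                        .streamArn(leaseTableStreamArn)
                        .shardId(shardId)
                        .shardIteratorType(ShardIteratorType.LATEST)
                        .build())
                .shardIterator();

        while (iterator != null && !Thread.currentThread().isInterrupted()) {
            GetRecordsResponse resp = streams.getRecords(
                    GetRecordsRequest.builder().shardIterator(iterator).build());
            for (Record record : resp.records()) {
                Map<String, AttributeValue> oldImage = record.dynamodb().oldImage();
                Map<String, AttributeValue> newImage = record.dynamodb().newImage();
                // The lease table drops "leaseOwner" when a worker gives a
                // lease up: owned before the update, unowned after it.
                if (oldImage.containsKey("leaseOwner")
                        && !newImage.containsKey("leaseOwner")
                        && newImage.containsKey("leaseKey")) {
                    onLeaseReleased.accept(newImage.get("leaseKey").s());
                }
            }
            iterator = resp.nextShardIterator();
            try {
                Thread.sleep(250); // modest pause between GetRecords polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

Triggering an immediate takeLeases from that callback would need a hook inside LeaseCoordinator, which is exactly the mechanism this issue is asking for.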

So this request is to provide some mechanism for zero-downtime restarts. That could take the form of a better API for changing the lease-polling frequency during startup, or (more ideally) using DynamoDB Streams or similar to alert all workers in near real time that there are leases ready for the taking. The latter would also improve general failover times.

bbeaudreault commented 5 years ago

Note: to speed up our real-time ingestion while waiting on this issue, I took three actions internally:

This (along with https://github.com/awslabs/amazon-kinesis-client/issues/475) has reduced my 99th percentile lag times from nearly 1 minute to under 8 seconds. Ideally I can get this lower at some point, but for now it suffices.

Building a more robust implementation using DynamoDB Streams would be a great way to improve this lag for all users doing real-time processing.