Note: in order to speed up our realtime ingestion while waiting on this issue, I took three actions internally. One of those was a thread that kicks in once the worker reaches WorkerState.STARTED and runs only once. On a separately configurable interval (currently 250ms), this thread runs LeaseCoordinator#runLeaseTaker and then checks LeaseCoordinator#getCurrentAssignments. It loops until the expected number of leases is assigned, or a max time (currently 5 minutes) has been exceeded.

This (along with https://github.com/awslabs/amazon-kinesis-client/issues/475) has reduced my 99th percentile lag times from nearly 1 minute to under 8 seconds. Ideally I can get this lower at some point, but for now it suffices.
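For illustration, that startup thread looks roughly like the sketch below. It is simplified: the class and field names are made up, and the exact LeaseCoordinator signatures and exceptions may differ from what my fork uses.

```java
import software.amazon.kinesis.leases.LeaseCoordinator; // KCL 2.x internal interface

// Hypothetical "fast lease takeup" thread, started once the worker reaches
// WorkerState.STARTED. It runs the lease taker on a short interval until this
// worker holds its expected share of leases, or a max wait time elapses.
public class FastLeaseTakeup implements Runnable {
    private final LeaseCoordinator leaseCoordinator; // obtained from the worker internals (fork/reflection)
    private final int expectedLeases;                // e.g. total shards / number of workers
    private final long intervalMillis = 250L;        // separately configurable interval
    private final long maxWaitMillis = 5 * 60_000L;  // give up after 5 minutes

    public FastLeaseTakeup(LeaseCoordinator leaseCoordinator, int expectedLeases) {
        this.leaseCoordinator = leaseCoordinator;
        this.expectedLeases = expectedLeases;
    }

    @Override
    public void run() {
        final long deadline = System.currentTimeMillis() + maxWaitMillis;
        while (System.currentTimeMillis() < deadline) {
            try {
                // Assumed to return the leases this worker currently holds.
                if (leaseCoordinator.getCurrentAssignments().size() >= expectedLeases) {
                    return; // we have our expected share; stop the fast loop
                }
                leaseCoordinator.runLeaseTaker(); // try to grab any available leases right now
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            } catch (Exception e) {
                // Lease-taker failures are non-fatal here; normal KCL scheduling
                // will eventually balance the leases anyway.
            }
        }
    }
}
```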
Building a more robust implementation using DynamoDB streams would be a great way to improve this lag for all users doing realtime processing.
I'm using the 2.0.5 version of the KCL, using Enhanced fan-out on a stream with 8 shards. Typically I run with 4 processes pulling from these 8 shards, so 2 shards per consumer. When we do deploys of new code, we have a process which does the following (N is currently 4, but may scale over time):

1. Spin up N new consumers.
2. Wait for the new consumers to start up and begin pulling their share of shards.
3. Gracefully shut down the N old consumers.
This only partially works right now. The first two steps work fine: when the new consumers start up, they begin pulling shards so that distribution is even. At this point we have 8 consumers, each with 1 shard.

However, when the graceful shutdown part starts, we enter a period of downtime. The graceful shutdown itself finishes within a few seconds, which I believe includes releasing the leases. But the new consumers take up to an additional ~20 seconds to pick up those leases. This is a problem for me because I am operating a real-time processing system in which 20+ seconds of lag on every deploy is unacceptable.
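For context, the graceful shutdown step is essentially the standard KCL 2.x call, roughly like this (sketch only; class and method names around it are made up, and error handling is omitted):

```java
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import software.amazon.kinesis.coordinator.Scheduler;

public class DeployShutdownStep {
    // Assumes "scheduler" is the KCL 2.x Scheduler that an old consumer process
    // has been running (enhanced fan-out is the default retrieval mode in 2.x).
    static void shutDownOldConsumer(Scheduler scheduler) throws Exception {
        Future<Boolean> gracefulShutdown = scheduler.startGracefulShutdown();
        // The shutdown itself completes within a few seconds and releases the
        // worker's leases; the ~20 seconds of lag comes from the remaining
        // workers not noticing the freed leases until their next lease-taker run.
        boolean clean = gracefulShutdown.get(30, TimeUnit.SECONDS);
        System.out.println("Graceful shutdown finished cleanly: " + clean);
    }
}
```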
I think this is possibly related to the following logline:
Based on this, I suppose I could lower the failover time to in turn lower the lease-taking interval. But that probably has performance implications over the lifetime of a process, when it really only matters during startup and failover. Lowering it too far also opens us up to losing leases during long garbage-collection pauses.
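For concreteness, this is the knob I mean; something like the following (a sketch, assuming the usual KCL 2.x ConfigsBuilder setup; the helper name is made up):

```java
import software.amazon.kinesis.common.ConfigsBuilder;
import software.amazon.kinesis.leases.LeaseManagementConfig;

public class LeaseTuning {
    // Assumes configsBuilder was created with the usual stream/application/client
    // arguments. The lease-taker interval is derived from failoverTimeMillis, so
    // lowering it speeds up lease pickup, but it also makes a worker more likely
    // to lose its leases during a long GC pause.
    static LeaseManagementConfig tunedLeaseConfig(ConfigsBuilder configsBuilder) {
        return configsBuilder.leaseManagementConfig()
                .failoverTimeMillis(5_000L); // down from the 10-second default
    }
}
```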
Alternatively, if leases were based on ZooKeeper or etcd, you could use a watcher to be notified in near-realtime about the change. Based on a brief search, it seems like DynamoDB Streams could be used in a similar fashion.

So this request is to provide some mechanism for zero-downtime restarts. This may be in the form of a better API for changing the polling frequency during startup, or (more ideally) by using DynamoDB Streams or something similar to alert all shard consumers in near-realtime that there are leases ready for the taking. The latter would also improve general failover times.
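To make that concrete, a watcher could look roughly like the sketch below. This is not an existing KCL feature: the class is hypothetical, it assumes a stream has been enabled on the lease table, and it skips over stream resharding, checkpointing, and error handling.

```java
import java.util.List;
import software.amazon.awssdk.services.dynamodb.model.DescribeStreamRequest;
import software.amazon.awssdk.services.dynamodb.model.GetRecordsRequest;
import software.amazon.awssdk.services.dynamodb.model.GetRecordsResponse;
import software.amazon.awssdk.services.dynamodb.model.GetShardIteratorRequest;
import software.amazon.awssdk.services.dynamodb.model.Shard;
import software.amazon.awssdk.services.dynamodb.model.ShardIteratorType;
import software.amazon.awssdk.services.dynamodb.streams.DynamoDbStreamsClient;
import software.amazon.kinesis.leases.LeaseCoordinator;

// Hypothetical watcher: tail the DynamoDB stream of the KCL lease table and run
// the lease taker as soon as any lease record changes, instead of waiting for
// the next scheduled taker run.
public class LeaseTableWatcher implements Runnable {
    private final DynamoDbStreamsClient streams = DynamoDbStreamsClient.create();
    private final String leaseTableStreamArn; // stream enabled on the lease table
    private final LeaseCoordinator leaseCoordinator;

    public LeaseTableWatcher(String leaseTableStreamArn, LeaseCoordinator leaseCoordinator) {
        this.leaseTableStreamArn = leaseTableStreamArn;
        this.leaseCoordinator = leaseCoordinator;
    }

    @Override
    public void run() {
        try {
            List<Shard> shards = streams.describeStream(
                    DescribeStreamRequest.builder().streamArn(leaseTableStreamArn).build())
                    .streamDescription().shards();
            // For simplicity, watch only the first shard of the lease table's stream.
            String iterator = streams.getShardIterator(GetShardIteratorRequest.builder()
                    .streamArn(leaseTableStreamArn)
                    .shardId(shards.get(0).shardId())
                    .shardIteratorType(ShardIteratorType.LATEST)
                    .build()).shardIterator();

            while (iterator != null) {
                GetRecordsResponse response = streams.getRecords(
                        GetRecordsRequest.builder().shardIterator(iterator).build());
                if (!response.records().isEmpty()) {
                    // A lease row changed (taken, released, or expired): react immediately.
                    leaseCoordinator.runLeaseTaker();
                }
                iterator = response.nextShardIterator();
                Thread.sleep(250L);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (Exception e) {
            // In a real implementation we'd log and retry rather than die silently.
        }
    }
}
```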