awslabs / amazon-kinesis-scaling-utils

The Kinesis Scaling Utility is designed to give you the ability to scale Amazon Kinesis Streams in the same way that you scale EC2 Auto Scaling groups – up or down by a count or as a percentage of the total fleet. You can also simply scale to an exact number of Shards. There is no requirement for you to manage the allocation of the keyspace to Shards when using this API, as it is done automatically.
Apache License 2.0

Smoother scaling up by having checkInterval configurable and a cool off period when upscaling #87

Closed smallo closed 4 years ago

smallo commented 4 years ago

Given the scale up configuration:

    "scaleUp": {
      "scaleThresholdPct": 90,
      "scaleAfterMins": 5,
      "scaleCount": 2
    }

Because checkInterval is hardcoded to 45 seconds, and because a resharding operation can take longer than that, amazon-kinesis-scaling-utils often ends up scaling up several times when once would have been enough.

In that situation, the total number of shards ends up greater than it should be, so the scale-down policy then has to remove the unnecessary shards. Worse, many intermediate shards are created when scaling up and down (in order to distribute the hash keys evenly among the final open shards), which occasionally leaves consumers using the KCL unbalanced, with most of the open shards concentrated in one or a few consumers.

This effect creates several problems:

Two easy changes to the JSON configuration would fix this problem: making checkInterval configurable, and adding a cool-off period after scaling up (scaleUp.coolOffMins). They can be used independently or together.

With these two changes, the user can increase checkInterval to a value greater than the time resharding takes, which also reduces traffic to CloudWatch. Or, if the user wants to react much faster to a traffic increase, they can set scaleUp.coolOffMins to give the system time to finish resharding.

These two parameters would also help adapt how a given system reacts to specific traffic patterns, since they provide configuration options closer to those available in EC2.
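To make the effect concrete, here is a rough Python sketch of how a cool-off gate might interact with the check interval. It is not the utility's actual code: `ScaleUpGate`, `count_scale_ups`, and the timings are hypothetical, and it assumes the metric stays above the threshold for the whole window (for example, while resharding is still in progress). It also ignores scaleAfterMins for simplicity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScaleUpGate:
    """Hypothetical gate; cool_off_secs mirrors the proposed scaleUp.coolOffMins."""
    cool_off_secs: float
    last_scale_up: Optional[float] = None

    def try_scale_up(self, now: float) -> bool:
        # Refuse to scale again while the previous scale-up is still cooling off.
        if (self.last_scale_up is not None
                and now - self.last_scale_up < self.cool_off_secs):
            return False
        self.last_scale_up = now
        return True


def count_scale_ups(check_interval_secs: float,
                    cool_off_secs: float,
                    overloaded_for_secs: float) -> int:
    """Count scale-up actions while the metric is assumed to stay above threshold."""
    gate = ScaleUpGate(cool_off_secs)
    count = 0
    now = 0.0
    while now < overloaded_for_secs:
        # Each check sees the stream as overloaded (assumption of this sketch).
        if gate.try_scale_up(now):
            count += 1
        now += check_interval_secs
    return count


if __name__ == "__main__":
    # Assumed 180-second overloaded window; values are illustrative only.
    print(count_scale_ups(45, 0, 180))    # 45 s checks, no cool-off
    print(count_scale_ups(45, 300, 180))  # 45 s checks, 5 min cool-off
    print(count_scale_ups(240, 0, 180))   # longer check interval, no cool-off
```

Under these assumed numbers, 45-second checks with no cool-off trigger four scale-ups, while either a 5-minute cool-off or a 240-second check interval triggers only one.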

IanMeyers commented 4 years ago

Makes sense. Will see what I can do.