The kinesis autoscaler can't scale up by less than double

moskyb commented 3 years ago

So a little while ago, after running into issues using a scale up percentage of less than 100%, I submitted this PR. My understanding was if I had an (abridged) config like:

"scaleUp": {
  "scaleThresholdPct": 75,
  "scaleAfterMins": 1,
  "scalePct": 115 
}

and I had a stream that currently had 100 shards, that the kinesis autoscaler would say "100 shards * 1.15, okay the stream will have 115 shards when I scale up".

As far as I can tell from looking at the code though, that's not actually the case, as this line of code indicates that the autoscaler interprets scalePct: 115 as "Add 115% the shard's current capacity to it's existing capacity". This means that scalePct: 115 on a stream with 100 shards will actually scale the stream up to 215 shards.

The issue here isn't that this is the behaviour, that's totally fine; however, the config parser will throw an error if scaleUpPct is less than 100, meaning that any scale up operation must at least double the capacity of the stream.

I'm happy to go in an modify this in whatever way is necessary - change it so that we can use a scaleUpPct > 100, or change the scaling behaviour, but I'm not sure what the actual expected behaviour is - I'm hoping the maintainers can provide some clarity on this :)

IanMeyers commented 3 years ago

Yes, it appears that this logic has diverged. Very unfortunate - in this case I believe the config parser should be modified as folks will have configurations that rely on the setting of this value. Happy to take a PR for this or can fix sometime next week.

IanMeyers commented 3 years ago

Nevermind - fixing it now

IanMeyers commented 3 years ago

This should be fixed in 81b6fb8, version .9.8.3

moskyb commented 3 years ago

@IanMeyers what's the status of 9.8.3? Is it coming any time soon?

IanMeyers commented 3 years ago

.9.8.4 is here: https://github.com/awslabs/amazon-kinesis-scaling-utils/releases/tag/v.9.8.4

rebecca2000 commented 3 years ago

@IanMeyers @moskyb I noticed that the autoscaler still fails to scale up by less than double. Excerpt from config:

"scaleUp": {
    "scaleThresholdPct": 80,
    "scaleAfterMins": 1,
    "scalePct": 20,
    "coolOffMins": 5,
},

I expect this to add 20% more capacity to the stream. Observed behaviour: autoscaler detects it needs to scale up but fails to do so (current shard count = 1)

Jul 16 05:09:38 ip-172-31-32-32 server: 05:09:38.826 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Utilisation of PUT Records 105.62% at 16/07/2021, 05:08 upon current value of 1056.23 and Stream max of 1000.00
Jul 16 05:09:38 ip-172-31-32-32 server: 05:09:38.826 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - PUT Records performance analysis: 1 high samples, and 0 low samples
Jul 16 05:09:38 ip-172-31-32-32 server: 05:09:38.826 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Will decide scaling action based on metric PUT[Records] due to highest utilisation metric value 105.62%
Jul 16 05:09:38 ip-172-31-32-32 server: 05:09:38.826 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Scaling Votes - GET: DOWN, PUT: UP
Jul 16 05:09:38 ip-172-31-32-32 server: 05:09:38.875 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Next Check Cycle in 60 seconds
Jul 16 05:10:00 ip-172-31-32-32 dhclient[2735]: XMT: Solicit on eth0, interval 116910ms.
Jul 16 05:10:01 ip-172-31-32-32 systemd: Started Session 5 of user root.
Jul 16 05:10:01 ip-172-31-32-32 systemd: Started Session 6 of user root.
Jul 16 05:10:38 ip-172-31-32-32 server: 05:10:38.876 [pool-2-thread-1] INFO  c.a.s.k.s.auto.StreamMetricManager - Requesting 1 minutes of CloudWatch Data for Stream Metric GetRecords.Bytes

If I understand correctly, this line of code should add the current shard count to the scale up percent.

Could someone look into this? Thanks!

IanMeyers commented 3 years ago

Yep this looks like it should definitely be scaling up to 2 shards based upon a 105.62% put records threshold. Can you please confirm that you are running version .9.8.4?

rebecca2000 commented 3 years ago

Yep I am running that version, taken from the link in README

IanMeyers commented 3 years ago

OK - if you could please deploy the .9.8.5 version that's been uploaded into the /dist folder, and then please turn on DEBUG logging (Beanstalk application parameter LOG_LEVEL=DEBUG), we should be able to get more details about why it's deciding not to scale.

rebecca2000 commented 3 years ago

Thanks, the reason is "Not requesting a scaling action because new shard count equals current shard count, or new shard count is 0"

Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.320 [pool-2-thread-1] INFO  c.a.s.k.s.auto.StreamMetricManager - Requesting 1 minutes of CloudWatch Data for Stream Metric PutRecords.Records
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.368 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - GET Bytes performance analysis: 0 high samples, and 2 low samples
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.369 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - GET Records performance analysis: 0 high samples, and 2 low samples
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.369 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Will decide scaling action based on metric GET[Bytes] due to highest utilisation metric value 0.00%
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.369 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Utilisation of PUT Bytes 3.91% at 22/07/2021, 06:03 upon current value of 40986.02 and Stream max of 1048576.00
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.369 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - PUT Bytes performance analysis: 0 high samples, and 1 low samples
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.371 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Utilisation of PUT Records 98.60% at 22/07/2021, 06:03 upon current value of 985.95 and Stream max of 1000.00
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.371 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - PUT Records performance analysis: 1 high samples, and 0 low samples
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.371 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Will decide scaling action based on metric PUT[Records] due to highest utilisation metric value 98.60%
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.371 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Scaling Votes - GET: DOWN, PUT: UP
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.421 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Not requesting a scaling action because new shard count equals current shard count, or new shard count is 0
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.421 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Next Check Cycle in 60 seconds
Jul 22 06:05:47 ip-172-31-46-122 server: 06:05:47.421 [pool-2-thread-1] INFO  c.a.s.k.s.auto.StreamMetricManager - Requesting 1 minutes of CloudWatch Data for Stream Metric GetRecords.Bytes

IanMeyers commented 3 years ago

Great - can you please turn on DEBUG level logging, and we'll be able to see exactly what the calculation was?

rebecca2000 commented 3 years ago

All good, here are the additional logs

Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.490 [pool-2-thread-1] INFO  c.a.s.k.s.auto.StreamMetricManager - Requesting 1 minutes of CloudWatch Data for Stream Metric GetRecords.Records
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.527 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Utilisation of PUT Bytes 4.18% at 23/07/2021, 05:38 upon current value of 43878.37 and Stream max of 1048576.00
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.529 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - PUT Bytes performance analysis: 0 high samples, and 1 low samples
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.530 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Utilisation of PUT Records 105.59% at 23/07/2021, 05:38 upon current value of 1055.93 and Stream max of 1000.00
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.531 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - PUT Records performance analysis: 1 high samples, and 0 low samples
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.532 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Will decide scaling action based on metric PUT[Records] due to highest utilisation metric value 105.59%
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.532 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - GET Bytes performance analysis: 0 high samples, and 2 low samples
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.532 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - GET Records performance analysis: 0 high samples, and 2 low samples
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.532 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Will decide scaling action based on metric GET[Bytes] due to highest utilisation metric value 0.00%
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.532 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Scaling Votes - GET: DOWN, PUT: UP
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.532 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Determined Scaling Direction UP
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.577 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Current Shard Count: 1
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.578 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Calculated new Target Shard Count of 1
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.578 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Not requesting a scaling action because new shard count equals current shard count, or new shard count is 0
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.579 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Next Check Cycle in 60 seconds

IanMeyers commented 3 years ago

Hello,

So I missed it in your config the first time. Through version .9.8.6, having a scalePct less than 100 doesn't often result in any action being taken with Streams with a very low number of Shards - as we've observed here. However, if you install version .9.8.7, I've now extended this logic which has tripped up customers for ages. So now any scalePct will result in a scaling action being taken, even asking to scale up by 20% on 1 shard, which may mean you are over-provisioned. Also, the way that scaleDown configurations were expressed was really confusing for the same reasons. There is new documentation on this in the README.md, and you can find a set of examples in a unit test for the scaling calculation if you are interested. Please let me know if this meets your expectations?

Thx,

Ian

rebecca2000 commented 3 years ago

Thanks Ian, the unit tests are really helpful and the documentation is clear :) One small thing I noticed is that a scale up action will always add at least one shard, while scaling down might not change the shard count (apart from min shardCount = 1 of course). So for this case -

@Test
public void testScaleDownBoundary() {
    assertEquals(9, StreamScalingUtils.getNewShardCount(10, null, 10, ScaleDirection.DOWN));
    assertEquals(9, StreamScalingUtils.getNewShardCount(9, null, 10, ScaleDirection.DOWN));
}

Our stream will never scale below 9, which might not be desirable if the min shard count is eg 5 and the shard count naturally sits in that range.

Anyway, just my 2 cents. Thanks for clarifying the scaling behaviour!

IanMeyers commented 3 years ago

Hey there - yes that was intentional. I'd rather we not scale down and leave the stream with ample capacity than over-scale and result in throttling. This could be added as a switch to the overall architecture, but I think it's better to be conservative on scaling down - as you find elsewhere with cool-offs in EC2 etc.

awslabs / amazon-kinesis-scaling-utils

The kinesis autoscaler can't scale up by less than double #101