[FR] Kinesis based service scaling

smazurov commented 1 year ago

Is your feature request related to a problem? Please describe.

I noticed there is the ability to define scaling based on SQS backlog. Unfortunately, that is currently missing from the Kinesis implementation, and thus the dream is incomplete.

Describe the solution you'd like

Ideally, we can define a scaling target, similarly to SQS, based on how far behind we are in processing or on whether read/write throughput limits are hit.

Describe alternatives you've considered

Slogging it by myself up a hill both ways (custom CF/API calls).

JohnPreston commented 1 year ago

Hello @smazurov, thanks for opening this issue.

I haven't used Kinesis as much as I have used SQS, and SQS was one of the very first services supported, so sorry about that. That said, you can still create alarms with x-alarms that point to your Kinesis data stream, and x-alarms lets you define scaling rules for the services. We use Kafka a lot (it is similar to Kinesis) and scale services based on consumer lag, so I know this temporary solution would work.

See https://docs.compose-x.io/syntax/compose_x/common.html#x-resource-service-scaling-def

What I can certainly also do is create the Scaling section for Kinesis streams, which would automatically create the alarms and scaling steps for you, just like SQS. To make sure the feature meets your needs, can you confirm which metrics you would like to scale the containers on?

Let me know!

smazurov commented 1 year ago

Since that's supported, maybe a doc can cover it. The relevant metrics are probably GetRecords.IteratorAgeMilliseconds, ReadProvisionedThroughputExceeded, and WriteProvisionedThroughputExceeded.

JohnPreston commented 1 year ago

Well, if you find the time to test with x-alarms and report whether it's working or not, that'd be very helpful. Also, do you think there should be a Math function to aggregate these metrics?

For WriteProvisionedThroughputExceeded, my understanding of what to do when you hit this limit is to autoscale the data stream itself and add shards. Same for ReadProvisionedThroughputExceeded, I suppose, but you can't read from shards that haven't been written to, so the producer needs to have more shards to write to.
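To make the "add shards" idea concrete, here is a minimal sketch of how a shard count could be derived from observed throughput on a provisioned-mode stream. It assumes the standard per-shard quotas (1 MB/s or 1,000 records/s for writes, 2 MB/s for shared-throughput reads); the function name and arguments are hypothetical and are not part of ecs_composex.

```python
import math

# Per-shard quotas for provisioned-mode Kinesis data streams (assumed here).
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_s, write_records_s, read_mb_s):
    """Return the smallest shard count that keeps every rate within quota.

    Hypothetical helper for illustration only.
    """
    return max(
        1,  # a stream always has at least one shard
        math.ceil(write_mb_s / WRITE_MB_PER_SHARD),
        math.ceil(write_records_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_s / READ_MB_PER_SHARD),
    )


# Example: reads are the bottleneck here, so they drive the shard count.
print(shards_needed(write_mb_s=3.5, write_records_s=2000, read_mb_s=9.0))
```

Whichever of the three rates needs the most shards wins, which is why a read-heavy workload can require more shards than the producers alone would.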

GetRecords.IteratorAgeMilliseconds does look like the closest equivalent to consumer lag in Kafka, which represents the number of messages a consumer still has to read. Here I suppose a lower value means your consumers are keeping up with the volume of data to consume, if I am correct?

So here I see 2 potential features:

scale the ECS Services based on the volume of messages to read
scale the data stream to add shards when the throughput is exceeded.

Have you got autoscaling on your data stream already?

smazurov commented 1 year ago

Scaling of streams is taken care of by using on-demand Kinesis. I've seen some autoscaling solutions that involve Lambdas on "provisioned" mode streams.

For WriteProvisionedThroughputExceeded, it is per shard, but I think we'd still potentially want to add producers, so that whatever generates the traffic scales out horizontally and new shards are created. Same for read. This is a bit theoretical; the one I would use immediately is iterator age.

JohnPreston commented 1 year ago

Okay, great. I will work on adding scaling based on GetRecords.IteratorAgeMilliseconds as the default way to set up scaling for a data stream, then.

I am going to aim for integrating StepScaling as that'd be my default way, but do you think that TargetTracking might be more appropriate here?

StepScaling is for when you want X containers for a given range that the metric value falls into. TargetTracking is for when you want AWS to auto-compute how many containers are needed to achieve the target.
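The difference between the two approaches can be sketched in a few lines of Python. The step table, the thresholds, and both function names are assumptions made up for illustration. The target-tracking function is a deliberately simplified proportional model and is not AWS's actual algorithm.

```python
import math

# Hypothetical step table: (lower bound of iterator age in ms, container count).
STEPS = [(0, 1), (60_000, 2), (300_000, 4), (900_000, 8)]


def step_scaling_count(iterator_age_ms, steps=STEPS):
    """Pick the container count of the highest step whose lower bound is reached."""
    count = steps[0][1]
    for lower_bound, containers in steps:
        if iterator_age_ms >= lower_bound:
            count = containers
    return count


def target_tracking_estimate(current_count, iterator_age_ms, target_ms, max_count=10):
    """Simplified proportional estimate; NOT how AWS computes it internally.

    Grows the count in proportion to how far the metric is above the target.
    """
    desired = math.ceil(current_count * iterator_age_ms / target_ms)
    return max(1, min(max_count, desired))


print(step_scaling_count(120_000))                   # falls in the 60s-300s step
print(target_tracking_estimate(2, 120_000, 60_000))  # metric is twice the target
```

With step scaling you decide the counts up front for each range. With target tracking you only state the target and leave the count to the scaling service.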

smazurov commented 1 year ago

Hmmm, how would it work against IteratorAge? Inversely (so the older the age, the more "loaded" the service is)? That would work great, I think.

JohnPreston commented 1 year ago

Yeah, that's what I think makes the most sense from reading the docs. I might try it myself when I get around to it, but with x-alarms, if you fill in the dimensions of your data stream and set up Scaling for it to scale your service, that should work.
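As a rough illustration of "filling in the dimensions", the sketch below builds the settings for a CloudWatch alarm on the iterator age of one stream. The keys mirror the CloudWatch PutMetricAlarm parameter names; the function name, alarm name, and default threshold are assumptions, and how these map onto the x-alarms syntax should be checked against the compose-x docs.

```python
def iterator_age_alarm(stream_name, threshold_ms=60_000, period_s=60, evaluation_periods=3):
    """Build CloudWatch alarm settings for a Kinesis stream's iterator age.

    Hypothetical helper: names and defaults are illustrative, not from ecs_composex.
    """
    return {
        "AlarmName": f"{stream_name}-iterator-age-high",  # assumed naming scheme
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        # Stream-level Kinesis metrics are identified by the StreamName dimension.
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Maximum",
        "Period": period_s,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


params = iterator_age_alarm("orders-stream")  # "orders-stream" is a made-up name
print(params["Dimensions"])
```

The alarm fires when the oldest unread record is older than the threshold for several consecutive periods, which is the signal that consumers are falling behind and the service should scale out.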

I will try to share a feature branch this week.

compose-x / ecs_composex

[FR] Kinesis based service scaling #673