aws-controllers-k8s / community

AWS Controllers for Kubernetes (ACK) is a project enabling you to manage AWS services from Kubernetes
https://aws-controllers-k8s.github.io/community/
Apache License 2.0

Sagemaker Controller supports UpdateWeights API #1875

Open mwm5945 opened 10 months ago

mwm5945 commented 10 months ago

Is your feature request related to a problem? The SageMaker controller doesn't currently support updating weights and capacities (the UpdateEndpointWeightsAndCapacities API). Currently, to update these values, users need to create a new endpoint config and then update the endpoint to use it.

Describe the solution you'd like: Ideally, updating the Endpoint would take care of this. New fields may be required so that this API is used when the Endpoint spec changes. Updating the EndpointConfig may not be the best option, as multiple Endpoints could be using the same config.

Describe alternatives you've considered: The only available option at this time is to create a new EndpointConfig and update the endpoint to use it. This leaves extraneous EndpointConfigs hanging around if they are not cleaned up, and makes for a less than ideal user experience when A/B testing models.
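For readers unfamiliar with the API, here is a minimal sketch of the request body that UpdateEndpointWeightsAndCapacities takes for a weight-only change. The endpoint and variant names are made up for illustration, and the helper functions are hypothetical. The sketch only builds the payload and does not call AWS.

```python
def build_weight_update(endpoint_name, weights):
    """Build an UpdateEndpointWeightsAndCapacities request body.

    `weights` maps variant name -> desired (relative, non-negative) weight.
    """
    if not weights:
        raise ValueError("at least one variant is required")
    for name, weight in weights.items():
        if weight < 0:
            raise ValueError(f"weight for {name!r} must be non-negative")
    return {
        "EndpointName": endpoint_name,
        "DesiredWeightsAndCapacities": [
            {"VariantName": name, "DesiredWeight": float(weight)}
            for name, weight in weights.items()
        ],
    }


def traffic_share(weights):
    """Fraction of traffic per variant: its weight divided by the sum of weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("sum of weights must be positive")
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical A/B split: 90% to variant-a, 10% to variant-b.
request = build_weight_update("my-endpoint", {"variant-a": 9, "variant-b": 1})

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("sagemaker").update_endpoint_weights_and_capacities(**request)
```

No new EndpointConfig is involved in such a call, which is what makes it attractive for A/B testing.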

a-hilaly commented 10 months ago

/cc @surajkota @aws-controllers-k8s/sagemaker-maintainer

surajkota commented 10 months ago

Hi @mwm5945, thanks for opening the issue. We are aware that the SageMaker controller does not support this functionality; part of that is by design, given the nature of a K8s controller. The UpdateEndpointWeightsAndCapacities API supports updating the weight and instance count (or the concurrency parameters, in the case of a serverless endpoint) for an arbitrary variant. Let's take the case of a real-time endpoint:

  1. The desired instance count on an endpoint can be controlled in two ways: by using the UpdateEndpointWeightsAndCapacities API or by setting up autoscaling on the variant. A K8s controller cannot differentiate between these two events. The controller will try to adjust the instance count to the value in the spec, which can lead to unintended behavior if the variant has autoscaling configured. The same behavior can be achieved with the minCapacity setting of Application Auto Scaling (e.g. via an autoscaling spec), which can be used as an alternative and provides more configuration options for production use cases.
  2. The case for variant weight is similar but slightly different: it is a property of the endpoint config, not of the endpoint itself. We don't want to run into scenarios in the future where the service introduces functionality to adjust this value based on certain configurations. Also, DesiredVariantWeight is an update-only parameter, and the shapes in the describe and update APIs vary significantly, making this hard to maintain.
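To illustrate point 1, here is a toy model (not the actual ACK reconciler) of what happens when a naive controller owns the instance count while autoscaling is also active. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    instance_count: int


def autoscale(variant, target):
    """Stand-in for an autoscaling action changing the instance count."""
    variant.instance_count = target


def reconcile(variant, spec_instance_count):
    """Naive reconciler: force the observed count back to the spec value.

    Returns True if it had to change anything.
    """
    if variant.instance_count != spec_instance_count:
        variant.instance_count = spec_instance_count
        return True
    return False


variant = Variant("variant-a", instance_count=2)  # spec says 2 instances
autoscale(variant, 6)                             # load spike: scaled out to 6
reverted = reconcile(variant, spec_instance_count=2)
# The reconciler sees drift from the spec and undoes the scale-out.
```

The reconciler has no way to tell that the drift came from autoscaling rather than from an out-of-band change, so it scales the variant back in.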

It is actually a better approach to maintain one endpoint config per endpoint. This has two benefits: 1/ you have a safer way to determine whether an endpoint config can be deleted, which saves you from cases where a config still used by a different endpoint gets accidentally deleted; 2/ you can use it to adjust the weights of the variants and maintain that configuration in one place.
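As a rough sketch of the first benefit, assuming you keep (or can list) a mapping from endpoint name to the config it uses, the deletion check is straightforward. The mapping and names below are hypothetical.

```python
def endpoints_using(config_name, endpoint_to_config):
    """Return the sorted names of endpoints that reference the given config."""
    return sorted(
        endpoint
        for endpoint, config in endpoint_to_config.items()
        if config == config_name
    )


def safe_to_delete(config_name, endpoint_to_config):
    """A config is safe to delete only if no endpoint references it."""
    return not endpoints_using(config_name, endpoint_to_config)


# Hypothetical state: two endpoints share one config, a third has its own.
endpoint_to_config = {
    "ep-1": "cfg-shared",
    "ep-2": "cfg-shared",
    "ep-3": "cfg-ep-3",
}

shared_users = endpoints_using("cfg-shared", endpoint_to_config)
```

With a shared config, every endpoint has to be checked before deleting. With one config per endpoint, the check reduces to a single lookup.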

To summarize: use autoscaling to adjust the desired instance count and the EndpointConfig to adjust variant weight. Let us know if this works for your use case.
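For the autoscaling half of that suggestion, a minimal sketch of the Application Auto Scaling scalable-target request for a SageMaker variant might look like the following. The endpoint and variant names are placeholders, the helper is hypothetical, and the code only builds the payload.

```python
def build_scalable_target(endpoint_name, variant_name, min_capacity, max_capacity):
    """Build a RegisterScalableTarget request body for a SageMaker variant."""
    if min_capacity < 0 or max_capacity < min_capacity:
        raise ValueError("require 0 <= min_capacity <= max_capacity")
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }


# Raising MinCapacity acts as a floor on the variant's instance count.
target = build_scalable_target("my-endpoint", "variant-a", min_capacity=2, max_capacity=6)

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("application-autoscaling").register_scalable_target(**target)
```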

Thanks

mwm5945 commented 10 months ago

i see, thanks @surajkota, though one use case is something like a scarce instance type (e.g. p4d), where creating new instances just to update the weights may be difficult or impossible :/

somewhat related, i've created a new issue: https://github.com/aws-controllers-k8s/community/issues/1889

surajkota commented 10 months ago

Synced offline to understand the priority. Will keep this issue open in case the workaround for updating desired weights creates operational complexity, and in case other users are impacted by this.

Thanks

ack-bot commented 4 months ago

Issues go stale after 180d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 60d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/aws-controllers-k8s/community. /lifecycle stale

gecube commented 3 months ago

/remove-lifecycle stale