cortexlabs / cortex

Production infrastructure for machine learning at scale
https://cortexlabs.com/
Apache License 2.0

Prevent or warn about deadlocked rolling updates #644

Open ospillinger opened 4 years ago

ospillinger commented 4 years ago

Description

When running cortex deploy in a situation where a rolling update cannot be performed (based on max_surge / max_unavailable), respond with an error rather than reaching a deadlocked state. The error should mention how to resolve it (increase max_instances or update update_strategy).
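For illustration, here is a hedged sketch of the kind of API configuration that can hit this deadlock (field names follow the Cortex API spec of roughly this era; the exact schema, defaults, and values are assumptions, not taken from this issue):

```yaml
# cortex.yaml -- illustrative sketch only
- name: my-api               # hypothetical API name
  predictor:
    type: python
    path: predictor.py
  compute:
    cpu: 1
  autoscaling:
    min_replicas: 1
    max_replicas: 1
  update_strategy:
    max_surge: 25%           # rounds up to 1 extra replica during an update
    max_unavailable: 0       # the old replica cannot be removed until the new one is ready
```

If the cluster is already at max_instances, the extra replica required by max_surge can never be scheduled, so the update stalls instead of failing with an actionable error.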

Delete the relevant section in the "stuck updating" guide

https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment
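For reference, Cortex's max_surge / max_unavailable map onto the rollingUpdate strategy of the underlying Kubernetes Deployment described in that page. A minimal sketch (not the manifest Cortex actually generates):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api                # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                  # one extra pod may be created above the desired count
      maxUnavailable: 0            # no existing pod may be removed until its replacement is ready
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: api
          image: example/api:latest   # placeholder image
```

The rollout only progresses if the surge pod can actually be scheduled; when the cluster has no spare capacity, the Deployment waits indefinitely rather than erroring, which is why this issue asks Cortex to detect the situation up front.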

ariel-frischer commented 3 years ago

I have turned off rolling updates by setting max_surge: 0 as recommended in the documentation, but I keep running into compute unavailable statuses. The only way I've found to resolve this is to run cortex cluster down and then cortex cluster up, after which it starts working again... This is obviously not ideal for deploying updates. I'm not sure why this is happening on my system; I'm using the Python predictor with distilgpt-2 on m5.large, with min/max instances set to 1.
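For context, the relevant part of my API spec looks roughly like this (paraphrased from memory, so the exact schema and values may differ):

```yaml
# sketch of the settings described above, not the actual file
- name: generator            # hypothetical API name
  predictor:
    type: python
    path: predictor.py       # serves distilgpt-2
  compute:
    cpu: 1                   # single m5.large instance, min/max instances = 1
  autoscaling:
    min_replicas: 1
    max_replicas: 1
  update_strategy:
    max_surge: 0             # rolling updates disabled per the documentation's recommendation
```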

deliahu commented 3 years ago

@ariel-frischer Do you see any useful info in cortex logs <api_name>? If not, the next time this happens, do you mind running cortex cluster info --debug, and sending the resulting zip file (which contains the full cluster state) to dev@cortex.dev? We'd be happy to take a look to see what's going on!

ariel-frischer commented 3 years ago

@deliahu Usually the logs just stop updating, or don't show anything at all. I will send you guys the zip file when I come across this again. Thank you for the support!