Open ospillinger opened 4 years ago
I have turned off rolling updates using the documentation's recommendation of max_surge: 0, and I keep ending up with compute unavailable statuses. The only way I have found to resolve this is to run cortex cluster down and then cortex cluster up, after which it starts working again. This is obviously not ideal for deploying updates. I'm unsure why this is happening with my system: I'm using python, distilgpt-2 with m5.large, and min/max instances set to 1.
@ariel-frischer Do you see any useful info in cortex logs <api_name>? If not, the next time this happens, do you mind running cortex cluster info --debug and sending the resulting zip file (which contains the full cluster state) to dev@cortex.dev? We'd be happy to take a look to see what's going on!
@deliahu Usually the logs just stop updating, or don't show anything at all. I will send you guys the zip file when I come across this again. Thank you for the support!
Description
When running cortex deploy in a situation where a rolling update cannot be performed (based on max_surge / max_unavailable), respond with an error rather than reaching a deadlocked state. In the error, mention how to resolve it (increase max_instances or update update_strategy).
Delete the relevant section in the "stuck updating" guide.
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment
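As a rough illustration of the check proposed above, here is a minimal Python sketch. It is not Cortex's implementation: the function names, the spare_replica_slots parameter (how many extra replicas the cluster could still schedule, counting any headroom up to max_instances), and the error wording are all hypothetical. The rounding follows the linked Kubernetes documentation, where a percentage max_surge rounds up and a percentage max_unavailable rounds down.

```python
from typing import Optional, Union


def resolve(value: Union[int, str], replicas: int, round_up: bool) -> int:
    """Turn an absolute count or a percentage string like "25%" into a replica count."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        if round_up:
            # Integer ceiling division, avoids floating point rounding surprises.
            return -(-percent * replicas // 100)
        return percent * replicas // 100
    return int(value)


def rolling_update_error(
    replicas: int,
    max_surge: Union[int, str],
    max_unavailable: Union[int, str],
    spare_replica_slots: int,  # hypothetical input: extra replicas the cluster can still fit
) -> Optional[str]:
    """Return an error message if the update cannot make progress, else None."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)

    # An old replica may be taken down first, which frees room for a new one.
    if unavailable > 0:
        return None
    # A new replica may be started first, but only if there is room for it.
    if surge > 0 and spare_replica_slots > 0:
        return None

    return (
        "rolling update cannot be performed: no old replica may be removed "
        "and there is no capacity for a new one; increase max_instances "
        "or update update_strategy"
    )
```

For the setup described in the thread (one replica on a cluster limited to one instance), calling rolling_update_error(1, "25%", "25%", 0) returns the error message in this sketch, while rolling_update_error(1, 0, 1, 0) returns None because the old replica may be taken down before its replacement starts.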