Closed: mrocklin closed this issue 4 years ago
Yes, I noticed it yesterday. It used to work; I'm not sure when it started misbehaving. I think it does eventually stop, but as Marcos and you pointed out, somewhere a status isn't getting set or is getting reset. (Could be the heartbeat, could be the periodic callback, or maybe it's never being set to stopping at all.)
I'd like to hide it for the time being until we get a chance to fix it.
Not sure that I was clear above: I think it "works" in that the cluster does stop, but the status keeps being reset to running until it actually stops.
I added the status update on the heartbeat because the only place we were setting the cluster to "running" was in the register call of the telemetry plugin, and that was causing issues during tests where a cluster stayed as "pending". Knowing what I know now, I might be able to get rid of it again.
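To make the suspected failure mode concrete, here's a minimal sketch (the names `handle_heartbeat`, `Cluster`, etc. are made up for illustration, not our actual code): if the heartbeat handler unconditionally sets the status to running, it will keep overwriting "stopping" on every beat until the scheduler actually goes away.

```python
# Illustrative sketch only, not the real handler. Assumes a Django-style
# Cluster model instance with a `status` field and the usual .save() method.

PENDING = "pending"
RUNNING = "running"
STOPPING = "stopping"


def handle_heartbeat(cluster):
    """What I suspect is happening: every heartbeat marks the cluster running,
    so a cluster that was just set to "stopping" bounces back to "running"."""
    cluster.status = RUNNING
    cluster.save(update_fields=["status"])


def handle_heartbeat_guarded(cluster):
    """A guarded variant: only promote pending -> running, never undo a stop."""
    if cluster.status == PENDING:
        cluster.status = RUNNING
        cluster.save(update_fields=["status"])
```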
Yeah the scheduler.status is the least reliable of all the statuses :)
Perhaps we shouldn't change the status in the API view and should instead let the status change happen in the websocket? :thinking:
It would be nice to keep the number of places we change the status as small as possible, and it was working before. What was the issue you were talking about where a cluster stayed as pending?
The scale API was fetching the scheduler in the pending state; the websocket would then register and set the status to running, but the scale call would save the scheduler it had already loaded, overriding the status back to pending (sketched below). That's why you now see a bunch of update_fields=[...]
in https://github.com/coiled/cloud/pull/683
The same thing happens to a bunch of fields ...
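In case it helps anyone reading along, a minimal sketch of that lost-update pattern and the `update_fields` fix (the names here are invented for illustration; see the PR above for the real changes):

```python
# Illustrative only: pretend `scheduler` is a Django model instance with
# `status` and `desired_workers` fields (names made up for this example).

def scale(scheduler, n):
    # The view loaded this row while it was still "pending".
    # ... meanwhile, the websocket register handler saves status="running" ...

    scheduler.desired_workers = n

    # scheduler.save()  # a bare save() writes ALL fields from this stale
    #                   # in-memory copy, clobbering status back to "pending"

    # Restricting the write to the fields this view actually changed
    # leaves the concurrently-updated status alone:
    scheduler.save(update_fields=["desired_workers"])
```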
The backend fix for this went in a few weeks ago; re-enabling stopping in the frontend has been merged to master and should be deployed within days.
When I hit the trash can icon in the Clusters tab, the status goes to "stopping" for a second and then jumps back to "running".
I'm guessing that things get reset when the next heartbeat comes in. I'm not sure though.
cc @dantheman39 @marcosmoyano