coiled / feedback

A place to provide Coiled feedback

Stopping cluster via web UI seems ineffective #49

Closed: mrocklin closed this issue 4 years ago

mrocklin commented 4 years ago

When I hit the trash can icon in the Clusters tab, the status goes to "stopping" for a second and then jumps back to "running".

I'm guessing that things get reset when the next heartbeat comes in. I'm not sure though.

cc @dantheman39 @marcosmoyano

marcosmoyano commented 4 years ago

most likely via: https://github.com/coiled/cloud/blob/master/cloud/preload_scripts/telemetry-preload.py#L207

dantheman39 commented 4 years ago

Yes, I noticed it yesterday. It used to work; I'm not sure when it started misbehaving. I think it does eventually stop, but as Marcos and you pointed out, somewhere a status isn't getting set or is getting reset. (It could be the heartbeat, it could be the periodic callback, or maybe it's never being set to stopping at all.)

I'd like to hide the trash can icon for the time being until we get a chance to fix it.

dantheman39 commented 4 years ago

Not sure that I was clear above: I think it "works" in that the cluster stops, but the status keeps being reset to running until it actually stops.
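To illustrate the kind of reset being described, here is a minimal sketch of a guard that keeps a heartbeat from flipping a stopping cluster back to running. The status names and the `apply_heartbeat` helper are assumptions made for illustration; this is not necessarily how the backend handles it.

```python
# Hypothetical status names and helper, for illustration only.
STOPPING_STATES = {"stopping", "stopped"}


def apply_heartbeat(stored_status, reported_status):
    """Return the status to store after a heartbeat arrives."""
    # A late "running" report must not undo a stop that was already requested.
    if stored_status in STOPPING_STATES and reported_status == "running":
        return stored_status
    return reported_status
```

With this guard, a heartbeat reporting "running" leaves a "stopping" cluster as "stopping", while a pending cluster still moves to "running" as before.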

marcosmoyano commented 4 years ago

I added the status over the heartbeat because the only place where we were setting the cluster to "running" was in the register call of the telemetry plugin, and that was causing issues during tests where a cluster stayed as "pending". Knowing what I know now, I might be able to get rid of it again.

dantheman39 commented 4 years ago

Yeah the scheduler.status is the least reliable of all the statuses :)

marcosmoyano commented 4 years ago

Perhaps we shouldn't change the status in the API view and let the status change happen in the websocket? :thinking:

dantheman39 commented 4 years ago

It would be nice to keep the number of places where we change the status as small as possible, and it was working before. What was the issue you were talking about where a cluster stayed as pending?

marcosmoyano commented 4 years ago

The scale API was fetching the scheduler while it was in the pending state. The websocket then registers and sets the status to running, but after that the scale call saves the scheduler, overwriting the status back to pending. That's why you now see a bunch of update_fields=[...] in https://github.com/coiled/cloud/pull/683. The same thing happens to a bunch of fields ...
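The overwrite described above can be reproduced with a small in-memory sketch. It assumes an ORM-style `save(update_fields=...)` that writes only the named columns, as Django's `Model.save` does; the `Scheduler` class, the `DB` dict, the `workers` field, and `run_scenario` are all hypothetical stand-ins, not the actual Coiled code.

```python
# Hypothetical in-memory stand-in for a database table.
DB = {1: {"status": "pending", "workers": 0}}


class Scheduler:
    """Toy model object that mimics an ORM row loaded into memory."""

    def __init__(self, pk):
        self.pk = pk
        self.__dict__.update(DB[pk])  # snapshot of the row at load time

    def save(self, update_fields=None):
        # Without update_fields every column is written back, stale or not.
        fields = update_fields or list(DB[self.pk])
        for name in fields:
            DB[self.pk][name] = getattr(self, name)


def run_scenario(scale_update_fields):
    DB[1] = {"status": "pending", "workers": 0}
    scale_view = Scheduler(1)  # scale API loads the row while it is pending
    websocket = Scheduler(1)   # websocket registration loads the same row

    websocket.status = "running"
    websocket.save(update_fields=["status"])

    scale_view.workers = 4     # scale API only means to change the worker count
    scale_view.save(update_fields=scale_update_fields)
    return dict(DB[1])


print(run_scenario(None))         # full save: stale "pending" overwrites "running"
print(run_scenario(["workers"]))  # restricted save: "running" survives
```

Restricting each save to the fields that the caller actually changed keeps a stale in-memory copy from clobbering columns that another code path updated in the meantime.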

dantheman39 commented 4 years ago

The backend fix for this went in a few weeks ago. Re-enabling stopping in the frontend has been merged to master and should be deployed within days.