Previously, we handled this by simply cleaning up everything that had already been removed from the load balancer. This PR adds a few things to the reregistration flow:
- Always enqueue a pending request to reevaluate state/scale, so that we don't end up with multiple tasks at instance number 1 and get stuck there.
- If the task is recoverable in zk (not persisted yet) but has already been removed from the lb, treat it as if it were a new healthy task that just finished passing healthchecks instead of cleaning it up. This way we can recover running tasks, especially after large network partitions, where relaunching that many new tasks could take a long time.
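
A rough sketch of the decision described above, for reviewers who want the flow at a glance. This is not the actual scheduler code: `Scheduler`, `Action`, `on_reregistration`, and the in-memory queues are hypothetical stand-ins, and the behavior for a task that is still in the lb (leave it running) and for a non-recoverable task (clean up as before) is my assumption about the unchanged paths.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Action(Enum):
    KEEP_RUNNING = auto()
    RECOVER_AND_ADD_TO_LB = auto()
    CLEAN_UP = auto()


@dataclass
class Scheduler:
    # Hypothetical in-memory stand-ins for the zk-backed queues.
    pending_requests: list = field(default_factory=list)
    pending_lb_adds: list = field(default_factory=list)
    cleanups: list = field(default_factory=list)

    def on_reregistration(self, request_id, task_id, recoverable, removed_from_lb):
        # Always enqueue a pending request so state/scale gets reevaluated.
        self.pending_requests.append(request_id)

        if not removed_from_lb:
            # Assumption: a task still in the lb needs no special handling.
            return Action.KEEP_RUNNING

        if recoverable:
            # Treat it like a new healthy task that just passed healthchecks:
            # queue an lb add instead of cleaning it up.
            self.pending_lb_adds.append(task_id)
            return Action.RECOVER_AND_ADD_TO_LB

        # Assumption: not recoverable, so fall back to the old cleanup behavior.
        self.cleanups.append(task_id)
        return Action.CLEAN_UP
```

The point of the sketch is that the pending request is enqueued on every path, while the lb pending add only appears for a recoverable task that was already removed from the lb.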
Updated the unit test for this to make sure the lb pending add is present, but would appreciate extra 👀 on it
cc @pschoenfelder @ajammala @rosalind210