crossplane-contrib / provider-terraform

A @crossplane provider for Terraform

Handle tf state lock during the controller pod restart/upgrade #46

Open · ytsarev opened this issue 2 years ago

ytsarev commented 2 years ago

What problem are you facing?

There are situations in which the controller pod gets restarted, for example during the provider upgrade process. Depending on the timing, this can leave the Terraform state locked and cause the Workspace MR to malfunction in the new pod:

    - lastTransitionTime: "2022-04-07T08:57:04Z"
      message: 'observe failed: cannot diff (i.e. plan) Terraform configuration: error
        acquiring the state lock'
      reason: ReconcileError
      status: "False"
      type: Synced

How could Crossplane help solve your problem?

We need to find a way to gracefully release the Terraform state lock when the controller pod is recreated.

bobh66 commented 2 years ago

@ytsarev I'm seeing this same problem with stuck Lease object holderIdentity values.

Maybe something like this in the pod configuration:

  lifecycle:
    preStop:
      exec:
        command: ["/bin/sh","-c","while killall -HUP terraform; do sleep 1; done"]

I'm going to add it to my ControllerConfig and see if it has any impact, but it will probably take a few days to be confident that it resolves the problem.

bobh66 commented 2 years ago

Unfortunately the ControllerConfig schema doesn't support adding lifecycle hooks to the pods so I can't test this as easily as I thought.

chlunde commented 2 years ago

I wonder if a few other options are possible too:

These are just some ideas, all with the goal of moving this burden away from the N people deploying the operator to the developer of the operator.

I also wonder to what degree the provider-runtime, by default, can and should attempt a clean shutdown when it is not in the middle of a partially complete create/update/delete operation. Maybe in combination with alternative one.

bobh66 commented 2 years ago

@chlunde I'm happy to implement alternative 1 for provider-terraform as it seems like the cleanest option.

Since the controller already sets up a signal handler here:

https://github.com/crossplane-contrib/provider-terraform/blob/master/cmd/provider/main.go#L69

I assume that there must be a way to attach a specific implementation to that signal handler, or to the base context, so that it gets called when the first cancel() is triggered. The killall -HUP terraform can be executed at that point. It may also be a good idea to set a lock/semaphore at that point so no additional work is processed by the reconciliation loop after the first signal is received. The loop would need to check the flag on each iteration to be sure that shutdown has not been requested.

I'm not a go programmer by trade so if there is an example of how to set the signal handler that would be helpful.
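
Something along these lines is what I have in mind - a rough, untested sketch, assuming the manager context comes from ctrl.SetupSignalHandler() as in main.go (the cleanup command and its placement here are just illustrative):

package main

import (
        "os/exec"

        ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
        // SetupSignalHandler returns a context that is cancelled on the first SIGTERM/SIGINT.
        ctx := ctrl.SetupSignalHandler()

        // Hypothetical cleanup hook: once the first signal arrives, ask any running
        // terraform processes to stop before the pod is torn down.
        go func() {
                <-ctx.Done()
                _ = exec.Command("/bin/sh", "-c", "while killall -HUP terraform; do sleep 1; done").Run()
        }()

        // ... build and start the manager with ctx, as cmd/provider/main.go already does.
}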

Thanks

bobh66 commented 2 years ago

The existing code uses os/exec to run terraform CLI commands, and provides a CommandContext so that the exec'd process is killed when the main process is killed:

https://cs.opensource.google/go/go/+/refs/tags/go1.18.3:src/os/exec/exec.go;l=454

This kills the terraform process using Kill(), which sends SIGKILL and terminates the process immediately.

If we're going to give terraform a chance to release the lock then we need to send SIGTERM first, and then SIGKILL if the process doesn't clean up by itself within some interval. os/exec does not appear to support that, so we may need to roll our own implementation.

We may also want to call "terraform force-unlock" on the workspace if the process is killed in order to remove the lock on the workspace.

This might require running the terraform cli command in a goroutine and catching the ctx.Done() signal to trigger the unlock?

ytsarev commented 2 years ago

@bobh66 as I understand it, the process receives SIGTERM first, and SIGKILL is only sent once the terminationGracePeriod is over. The default of 30s is probably just not enough for an average terraform apply to shut down gracefully. All of the above is an assumption that needs runtime proof and debugging.

https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace

bobh66 commented 2 years ago

@ytsarev - if I understand correctly, and I'm not sure that I do, the controller-runtime sets up a signal handler for the manager and when the initial SIGTERM is received by the manager process it calls cancel() on the context here:

https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/manager/signals/signal.go#L39

That seems like it will get consumed by the CommandContext that is running the terraform command:

https://cs.opensource.google/go/go/+/refs/tags/go1.18.3:src/os/exec/exec.go;l=453

and send a SIGKILL signal to the running terraform process:

https://cs.opensource.google/go/go/+/refs/tags/go1.18.3:src/os/exec/exec.go;l=454

I don't see anything in the os/exec code that sends a SIGTERM to the terraform process, which is what we would need to shut down the terraform command cleanly.
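
To illustrate, here is a tiny standalone example (not from the provider code) of how CommandContext behaves when the context is cancelled - the child only ever sees SIGKILL:

package main

import (
        "context"
        "os/exec"
)

func main() {
        ctx, cancel := context.WithCancel(context.Background())
        // CommandContext arranges for the process to be killed (SIGKILL via Kill())
        // when ctx becomes done; it never sends SIGTERM first.
        cmd := exec.CommandContext(ctx, "sleep", "300")
        _ = cmd.Start()
        cancel()       // e.g. the manager shutting down
        _ = cmd.Wait() // returns "signal: killed"; the child had no chance to clean up
}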

I may be missing something, but this is what I can see so far.

ytsarev commented 2 years ago

@bobh66 that makes total sense. I plan to enable more verbose logging and the ability to capture it before attacking this issue, because currently we are operating almost blind.

bobh66 commented 2 years ago

I suspect that the default 20m timeout that the process is running with could also cause a similar issue - if the running terraform command takes more than 20m and gets killed by the timeout, it will result in the same symptoms as when the entire process gets killed. The final solution probably needs to take both scenarios into account.

Also the default terraform SDK timeouts are 20m:

https://www.terraform.io/plugin/sdkv2/resources/retries-and-customizable-timeouts#default-timeouts-and-deadline-exceeded-errors

so maybe the default reconcile process timeout should be slightly longer - 25m? - to allow terraform to time out and clean up after itself before the thread gets killed.
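
For illustration only (this isn't the provider's actual wiring), the idea is just a slightly larger per-reconcile deadline than terraform's own SDK default:

import (
        "context"
        "time"
)

// withReconcileTimeout is a hypothetical helper: 25m leaves room for terraform's
// default 20m SDK timeouts to fire and clean up before the reconcile context is
// cancelled.
func withReconcileTimeout(ctx context.Context) (context.Context, context.CancelFunc) {
        return context.WithTimeout(ctx, 25*time.Minute)
}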

bobh66 commented 2 years ago

@ytsarev - I wonder if there is a way to allow the running terraform commands to complete while still starting a new controller pod? Maybe using a StatefulSet instead of a ReplicaSet? Can we somehow cause the "old" controller pod to stop accepting new work but allow the outstanding terraform commands to run to completion, while starting a new pod and directing all new reconciliation work to it? It's a more complicated upgrade process than the standard Kubernetes controller, but it would avoid the issues with state locks, since the new pod will fail to run on any workspaces that are still locked by the old pod. And it may be easier than trying to orchestrate sending SIGTERM/SIGHUP to the running terraform processes and waiting for them to terminate. It would be similar to treating the terraform commands as batch jobs that are going to run to completion no matter what happens to the controller.

bobh66 commented 2 years ago

Or we could simply use exec.Command instead of exec.CommandContext when calling apply and destroy, which would leave those processes to finish on their own when the controller is killed (I think) - we might also need to set the Pgid to 0 so the child can keep running after its parent dies.

https://github.com/crossplane-contrib/provider-terraform/blob/master/internal/terraform/terraform.go#L457
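
Something like this, perhaps - an untested sketch using plain exec.Command plus SysProcAttr (the terraform arguments are just an example):

import (
        "os/exec"
        "syscall"
)

// startDetached is a hypothetical sketch: start terraform in its own process group
// so it is not killed together with the controller and can run to completion.
func startDetached() (*exec.Cmd, error) {
        cmd := exec.Command("terraform", "apply", "-auto-approve", "-input=false")
        // Setpgid with Pgid 0 puts the child in a new process group led by itself.
        cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true, Pgid: 0}
        return cmd, cmd.Start()
}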

bobh66 commented 2 years ago

@ytsarev - maybe something like this would work? I think we would need to use this for Init, Apply and Destroy since those are the only state-locking commands we use.

import (
        "context"
        "os/exec"
        "syscall"
)

type cmdResult struct {
        out []byte
        err error
}

// runCommand executes the requested command and sends the process SIGTERM if the
// context is done before the command completes.
func runCommand(ctx context.Context, c *exec.Cmd) ([]byte, error) {
        ch := make(chan cmdResult, 1)
        go func() {
                defer close(ch)
                r, e := c.Output()
                ch <- cmdResult{out: r, err: e}
        }()
        select {
        case <-ctx.Done():
                // This could be container termination or the reconcile deadline being
                // exceeded. Either way, send SIGTERM to the running process, then wait
                // for the command to either finish cleanly or get killed.
                _ = c.Process.Signal(syscall.SIGTERM)
                res := <-ch
                return res.out, res.err
        case res := <-ch:
                return res.out, res.err
        }
}

ytsarev commented 2 years ago

@bobh66 that looks promising! It should handle things gracefully even after terminationGracePeriod is over and SIGKILL is sent to the main manager process, is that correct?

bobh66 commented 2 years ago

That's my understanding - after the SIGTERM is sent the child process will either return and the Wait() will return, or it will get SIGKILLed. Either way we at least gave the child process a fighting chance to clean up after itself.

I think it should retrieve ctx.Err() and return that so the caller knows about the error (context.Canceled or context.DeadlineExceeded).

I'll run some tests in our environment and see what happens.

bobh66 commented 2 years ago

This code seems to work as expected, in that it is sending SIGTERM to running terraform processes. It also effectively prevents new terraform commands from running once ctx.Done() is set because it falls through and sends SIGTERM to the command that it just started. I think this should be changed to check ctx.Done() or ctx.Err() before starting the command in the first place and just return an error.
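
i.e. something like this at the top of runCommand, before the goroutine is started (sketch only):

// Bail out before starting a new terraform command if shutdown or the reconcile
// deadline has already been signalled.
if err := ctx.Err(); err != nil {
        return nil, err
}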

The reconciler continues to run during the 30 second grace period and all reconciliations fail in Setup() because Init() fails.

I'm wondering if this should be handled differently in crossplane-runtime or even in controller-runtime - it seems like if ctx.Done() is set then the controller should not process any new reconciliations and should not requeue failures, to allow things to clean themselves up.

It doesn't appear that there is any "exit path" in the code - the controller expects to run continuously and get killed when it should stop; it won't ever stop processing and return cleanly.

If that's the case then maybe when ctx.Done() is set the Reconciler should not requeue, to let the workload drop to 0, or close to 0, so the pod can be killed?

I think this change will do what we want but I think it's also exposing larger questions that need to be investigated.

ulucinar commented 10 months ago

Hi @bobh66, do we know how common remote state sharing is between the Crossplane provider-terraform and other Terraform clients?

For the upjet-based providers, we support an asynchronous reconciliation mode which is configured with the config.Resource.UseAsync argument. This has been the default reconciliation mode for quite a long time. I feel like we need a similar async reconciliation mode for provider-terraform for robustness, in addition to properly handling SIGTERM as you proposed in https://github.com/crossplane-contrib/provider-terraform/pull/76. How do you feel about this?
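
For reference, roughly what that per-resource switch looks like in an upjet-based provider (only config.Resource.UseAsync is taken from the description above; the configurator function and import path here are assumptions):

import "github.com/upbound/upjet/pkg/config"

// configureAsync is a hypothetical per-resource configurator that enables the
// asynchronous reconciliation mode via config.Resource.UseAsync.
func configureAsync(r *config.Resource) {
        r.UseAsync = true
}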

bobh66 commented 10 months ago

Hi @ulucinar - I don't know of any use cases where terraform state is shared.

I agree an async mode would be useful, and if it helps with the SIGTERM case that's an added bonus. I can check whether this issue has an entry in upbound/provider-terraform and create one if it doesn't. Thanks