Terraform State Lock not released

flux-iac / tofu-controller

A GitOps OpenTofu and Terraform controller for Flux

https://flux-iac.github.io/tofu-controller/

Apache License 2.0

1.26k stars 136 forks source link

Terraform State Lock not released #250

Closed Nalum closed 2 years ago

Nalum commented 2 years ago

We've hit an issue where the State lock was not released due to the runner pod not shutting down correctly (assumption here) due to a node refresh on an EKS Cluster. So we ended up with a Kubernetes Lease that was still held.

$ kubectl -n flux-system get lease
NAMESPACE     NAME                      HOLDER                                AGE
...
flux-system   lock-tfstate-default-sg1                                        3d7h
flux-system   lock-tfstate-default-sg2  f2ab685b-f84d-ac0b-a125-378a22877e8d  3d7h

The Terraform resource was reporting the following:

error running Plan: rpc error: code = Unknown desc = error acquiring the state lock: Lock Info: ID: f2ab685b-f84d-ac0b-a125-378a22877e8d Path: Operation: OperationTypePlan Who: controller@sg2-tf-runner Version: 1.1.9 Created: 2022-06-15 22:27:10.244794577 +0000 UTC

Which was repeated in the tf-controller and tf-runner logs. We were able to get things running again by deleting the Lease in question.

So the ask here is to support the Terraform force-unlock feature.

I think the best approach might be to add a new field to the TerraformSpec, e.g. forceUnlock, the value of which should be the Lock Info: ID: [value], in the above example this would be forceUnlock: f2ab685b-f84d-ac0b-a125-378a22877e8d. This ID corresponds to the key spec.holderIdentity on a Lease.

The tf-controller/runner should be tasked with acting on this, allowing for unlocking other Terraform backends that support locking if they become supported by the tf-controller.

This could then be supported in the tfctl e.g. tfctl force-unlock f2ab685b-f84d-ac0b-a125-378a22877e8d which would set the field on the Terraform resource and force a reconciliation. Is there a potential race condition here?

Nalum commented 2 years ago

Another possible solution would be to have the tf-controller watching the Leases to see if a tf-runner completes but fails to release the Lease, the tf-controller could then force the release of the Lease, this wouldn't require user intervention and could maybe be a configuration of the tf-controller or another configuration option of the Terraform resource.

Nalum commented 2 years ago

We found two issues that were the cause of this in two separate cases:

{
  "level": "error",
  "ts": "2022-06-15T20:14:34.824Z",
  "logger": "runner.terraform",
  "msg": "error creating the plan",
  "error": "exit status 1
    Error: Error releasing the state lock
    Error message: etcdserver: request timed out
    Terraform acquires a lock when accessing your state to prevent others
    running Terraform to potentially modify the state at the same time. An
    error occurred while releasing this lock. This could mean that the lock
    did or did not release properly. If the lock didn't release properly,
    Terraform may not be able to run future commands since it'll appear as if
    the lock is held.

    In this scenario, please call the \"force-unlock\" command to unlock the
    state manually. This is a very dangerous operation since if it is done
    erroneously it could result in two people modifying state at the same time.
    Only call this command if you're certain that the unlock above failed and
    that no one else is holding a lock."
}

And

I0615 22:29:32.020709       7 request.go:665] Waited for 1.027123219s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/node.k8s.io/v1?timeout=32s
{\"level\":\"info\",\"ts\":\"2022-06-15T22:29:40.658Z\",\"logger\":\"runner.terraform\",\"msg\":\"creating a plan\"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xd165ea]

goroutine 359 [running]:
github.com/hashicorp/terraform-exec/tfexec.(*Terraform).buildTerraformCmd(0x0, {0xc0005e0280, 0x1}, 0x162e028, {0xc000616400, 0x47d9be, 0x0})
  /go/pkg/mod/github.com/tf-controller/terraform-exec@v0.15.1-0.20220114135154-c06c64609ee0/tfexec/cmd.go:174 +0x2a
github.com/hashicorp/terraform-exec/tfexec.(*Terraform).planCmd(0xc0003fc120, {0x1847728, 0xc00063a3c0}, {0xc0003fc130, 0x1, 0x10})
  /go/pkg/mod/github.com/tf-controller/terraform-exec@v0.15.1-0.20220114135154-c06c64609ee0/tfexec/plan.go:179 +0xb6f
github.com/hashicorp/terraform-exec/tfexec.(*Terraform).Plan(0x1498420, {0x1847728, 0xc00063a3c0}, {0xc0003fc130, 0xf, 0x0})
  /go/pkg/mod/github.com/tf-controller/terraform-exec@v0.15.1-0.20220114135154-c06c64609ee0/tfexec/plan.go:100 +0x31
github.com/weaveworks/tf-controller/runner.(*TerraformRunnerServer).Plan(0xc000ad63f0, {0x18477d0, 0xc0005247e0}, 0xc0003c3270)
  /workspace/runner/server.go:384 +0x4e6
github.com/weaveworks/tf-controller/runner._Runner_Plan_Handler({0x15dcdc0, 0xc000ad63f0}, {0x18477d0, 0xc0005247e0}, 0xc0005c15c0, 0x0)
  /workspace/runner/runner_grpc.pb.go:489 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0005c8000, {0x1873270, 0xc000603380}, 0xc0004c0000, 0xc000614cc0, 0x23c99c0, 0x0)
  /go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1282 +0xccf
google.golang.org/grpc.(*Server).handleStream(0xc0005c8000, {0x1873270, 0xc000603380}, 0xc0004c0000, 0x0)
  /go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1619 +0xa2a
google.golang.org/grpc.(*Server).serveStreams.func1.2()
  /go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:921 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
  /go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:919 +0x294

Nalum commented 2 years ago

To expand on the above I'm looking at adding the following to the spec:

spec:
  backendConfig:
    state:
      forceUnlock: [holder-id]

By default, and following the same path as Terraform, this would require manual intervention by an operator, who would need to put this [holder-id] in place in the the Terraform resource. This can be done by changing the resource in the Flux Source or by running tfctl force-unlock [terraform-resource] [holder-id], which would set the value and then trigger a reconcile for that resource.

Does tfctl reconcile do the reconcile right away or is it queued up and take place X time after you run the command? Ideally we'd like the tfctl force-unlock to trigger a reconciliation immediately so that the Terraform resource doesn't get reset by Flux before the reconcile happens.

I think it also makes sense to allow an automated in which case you would set spec.backendConfig.state.forceUnlock: auto, this allows for the use case where the assumption is that nothing else will ever be using this state, so we are okay with the tf-controller automatically unlocking the state. Obviously putting all the warnings around this in the documentation.

@chanwit @tomhuang12 @phoban01 what do you think about the above?

Nalum commented 2 years ago

I have changed the above noted structure as it impacts the backendConfig in a way that seems to complicate it more than I'd like.

So I've moved spec.backendConfig.state out to spec.terraformState and changed the structure of this a little to the following:

spec:
  terraformState:
    forceUnlock: (yes|no|auto)
    lockIdentifier: [holder-id]

spec.terraformState.forceUnlock defaults to no

chanwit commented 2 years ago

Thank you @Nalum This design is excellent. It gives us auditable activities for force unlocking.

spec:
  terraformState:
    forceUnlock: (yes|no|auto)
    lockIdentifier: [holder-id]

There are a few things to improve UX.

I would change terraformState to tfstate so users would be able to reason that they're working with tfstate at the moment.
What are the semantics of forceUnlock=yes and forceUnlock=auto? How are they different?

Nalum commented 2 years ago

Thanks for the feedback 👍

There are a few things to improve UX.

I would change terraformState to tfstate so users would be able to reason that they're working with tfstate at the moment.

Ah yes, this makes sense, will make this change 👍

What are the semantics of forceUnlock=yes and forceUnlock=auto? How are they different?

The purpose of auto is to have the tf-controller automatically unlock the state, in the case where the user is happy for this to happen, and yes is a manual trigger, the expectation being that it is triggered from the tfctl.

This has just made me think about some of the implementation and I need to revisit it.

Nalum commented 2 years ago

Closing as #268 has been merged!