Closed Nalum closed 2 years ago
Another possible solution would be to have the tf-controller
watching the Lease
s to see if a tf-runner
completes but fails to release the Lease
, the tf-controller
could then force the release of the Lease
, this wouldn't require user intervention and could maybe be a configuration of the tf-controller
or another configuration option of the Terraform
resource.
We found two issues that were the cause of this in two separate cases:
{
"level": "error",
"ts": "2022-06-15T20:14:34.824Z",
"logger": "runner.terraform",
"msg": "error creating the plan",
"error": "exit status 1
Error: Error releasing the state lock
Error message: etcdserver: request timed out
Terraform acquires a lock when accessing your state to prevent others
running Terraform to potentially modify the state at the same time. An
error occurred while releasing this lock. This could mean that the lock
did or did not release properly. If the lock didn't release properly,
Terraform may not be able to run future commands since it'll appear as if
the lock is held.
In this scenario, please call the \"force-unlock\" command to unlock the
state manually. This is a very dangerous operation since if it is done
erroneously it could result in two people modifying state at the same time.
Only call this command if you're certain that the unlock above failed and
that no one else is holding a lock."
}
And
I0615 22:29:32.020709 7 request.go:665] Waited for 1.027123219s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/node.k8s.io/v1?timeout=32s
{\"level\":\"info\",\"ts\":\"2022-06-15T22:29:40.658Z\",\"logger\":\"runner.terraform\",\"msg\":\"creating a plan\"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xd165ea]
goroutine 359 [running]:
github.com/hashicorp/terraform-exec/tfexec.(*Terraform).buildTerraformCmd(0x0, {0xc0005e0280, 0x1}, 0x162e028, {0xc000616400, 0x47d9be, 0x0})
/go/pkg/mod/github.com/tf-controller/terraform-exec@v0.15.1-0.20220114135154-c06c64609ee0/tfexec/cmd.go:174 +0x2a
github.com/hashicorp/terraform-exec/tfexec.(*Terraform).planCmd(0xc0003fc120, {0x1847728, 0xc00063a3c0}, {0xc0003fc130, 0x1, 0x10})
/go/pkg/mod/github.com/tf-controller/terraform-exec@v0.15.1-0.20220114135154-c06c64609ee0/tfexec/plan.go:179 +0xb6f
github.com/hashicorp/terraform-exec/tfexec.(*Terraform).Plan(0x1498420, {0x1847728, 0xc00063a3c0}, {0xc0003fc130, 0xf, 0x0})
/go/pkg/mod/github.com/tf-controller/terraform-exec@v0.15.1-0.20220114135154-c06c64609ee0/tfexec/plan.go:100 +0x31
github.com/weaveworks/tf-controller/runner.(*TerraformRunnerServer).Plan(0xc000ad63f0, {0x18477d0, 0xc0005247e0}, 0xc0003c3270)
/workspace/runner/server.go:384 +0x4e6
github.com/weaveworks/tf-controller/runner._Runner_Plan_Handler({0x15dcdc0, 0xc000ad63f0}, {0x18477d0, 0xc0005247e0}, 0xc0005c15c0, 0x0)
/workspace/runner/runner_grpc.pb.go:489 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0005c8000, {0x1873270, 0xc000603380}, 0xc0004c0000, 0xc000614cc0, 0x23c99c0, 0x0)
/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1282 +0xccf
google.golang.org/grpc.(*Server).handleStream(0xc0005c8000, {0x1873270, 0xc000603380}, 0xc0004c0000, 0x0)
/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1619 +0xa2a
google.golang.org/grpc.(*Server).serveStreams.func1.2()
/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:921 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
/go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:919 +0x294
To expand on the above I'm looking at adding the following to the spec
:
spec:
backendConfig:
state:
forceUnlock: [holder-id]
By default, and following the same path as Terraform, this would require manual intervention by an operator, who would need to put this [holder-id]
in place in the the Terraform
resource. This can be done by changing the resource in the Flux Source
or by running tfctl force-unlock [terraform-resource] [holder-id]
, which would set the value and then trigger a reconcile for that resource.
Does tfctl reconcile
do the reconcile right away or is it queued up and take place X
time after you run the command? Ideally we'd like the tfctl force-unlock
to trigger a reconciliation immediately so that the Terraform
resource doesn't get reset by Flux before the reconcile happens.
I think it also makes sense to allow an automated in which case you would set spec.backendConfig.state.forceUnlock: auto
, this allows for the use case where the assumption is that nothing else will ever be using this state, so we are okay with the tf-controller
automatically unlocking the state. Obviously putting all the warnings around this in the documentation.
@chanwit @tomhuang12 @phoban01 what do you think about the above?
I have changed the above noted structure as it impacts the backendConfig
in a way that seems to complicate it more than I'd like.
So I've moved spec.backendConfig.state
out to spec.terraformState
and changed the structure of this a little to the following:
spec:
terraformState:
forceUnlock: (yes|no|auto)
lockIdentifier: [holder-id]
spec.terraformState.forceUnlock
defaults to no
Thank you @Nalum This design is excellent. It gives us auditable activities for force unlocking.
spec:
terraformState:
forceUnlock: (yes|no|auto)
lockIdentifier: [holder-id]
There are a few things to improve UX.
terraformState
to tfstate
so users would be able to reason that they're working with tfstate
at the moment.forceUnlock=yes
and forceUnlock=auto
? How are they different?Thanks for the feedback 👍
There are a few things to improve UX.
- I would change
terraformState
totfstate
so users would be able to reason that they're working withtfstate
at the moment.
Ah yes, this makes sense, will make this change 👍
- What are the semantics of
forceUnlock=yes
andforceUnlock=auto
? How are they different?
The purpose of auto
is to have the tf-controller
automatically unlock the state, in the case where the user is happy for this to happen, and yes
is a manual trigger, the expectation being that it is triggered from the tfctl
.
This has just made me think about some of the implementation and I need to revisit it.
Closing as #268 has been merged!
We've hit an issue where the State lock was not released due to the runner pod not shutting down correctly (assumption here) due to a node refresh on an EKS Cluster. So we ended up with a Kubernetes
Lease
that was still held.The
Terraform
resource was reporting the following:Which was repeated in the
tf-controller
andtf-runner
logs. We were able to get things running again by deleting theLease
in question.So the ask here is to support the Terraform force-unlock feature.
I think the best approach might be to add a new field to the
TerraformSpec
, e.g.forceUnlock
, the value of which should be theLock Info: ID: [value]
, in the above example this would beforceUnlock: f2ab685b-f84d-ac0b-a125-378a22877e8d
. This ID corresponds to the keyspec.holderIdentity
on aLease
.The
tf-controller
/runner
should be tasked with acting on this, allowing for unlocking other Terraform backends that support locking if they become supported by thetf-controller
.This could then be supported in the
tfctl
e.g.tfctl force-unlock f2ab685b-f84d-ac0b-a125-378a22877e8d
which would set the field on the Terraform resource and force a reconciliation. Is there a potential race condition here?