Closed — dcfranca closed this issue 1 year ago.
> I wonder if due to my state using 0.13.6 it might be an issue
You're right. We use Terraform 1.1.9 so the different state version would make it incompatible.
> an error looking up for the runner

This error is not harmful. We'll take care of it in upcoming releases.
> Waited for 1.043722403s due to client-side throttling,
This might be because you used a 1m interval, which stresses the control plane a bit. You could try a bigger value.
Thanks for the quick reply @chanwit. I have changed the intervals (GitRepo: 5m, Terraform: 10m), updated the Terraform state version to 1.1.9, and deleted and recreated the tf-runner and Terraform resources, but the error is still there and the log message is still the same:

```
I0927 14:08:09.063345 7 request.go:601] Waited for 1.043137847s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/tyk.tyk.io/v1alpha1?timeout=32s
```
Maybe it is related, but I also have the issue mentioned here: I always need to remove the finalizers manually to be able to delete the resource. However, the Service Account seems to be there, for both the tf-controller and the tf-runner.
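For reference, manually clearing finalizers on a stuck Terraform object is usually done with a merge patch that empties the finalizer list. A sketch to run against your own cluster; `helloworld` is an illustrative resource name:

```shell
# Empty the finalizers list so Kubernetes can complete the deletion.
# Substitute your own Terraform resource name and namespace.
kubectl patch terraform helloworld -n flux-system \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```

Note that this skips whatever cleanup the finalizer was guarding, so it should be a last resort rather than a routine step.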
Thank you @dcfranca. Right, we haven't implemented flow control for priority and fairness yet; we should do that.
@chanwit Sorry, I'm not following, is there anything I can do to fix or better debug the issue?
Something similar to this: https://github.com/fluxcd/pkg/blob/412e8abfcdec9b2988b4116675f9a2180cfb28d3/runtime/client/client.go
Do you mean that it needs to be implemented on tf-controller? Is there a workaround I can do in the meantime? Or how can I handle that from my side?
> Do you mean that it needs to be implemented on tf-controller?

Yes.

> Is there a workaround I can do in the meantime?

I'm afraid there's no workaround for the client-side throttling. You would only get these messages; they won't make the controller stop working.
Are you able to get everything up and running for now?
@chanwit No, I'm in the same situation... the Terraform resource is stuck in Reconciliation in progress, and the last log entry on the runner is the same one I shared here.
I see. Both errors might not be directly related to your problem. How did you set up the AWS credentials to access your S3 backend, please?
The authentication is via KIAM, with the pod annotated with iam.amazonaws.com/role. The error also happens with the local backend.
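For context, KIAM picks the role up from a pod annotation, so it has to land on the runner pod rather than on the controller. Assuming your tf-controller version supports `runnerPodTemplate` on the Terraform object (field availability varies by release, and the role ARN below is illustrative), a sketch:

```yaml
apiVersion: infra.contrib.fluxcd.io/v1alpha1
kind: Terraform
metadata:
  name: helloworld
  namespace: flux-system
spec:
  # ... interval, path, sourceRef as in the manifest below ...
  runnerPodTemplate:
    metadata:
      annotations:
        # Illustrative role; KIAM intercepts the runner's metadata-API
        # calls and vends credentials for this role.
        iam.amazonaws.com/role: arn:aws:iam::123456789012:role/terraform-runner
```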
@chanwit Running the tf-controller with debug-level logging gives some extra information:
```
{"level":"error","ts":"2022-09-28T09:15:42.760Z","logger":"controller.terraform","msg":"unable to lookup or create runner","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"saas-github","namespace":"flux-system","error":"context deadline exceeded","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2022-09-28T09:15:42.760Z","logger":"controller.terraform","msg":"Reconciler error","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"saas-github","namespace":"flux-system","error":"context deadline exceeded","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":"2022-09-28T09:15:42.760Z","logger":"controller.terraform","msg":"getting source","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"helloworld","namespace":"flux-system"}
{"level":"info","ts":"2022-09-28T09:15:42.760Z","logger":"controller.terraform","msg":"trigger namespace tls secret generation","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"helloworld","namespace":"flux-system"}
{"level":"info","ts":"2022-09-28T09:15:42.760Z","logger":"cert-rotation","msg":"TLS already generated for ","namespace":"flux-system"}
{"level":"info","ts":"2022-09-28T09:15:42.760Z","logger":"controller.terraform","msg":"show runner pod state: ","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"helloworld","namespace":"flux-system","name":"helloworld","state":"running"}
{"level":"error","ts":"2022-09-28T09:16:12.761Z","logger":"controller.terraform","msg":"unable to lookup or create runner","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"helloworld","namespace":"flux-system","error":"context deadline exceeded","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2022-09-28T09:16:12.761Z","logger":"controller.terraform","msg":"Reconciler error","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"helloworld","namespace":"flux-system","error":"context deadline exceeded","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:227"}
```
We don't support the local backend, as we have never tested it inside the cluster.
What about the standard setup with the default Kubernetes backend? Does it give any error?
> what's about the standard setup with the default Kubernetes backend? Does it give any error?
Sorry, I meant that I removed the backend configuration, so I assume it is using the default Kubernetes one?
```yaml
apiVersion: infra.contrib.fluxcd.io/v1alpha1
kind: Terraform
metadata:
  name: helloworld
  namespace: flux-system
spec:
  interval: 10m
  approvePlan: "auto"
  path: ./terraform
  sourceRef:
    kind: GitRepository
    name: helloworld
    namespace: flux-system
```
According to the logs, it looks as if no IP was assigned to the runner pod. Do you have any additional info about your cluster? For example, how many seconds does it take your cluster to bring up Pods and other workloads in a normal situation?
@chanwit Where do you see in the logs that there is no IP assigned to the pod?
I can list the pods on my cluster and they have IPs; IP assignment usually happens very fast:

```
helloworld-tf-runner            ● 1/1  0  Running  1  10  n/a  n/a  n/a  n/a  10.10.150.55
tf-controller-xxxxxxxxxxxxxxx   ● 1/1  0  Running  4  34  2    0    53   3    10.10.146.246
```
Hi @chanwit, thanks for your attention on this issue. I'm on the same team as @dcfranca, debugging this with him. I see that there is some sort of mTLS connection between the tf-controller pod and the runner pod. Is that true? In the Helm chart, I also see a spot for a policy-agent-tls secret to be inserted into the tf-controller Deployment, but it's commented out. Are these the certs that the tf-controller pod needs to establish an mTLS connection with the runner? https://github.com/weaveworks/tf-controller/blob/main/charts/tf-controller/values.yaml#L67-L76

If this is accurate, how are the certs for the tf-controller meant to be created? Is that an exercise left to the user?
@dcfranca I see. The log should have said that it was not able to connect instead, then. Thank you for confirming that for me.
Hi @jwolski2 Yes, there's an mTLS connection between the controller and each runner. A runner serves on port 30000, and the controller creates an mTLS/gRPC client to connect to it. All CAs and certificates for mTLS are generated by the controller itself, so users don't need to manage them. The policy-agent-tls bits you've seen in the values file are not related to the standard setup.
My only conclusion for now is that something prevents the communication between the controller and the runners in your cluster, probably some kind of security group or network policy.
We did some internal testing and found that the cause of this problem was network policies. We have put together preflight checks in the getting started guide: https://weaveworks.github.io/tf-controller/getting_started/
I'll close this issue for now. Please re-open it if the problem persists.
Our problem didn't actually have to do with network policy. One thing for certain was that this line https://github.com/weaveworks/tf-controller/blob/main/controllers/terraform_controller.go#L1912 was tripping us up. There's a hidden assumption in the implementation of GetRunnerHostname that pod IPs are resolvable from their fully qualified names. Our CoreDNS was not set up for that. As soon as we added the pods verified directive to CoreDNS, we got past some errors and onto some others:
```
kubernetes cluster.local {
    pods verified
}
```
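To make the assumption concrete: with the `pods verified` (or `pods insecure`) option enabled, CoreDNS answers pod A records of the form `<dashed-ip>.<namespace>.pod.<cluster-domain>`, and in verified mode only for IPs that actually belong to a pod in that namespace. A small sketch of the name construction; `podDNSName` is an illustrative helper, not tf-controller's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// podDNSName builds the pod A-record name that CoreDNS serves when the
// kubernetes plugin's `pods` option is enabled: dots in the pod IP become
// dashes, followed by the namespace, "pod", and the cluster domain.
func podDNSName(podIP, namespace, clusterDomain string) string {
	return fmt.Sprintf("%s.%s.pod.%s",
		strings.ReplaceAll(podIP, ".", "-"), namespace, clusterDomain)
}

func main() {
	// The runner pod IP from the listing above.
	fmt.Println(podDNSName("10.10.150.55", "flux-system", "cluster.local"))
	// → 10-10-150-55.flux-system.pod.cluster.local
}
```

If the controller dials the runner by a name like this and CoreDNS lacks the `pods` option, resolution fails and the lookup times out with exactly the kind of "context deadline exceeded" seen in the logs above.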
We're stumbling across a few other errors, now having more to do with assuming roles from within our Terraform code that uses S3 backends. But we're making progress and are on our way 🚀.
Thank you for sharing, @jwolski2. That DNS bit is new to me.
Hey everyone,
I'm just starting to use Flux + the TF controller, so please let me know if I'm missing some basic step. I have it apparently running properly for Kustomization; however, the tf-controller keeps hanging in Reconciliation in progress.

I have created the GitRepo resource, pointing to my repository and the correct path. The resource was created successfully and is in the Ready state, but the Terraform resource is not, and stays forever in Reconciliation in progress.
Here are my manifests (the Terraform state is an existing one):
In the flux-system namespace I see those pods in a running state. And from the logs of the tf-controller, I see an error looking up the runner, but I don't see any more information on why it happened or what is missing. I also looked through the documentation and can't find more details of things that I need to set up.
If I check the logs of the saas-github-tf-runner, the last activity was hours ago. I also saw that at the moment there's no way to set the Terraform version; I wonder if my state using 0.13.6 might be an issue.
Any idea what I'm missing?