Stuck in "Reconciliation in progress" state

dcfranca commented 1 year ago

Hey everyone,

I'm just getting started with Flux + TF controller, so please let me know if I'm missing some basic step. It appears to be running properly for Kustomization; however, the tf-controller keeps hanging in "Reconciliation in progress".

I have created the GitRepository resource, pointing to my repository and the correct path. It was created successfully and is in the Ready state, but the Terraform resource is not: it stays in "Reconciliation in progress" forever.

Here are my manifests (the Terraform state is an existing one):

apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: myrepo
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/MyOrg/MyRepo.git
  ref:
    branch: master
  secretRef:
    name: flux-system
---
apiVersion: infra.contrib.fluxcd.io/v1alpha1
kind: Terraform
metadata:
  name: saas-github
  namespace: flux-system
spec:
  interval: 1m
  approvePlan: "disable"
  backendConfig:
    customConfiguration: |
      backend "s3" {
        bucket                      = "my-state-bucket"
        key                         = "my-bucket-key"
        region                      = "eu-west-1"
        dynamodb_table              = "lock-table"
        role_arn                    = "arn:aws:iam::XXXXXXX:role/role"
        encrypt                     = true
      }
  path: ./terraform/path
  sourceRef:
    kind: GitRepository
    name: myrepo
    namespace: flux-system

In the flux-system namespace I see these pods in the Running state:

helm-controller-xxxx
kustomize-controller-xxxx
notification-controller-xxxx
saas-github-tf-runner
source-controller-xxxxx
tf-controller-xxxxx

And from the logs of tf-controller:

{"level":"info","ts":"2022-09-27T11:46:43.650Z","logger":"controller.terraform","msg":"getting source","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"saas-github","namespace":"flux-system"}
{"level":"info","ts":"2022-09-27T11:46:43.656Z","logger":"controller.terraform","msg":"trigger namespace tls secret generation","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"saas-github","namespace":"flux-system"}
{"level":"info","ts":"2022-09-27T11:46:43.656Z","logger":"cert-rotation","msg":"TLS already generated for ","namespace":"flux-system"}
{"level":"info","ts":"2022-09-27T11:46:43.657Z","logger":"controller.terraform","msg":"show runner pod state: ","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"saas-github","namespace":"flux-system","name":"saas-github","state":"running"}
{"level":"error","ts":"2022-09-27T11:47:13.657Z","logger":"controller.terraform","msg":"unable to lookup or create runner","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"saas-github","namespace":"flux-system","error":"context deadline exceeded"}
{"level":"error","ts":"2022-09-27T11:47:13.658Z","logger":"controller.terraform","msg":"Reconciler error","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"saas-github","namespace":"flux-system","error":"context deadline exceeded"}
{"level":"info","ts":"2022-09-27T11:47:13.658Z","logger":"controller.terraform","msg":"getting source","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"saas-github","namespace":"flux-system"}
{"level":"info","ts":"2022-09-27T11:47:13.658Z","logger":"controller.terraform","msg":"trigger namespace tls secret generation","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"saas-github","namespace":"flux-system"}
{"level":"info","ts":"2022-09-27T11:47:13.658Z","logger":"cert-rotation","msg":"TLS already generated for ","namespace":"flux-system"}
{"level":"info","ts":"2022-09-27T11:47:13.658Z","logger":"controller.terraform","msg":"show runner pod state: ","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"saas-github","namespace":"flux-system","name":"saas-github","state":"running"}

I see an error looking up the runner, but no further information on why it happened or what is missing. I also looked through the documentation and can't find more details on what I need to set up.

Looking at the logs of saas-github-tf-runner, the last activity was hours ago:

I0927 09:15:38.384584       7 request.go:601] Waited for 1.043722403s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/discovery.k8s.io/v1beta1?timeout=32s

I also saw that at the moment there's no way to set the Terraform version. I wonder if due to my state using 0.13.6 it might be an issue.

Any idea what I'm missing?

chanwit commented 1 year ago

I wonder if due to my state using 0.13.6 it might be an issue

You're right. We use Terraform 1.1.9 so the different state version would make it incompatible.

an error looking up the runner

This error is not harmful. We'll take care of it in coming releases.

Waited for 1.043722403s due to client-side throttling,

This might be because you used a 1m interval, which stresses the control plane a bit. Maybe try a larger value.
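For anyone who wants to confirm which Terraform version last wrote an existing state file: state files are JSON and record that version in a top-level terraform_version field. Below is a minimal Python sketch, not part of tf-controller. The function names and the trimmed sample state are made up for illustration, and 1.1.9 is simply the version mentioned above.

```python
import json


def state_terraform_version(state_text):
    """Return the terraform_version string recorded in a Terraform state document."""
    return json.loads(state_text).get("terraform_version")


def parse_version(version):
    """Turn a plain 'X.Y.Z' string into a tuple of ints (no pre-release suffixes handled)."""
    return tuple(int(part) for part in version.split("."))


# Hypothetical, heavily trimmed state document, for illustration only.
sample_state = '{"version": 4, "terraform_version": "0.13.6", "resources": []}'

written_by = state_terraform_version(sample_state)
controller_version = "1.1.9"  # the version quoted in this thread

# Prints the recorded version and whether it is older than the controller's.
print(written_by, parse_version(written_by) < parse_version(controller_version))
```

This only reports the recorded version; whether a given mismatch actually matters is a separate question.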

dcfranca commented 1 year ago

Thanks for the quick reply, @chanwit. I have changed the intervals to 5m for the GitRepository and 10m for the Terraform resource, and updated the Terraform state version to 1.1.9.

I deleted the tf-runner and Terraform resources and recreated them, but the error is still there and the log message is still the same:

I0927 14:08:09.063345       7 request.go:601] Waited for 1.043137847s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/tyk.tyk.io/v1alpha1?timeout=32s

Maybe it is related, but I also have the issue mentioned here: I always need to remove the finalizers manually to be able to delete the resource.

However, the Service Account seems to be there, both tf-controller and tf-runner ones

chanwit commented 1 year ago

Thank you, @dcfranca. Right, we haven't followed the flowcontrol for fairness yet, and we should implement it.

dcfranca commented 1 year ago

@chanwit Sorry, I'm not following. Is there anything I can do to fix or better debug the issue?

chanwit commented 1 year ago

dcfranca commented 1 year ago

Do you mean that it needs to be implemented on tf-controller? Is there a workaround I can do in the meantime? Or how can I handle that from my side?

chanwit commented 1 year ago

Do you mean that it needs to be implemented on tf-controller?

Yes

Is there a workaround I can do in the meantime?

I'm afraid there's no workaround for the client-side throttling. You will only see the messages; they are not going to make the controller stop working.

Are you able to get everything up and running for now?

dcfranca commented 1 year ago

@chanwit No, I'm in the same situation... the terraform resource is stuck in Reconciliation in progress and the last log information on the runner is the same I shared here.

chanwit commented 1 year ago

I see. Both errors might not be directly related to your problem.

How did you set up AWS credentials to access your S3 backend, please?

dcfranca commented 1 year ago

The authentication is via KIAM with the pod annotated with iam.amazonaws.com/role

dcfranca commented 1 year ago

The error also happens with the local backend.

dcfranca commented 1 year ago

@chanwit Running the tf-controller with debug logging enabled gives some extra information:

{"level":"error","ts":"2022-09-28T09:15:42.760Z","logger":"controller.terraform","msg":"unable to lookup or create runner","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"saas-github","namespace":"flux-system","error":"context deadline exceeded","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2022-09-28T09:15:42.760Z","logger":"controller.terraform","msg":"Reconciler error","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"saas-github","namespace":"flux-system","error":"context deadline exceeded","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":"2022-09-28T09:15:42.760Z","logger":"controller.terraform","msg":"getting source","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"helloworld","namespace":"flux-system"}
{"level":"info","ts":"2022-09-28T09:15:42.760Z","logger":"controller.terraform","msg":"trigger namespace tls secret generation","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"helloworld","namespace":"flux-system"}
{"level":"info","ts":"2022-09-28T09:15:42.760Z","logger":"cert-rotation","msg":"TLS already generated for ","namespace":"flux-system"}
{"level":"info","ts":"2022-09-28T09:15:42.760Z","logger":"controller.terraform","msg":"show runner pod state: ","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"helloworld","namespace":"flux-system","name":"helloworld","state":"running"}
{"level":"error","ts":"2022-09-28T09:16:12.761Z","logger":"controller.terraform","msg":"unable to lookup or create runner","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"helloworld","namespace":"flux-system","error":"context deadline exceeded","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":"2022-09-28T09:16:12.761Z","logger":"controller.terraform","msg":"Reconciler error","reconciler group":"infra.contrib.fluxcd.io","reconciler kind":"Terraform","name":"helloworld","namespace":"flux-system","error":"context deadline exceeded","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:227"}

chanwit commented 1 year ago

We don't support the local backend as we never tested it inside the cluster.

What about the standard setup with the default Kubernetes backend? Does it give any error?

dcfranca commented 1 year ago

What about the standard setup with the default Kubernetes backend? Does it give any error?

Sorry, I meant that I removed the backend configuration, so I assume it is using the default Kubernetes one?

apiVersion: infra.contrib.fluxcd.io/v1alpha1
kind: Terraform
metadata:
  name: helloworld
  namespace: flux-system
spec:
  interval: 10m
  approvePlan: "auto"
  path: ./terraform
  sourceRef:
    kind: GitRepository
    name: helloworld
    namespace: flux-system

chanwit commented 1 year ago

According to the logs,

The controller was able to create a Pod, and the Pod was actually in the "running" state.
But no IP was assigned to the Pod within 30 seconds, so the deadline was exceeded and the controller started over.

Do you have any additional info about your cluster? For example, how many seconds does it take your cluster to bring up your Pods and other workloads in a normal situation?
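The 30-second figure can be read off the controller log timestamps quoted earlier in the thread: the gap between "show runner pod state" and the following "unable to lookup or create runner" error. Below is a small Python sketch that measures that gap from JSON log lines. The helper name is made up for illustration, and the sample lines are trimmed copies of the log entries above (only level, ts and msg are kept).

```python
import json
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # matches timestamps like 2022-09-27T11:46:43.657Z


def seconds_until_error(log_lines):
    """Seconds between the latest 'show runner pod state' entry and the next error entry."""
    started = None
    for line in log_lines:
        entry = json.loads(line)
        ts = datetime.strptime(entry["ts"], TS_FORMAT)
        if entry["msg"].startswith("show runner pod state"):
            started = ts
        elif entry["level"] == "error" and started is not None:
            return (ts - started).total_seconds()
    return None  # no pod-state entry followed by an error was found


# Trimmed copies of two entries from the tf-controller log quoted above.
sample_lines = [
    '{"level":"info","ts":"2022-09-27T11:46:43.657Z","msg":"show runner pod state: "}',
    '{"level":"error","ts":"2022-09-27T11:47:13.657Z","msg":"unable to lookup or create runner"}',
]

print(seconds_until_error(sample_lines))
```

A constant gap on every retry points to a fixed timeout being hit rather than to a slow but eventually successful operation.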

dcfranca commented 1 year ago

@chanwit Where do you see in the logs that there is no IP assigned to the pod?

I can list it on my cluster and it has an IP. The assignment of IPs usually happens very fast:

helloworld-tf-runner                     ●  1/1          0 Running    1  10    n/a    n/a    n/a    n/a 10.10.150.55
tf-controller-xxxxxxxxxxxxxxx            ●  1/1          0 Running    4  34      2      0     53      3 10.10.146.246

jwolski2 commented 1 year ago

Hi @chanwit, thanks for your attention on this issue. I'm on the same team as @dcfranca, debugging this with him. I see that there is some sort of mTLS connection between the tf-controller pod and the runner pod. Is that true? In the Helm chart, I also see a spot for a policy-agent-tls secret to be inserted into the tf-controller Deployment, but it's commented out. Are these the certs that the tf-controller pod needs to establish an mTLS connection with the runner? https://github.com/weaveworks/tf-controller/blob/main/charts/tf-controller/values.yaml#L67-L76

If this is accurate, how are the certs for the tf-controller meant to be created? Is that an exercise left for the user?

chanwit commented 1 year ago

@dcfranca I see. In that case the log should have said that it was unable to connect instead. Thank you for confirming that for me.

chanwit commented 1 year ago

Hi @jwolski2. Yes, there's an mTLS connection between the controller and each runner. Each runner serves on port 30000, and the controller creates an mTLS/gRPC client to connect to it.

The CA and all certificates for mTLS are generated by the controller itself, so nobody has to manage them by hand. The policy-agent-tls bits you've seen in the values file are not related to the standard setup.
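To rule out plain network blocking on the runner port, a TCP-level reachability check can help. Below is a minimal Python sketch to run from a pod that should be able to reach the runner. The function name is made up for illustration; port 30000 is the runner port mentioned above. A successful connect only shows that the port is reachable; it says nothing about the mTLS handshake or the gRPC layer.

```python
import socket


def can_connect(host, port=30000, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable hosts and DNS failures.
        return False
```

For example, can_connect("10.10.150.55") would test the runner pod IP listed earlier in this thread. Passing the hostname the controller uses instead of the IP would exercise DNS resolution as well.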

chanwit commented 1 year ago

My only conclusion for now is that something is preventing communication between the controller and the runners in your cluster, probably some kind of security group or network policy.

chanwit commented 1 year ago

We did some internal testing and found that this problem was caused by network policies. We have added preflight checks to the getting started guide: https://weaveworks.github.io/tf-controller/getting_started/

I shall close this issue for now. Please re-open it if the problem still persists.

jwolski2 commented 1 year ago

Our problem didn't actually have to do with network policy. One thing for certain was that this line https://github.com/weaveworks/tf-controller/blob/main/controllers/terraform_controller.go#L1912 was tripping us up. There's a hidden assumption in the implementation of GetRunnerHostname that pod IPs are resolvable from their fully qualified name. Our CoreDNS was not set up for that. As soon as we added the pods verified directive to CoreDNS, we were able to get past some errors and on to some others:

kubernetes cluster.local {
  pods verified
}
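For reference, here is a small Python sketch of the name such a lookup would use, assuming the controller relies on the standard Kubernetes IP-based pod DNS name (the pod IP with dots replaced by dashes, followed by the namespace, pod and the cluster domain). The function names are made up for illustration and this is not the controller's code; cluster.local is the domain from the CoreDNS snippet above.

```python
import socket


def pod_dns_name(pod_ip, namespace, cluster_domain="cluster.local"):
    """Build the IP-based pod DNS name: dots in the IPv4 address become dashes."""
    return f"{pod_ip.replace('.', '-')}.{namespace}.pod.{cluster_domain}"


def resolves(hostname):
    """Return True if the hostname resolves from where this script runs."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


# Runner pod IP and namespace taken from the pod listing earlier in this thread.
name = pod_dns_name("10.10.150.55", "flux-system")
print(name)
```

Calling resolves(name) from inside a pod in the cluster shows whether CoreDNS answers for that name; without a pods option that serves such names, the lookup is expected to fail.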

We're now running into a few other errors, mostly related to assuming roles from within our Terraform code that uses S3 backends. But we're making some progress and are on our way 🚀

chanwit commented 1 year ago

Thank you for sharing, @jwolski2. That DNS bit is new to me.

flux-iac / tofu-controller

Stuck in "Reconciliation in progress" state #365