o-orand opened this issue 2 years ago
This boils down to an automatic reconciler which is mentioned in issue https://github.com/isaaguilar/terraform-operator/issues/84.
The idea of auto reconciliation sounds nice but I'd need to put some thought into how it would actually work.
I can think of a workaround that might help. Here's how it works: update spec.env with something like a revision number.
Personally, I have used an env like the following to force-trigger builds:
kind: Terraform
metadata:
  name: my-tfo-resource
spec:
  ...
  env:
    - name: _REVISION
      value: "10" # a counter or random string would work
If you have a setup like above, you should be able to write a cron or an infinite loop to change the "_REVISION".
while true; do
  kubectl patch terraform my-tfo-resource --type json -p '[
    {
      "op": "replace",
      "path": "/spec/env/0",
      "value": {"name":"_REVISION","value":"'$RANDOM'"}
    }
  ]'
  sleep 600
done
Every 10 minutes, this script will update the terraform which will auto-reconcile.
Thanks @isaaguilar for the workaround. I will give it a try.
This workaround works quite well; I've used a CronJob instead of a Job.
The main drawback is persistent volume consumption, as providers are downloaded on each run. So we have to use cleanupDisk (see the sketch below), but then we may lose the root-cause error on failure. Another alternative is to implement a custom cleanup to remove unwanted data.
Below are some extracts of my configuration:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: tfo-reconciliation-workaround-job
spec:
  # currently triggering manually
  schedule: "1/5 * * * *"
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 86400 # keep the job for 24 hours to access its logs
      template:
        spec:
          serviceAccountName: my-service-account
          containers:
            - name: update-tfo-reconciliation-marker
              image: bitnami/kubectl:1.21
              securityContext:
                runAsUser: 0
              command:
                - '/bin/sh'
                - '-c'
                - '/scripts/workaround.sh' # this script is the kubectl patch command you provided previously
              volumeMounts:
                - name: script
                  mountPath: "/scripts/workaround.sh"
                  subPath: workaround.sh
          volumes:
            - name: script
              configMap:
                name: tfo-reconciliation-workaround
                defaultMode: 0777
          restartPolicy: Never
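The ConfigMap referenced above is not shown here; assuming workaround.sh simply contains the kubectl patch command from earlier (without the while loop, since the CronJob already provides the scheduling), it could be created along these lines:
# Hedged example: creates the ConfigMap from a local workaround.sh file.
kubectl create configmap tfo-reconciliation-workaround \
  --from-file=workaround.sh -n <my-namespace>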
It also requires specific RBAC roles to interact with the terraform operator:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tfo-workaround-role
  namespace: <my-namespace>
rules:
  - apiGroups: [""]
    resources:
      - secrets
      - configmaps
    verbs:
      - "*"
  - apiGroups:
      - tf.isaaguilar.com
    resources:
      - terraforms
    verbs:
      - list
      - get
      - patch
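The Role by itself is not enough; it also has to be bound to the CronJob's ServiceAccount. A minimal sketch, reusing the names from the manifests above (the binding name itself is made up here):
kubectl create rolebinding tfo-workaround-rolebinding \
  --role=tfo-workaround-role \
  --serviceaccount=<my-namespace>:my-service-account \
  -n <my-namespace>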
Hello @isaaguilar, after running this workaround for a few weeks, we've hit a limitation: new ConfigMaps and Secrets are generated on each run and kept forever. See the sample below:
kubectl get terraforms.tf.isaaguilar.com
tf-harbor-internet 19d
tf-harbor-intranet 19d
kubectl get configmaps,secrets|grep "tf-harbor"|cut -d'-' -f1-3|sort|uniq -c
3210 configmap/tf-harbor-internet
2598 configmap/tf-harbor-intranet
3788 secret/tf-harbor-internet
2858 secret/tf-harbor-intranet
Is it possible to keep only the last xxx executions?
A few hours ago I released v0.8.2, which changes the behavior of keepLatestPodsOnly to do much better cleanup: https://github.com/isaaguilar/terraform-operator/releases/tag/v0.8.2
kind: Terraform
metadata:
  name: my-tfo-resource
spec:
  ...
  keepLatestPodsOnly: true
That should clear out old resources and keep only the latest. The ones that got created before will need to be manually cleared, unfortunately.
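For the manual clearing, a rough sketch, assuming (as in the listings above) that the generated ConfigMaps and Secrets are prefixed with the Terraform resource name. Review the list before deleting, and keep the newest objects, since the current run (and the tfstate backend, if stored in-cluster) may still need them:
# Hedged cleanup sketch: delete all but the newest few ConfigMaps/Secrets per
# resource prefix. Requires GNU head/xargs; adjust the prefixes and the count kept.
for prefix in tf-harbor-internet tf-harbor-intranet; do
  for kind in configmap secret; do
    kubectl get "$kind" --sort-by=.metadata.creationTimestamp -o name \
      | grep "$prefix" \
      | head -n -3 \
      | xargs -r kubectl delete
  done
done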
Thanks! I've installed the latest version and it works better.
I've noticed that the operator is killed due to Out Of Memory, but everything looks fine in the logs.
- containerID: containerd://8ebf83d8d5d36bf0828c4f9262fe188d98a1356cffea470c920236a2428443d4
  image: docker.io/isaaguilar/terraform-operator:v0.8.2
  imageID: docker.io/isaaguilar/terraform-operator@sha256:319a86bad4bb657dc06f51f5f094639f37bceca2b0dd3255e5d1354d601270b2
  lastState:
    terminated:
      containerID: containerd://8ebf83d8d5d36bf0828c4f9262fe188d98a1356cffea470c920236a2428443d4
      exitCode: 137
      finishedAt: "2022-06-10T08:32:41Z"
      reason: OOMKilled
      startedAt: "2022-06-10T08:31:56Z"
  name: terraform-operator
  ready: false
  restartCount: 167
  started: false
I will try to increase the allocated memory and see :) Nevertheless, it seems related to the number of Secrets and ConfigMaps. I've deleted old Secrets and ConfigMaps, but the OOM is still here:
kubectl get configmaps,secrets |grep "tf-harbor"|cut -d'-' -f1-3|sort|uniq -c
1 configmap/tf-harbor-internet
1 configmap/tf-harbor-intranet
2 secret/tf-harbor-internet
1 secret/tf-harbor-intranet
I'd be interested in knowing how much memory was allocated, and the total number of 'tf' resources:
# total tf
kubectl get tf --all-namespaces | wc -l
Maybe also some metrics on the total number of pods, since the operator has a watch on pod events as well.
Allocated memory was the default value (128M). I've increased it to 256M, and now the tf operator seems fine. For tf resources it's easy: we only have 2...
For the number of pods:
kubectl get pods --all-namespaces --no-headers|wc -l
103
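For anyone wanting to apply the same memory bump, one way is a kubectl set resources call. This is only a sketch: the deployment name and namespace depend on how the operator was installed (Helm users would set this via chart values instead):
# Hedged example: deployment name and namespace are assumptions.
kubectl -n <operator-namespace> set resources deployment/terraform-operator \
  --requests=memory=256Mi --limits=memory=256Mi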
I'm facing another issue that makes the workaround fail: the YAML associated with the Terraform resource is too big, see the message below:
2022-06-20T08:21:16.456Z DEBUG terraform_controller failed to update tf status: rpc error: code = ResourceExhausted desc = trying to send message larger than max (2097279 vs. 2097152) {"Terraform": "10-harbor-registry/tf-harbor-intranet", "id": "8564c063-e25b-410d-9b55-ca71b50627cf"}
As all generations are kept forever, the YAML keeps growing:
status:
  exported: "false"
  lastCompletedGeneration: 0
  phase: running
  podNamePrefix: tf-harbor-internet-46ja6izd
  stages:
    - generation: 1
      interruptible: false
      podType: setup
      reason: TF_RESOURCE_CREATED
      startTime: "2022-05-20T11:09:48Z"
      state: failed
      stopTime: "2022-05-20T11:09:51Z"
    ...
    - generation: 9897
      interruptible: true
      podType: post
      reason: ""
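A quick way to see roughly how close a resource is getting to the ~2 MiB limit from the error above is to measure the size of its full manifest, e.g.:
# Approximate size in bytes of the object, including its ever-growing status.
kubectl get terraform tf-harbor-intranet -o yaml | wc -c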
Thanks @o-orand. I knew this would soon be an issue and I haven't thought of a good way to handle it yet. I figured using an existing option, like keepLatestPodsOnly, should clean up status automatically. The downside is that pod status history is sort of lost... users who log k8s events can still see pod statuses.
Another idea, and possibly one I'll investigate (after the kids go back to school 😅), is using the PVC to store runner status and terraform logs. This data will be formatted to be fed into a tfo dashboard. More on this to come.
For an immediate fix, perhaps we should keep the last n generation statuses, in case someone is using the generation status feature for some reason. I'll continue forming ideas.
@isaaguilar - Checking up on this thread, is the above workaround still the only way for periodic reconciliation?
Thanks
As a terraform-operator user, in order to ensure tfstate is always in sync with the underlying infrastructure, and to reduce manual operations, I need a mechanism to automatically and frequently execute the terraform workflow.
Use case samples: