Open fabszabo opened 6 months ago
Flux should never go into a complete deadlock just because one thing doesn't work.
Flux doesn't do that by default; the likely cause is the way you configured it. Unless you provide your whole configuration and explain when the timeout occurs, I don't see how anyone can help you.
I can't copy paste the configs for a few reasons, but I'll prepare something asap. The issue is definitely real.
Here's the infrastructure. I'll let you be the judge, but I really don't think there is anything wrong with the configuration. I believe that I'm encountering an edge case that isn't covered yet. It runs fine on 2 clusters with 40+ services and only causes this issue when the image is not available "in time" (like 5-10 mins). Once the "context deadline" is exceeded that's it. It's stuck until I remove the service from the repository and re-add it. That causes a downtime of 1-2 mins for our clients.
In hindsight I have to apologize for the wording above, as I have since noticed that I am still able to update and deploy other things and it still catches on to new releases of the "broken" service. But it does not catch on to it running fine in the cluster regardless of how long I wait.
I have followed the guide on how to set up the flux config repository. Directory overview:
├── infrastructure
│   └── controllers
│       ├── cluster-a
│       └── cluster-b
├── clusters
│   ├── cluster-a
│   │   └── flux-system
│   └── cluster-b
│       └── flux-system
└── apps
    ├── base
    │   ├── service-a
    │   └── service-b
    ├── cluster-a
    └── cluster-b
apps/base/service/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- repository.yaml
- release.yaml
- git-deploy-key.yaml
apps/base/service/repository.yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: service
  namespace: default
spec:
  interval: 5m0s
  url: ssh://git@github.com:22/company/service
  ref:
    branch: master
  secretRef:
    name: service-git-deploy-key
apps/base/service/release.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: service
  namespace: default
spec:
  releaseName: service
  chart:
    spec:
      chart: chart
      sourceRef:
        kind: GitRepository
        name: service
        namespace: default
  interval: 5m
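The HelmRelease above sets no `timeout` and no remediation, so the helm-controller defaults apply. As a rough sketch of what explicit settings could look like, here is a partial patch for the same release. The `15m` timeout and the retry count of `3` are illustrative assumptions, not values from the reporter's setup, and whether they resolve the behaviour described in this issue is untested.

```yaml
# Sketch only: timeout and retry values are assumptions chosen for illustration.
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: service
  namespace: default
spec:
  # Give Helm actions more time than the slowest image build is expected to need.
  timeout: 15m
  install:
    remediation:
      # Retry a failed install instead of leaving the release in a failed state.
      retries: 3
  upgrade:
    remediation:
      # Retry a failed upgrade for the same reason.
      retries: 3
```

The idea is that a release which fails because the image is not yet pullable gets further attempts once the image has been uploaded.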
apps/base/service/git-deploy-key.yaml
---
apiVersion: v1
kind: Secret
metadata:
  name: service-git-deploy-key
  namespace: default
type: Opaque
stringData:
  identity: |
    ...
apps/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../base/service.yaml
patchesStrategicMerge:
- patches.yaml
apps/production/patches.yaml
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: service
  namespace: default
spec:
  chart:
    spec:
      chart: charts/production
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: service
  namespace: default
spec:
  ref:
    branch: master
---
clusters/production/flux-system -> standard 2 autogenerated files from flux
clusters/production/apps.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m0s
  dependsOn:
    - name: infra-controllers
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./apps/production
  prune: true
  wait: true
  timeout: 5m0s
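Because this Kustomization sets `wait: true`, the kustomize-controller health-checks everything it applies and marks the reconciliation as failed if the resources are not ready within `timeout` (5 minutes here), which is in the same range as the 5-10 minute image delay described above. Unlike `infra-controllers`, it also sets no `retryInterval`. A hedged variant with a longer timeout and an explicit retry interval might look like the following; `15m0s` and `2m0s` are assumed values for illustration only.

```yaml
# Sketch only: retryInterval and timeout values are assumptions, not tested fixes.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m0s
  # Retry a failed reconciliation sooner than the regular 10m interval.
  retryInterval: 2m0s
  # Leave room for the slowest image build before the health check gives up.
  timeout: 15m0s
  dependsOn:
    - name: infra-controllers
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./apps/production
  prune: true
  wait: true
```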
clusters/production/infrastructure.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-controllers
  namespace: flux-system
spec:
  interval: 1h
  retryInterval: 1m
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./infrastructure/controllers/production
  prune: true
  wait: true
infrastructure/controllers/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- weave-gitops.yaml
infrastructure/controllers/production/weave-gitops.yaml
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: weave-gitops
  namespace: flux-system
spec:
  type: oci
  interval: 24h
  url: oci://ghcr.io/weaveworks/charts
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: weave-gitops
  namespace: flux-system
spec:
  interval: 60m
  chart:
    spec:
      chart: weave-gitops
      version: "*"
      sourceRef:
        kind: HelmRepository
        name: weave-gitops
      interval: 12h
  values:
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
      limits:
        cpu: 1
        memory: 512Mi
    securityContext:
      capabilities:
        drop:
          - ALL
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000
    adminUser:
      create: true
      username: ...
      passwordHash: ...
Once the "context deadline" is exceeded that's it. It's stuck until I remove the service from the repository and re-add it.
What is a service in this context? Is it some Deployment inside a Helm chart?
Describe the bug
Our build workflow first increments the version of our services, then builds the new version. This works fine for over 30 services, except 1 that takes a bit longer to build.
Steps to reproduce
1. Update a chart version.
2. Wait until "context deadline exceeded".
3. Upload the docker-image for that chart version.
4. Kubernetes Cluster catches on and uses the new image.
5. Flux gets stuck and refuses to apply any new changes.
Expected behavior
Flux should notice that the new version is available and already running on the cluster.
Flux should never go into a complete deadlock just because one thing doesn't work.
Screenshots and recordings
No response
OS / Distro
N/A
Flux version
flux: v2.2.2
Flux check
► checking prerequisites
✗ flux 2.2.2 <2.2.3 (new CLI version is available, please upgrade)
✔ Kubernetes 1.27.11-gke.1062000 >=1.26.0-0
► checking version in cluster
✔ distribution: flux-v2.2.2
✔ bootstrapped: true
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.37.2
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.2.1
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.2.3
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v1.2.3
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta3
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta2
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta3
✔ receivers.notification.toolkit.fluxcd.io/v1
✔ all checks passed
Git provider
GitHub
Container Registry provider
Google Cloud
Additional context
No response