fluxcd / flux2

Open and extensible continuous delivery solution for Kubernetes. Powered by GitOps Toolkit.
https://fluxcd.io
Apache License 2.0

Flux deadlocked, all resources stalled but everything fine in cluster #4752

Open fabszabo opened 6 months ago

fabszabo commented 6 months ago

Describe the bug

Our build workflow first increments the version of our services, then builds the new version. This works fine for over 30 services, except 1 that takes a bit longer to build.

Steps to reproduce

1. Update a chart version.
2. Wait until "context deadline exceeded" appears.
3. Upload the Docker image for that chart version.
4. The Kubernetes cluster catches on and uses the new image.
5. Flux gets stuck and refuses to apply any new changes.

Expected behavior

Flux should notice that the new version is available and already running on the cluster.

Flux should never go into a complete deadlock just because one thing doesn't work.

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

flux: v2.2.2

Flux check

► checking prerequisites
✗ flux 2.2.2 <2.2.3 (new CLI version is available, please upgrade)
✔ Kubernetes 1.27.11-gke.1062000 >=1.26.0-0
► checking version in cluster
✔ distribution: flux-v2.2.2
✔ bootstrapped: true
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.37.2
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.2.1
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.2.3
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v1.2.3
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta3
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta2
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta3
✔ receivers.notification.toolkit.fluxcd.io/v1
✔ all checks passed

Git provider

GitHub

Container Registry provider

Google Cloud

Additional context

No response

stefanprodan commented 6 months ago

> Flux should never go into a complete deadlock just because one thing doesn't work.

Flux doesn't do that by default; the likely cause is the way you configured it. Unless you provide your whole configuration and explain when the timeout occurs, I don't see how anyone can help you.

fabszabo commented 6 months ago

I can't copy-paste the configs for a few reasons, but I'll prepare something ASAP. The issue is definitely real.

fabszabo commented 6 months ago

Here's the infrastructure. I'll let you be the judge, but I really don't think there is anything wrong with the configuration. I believe I'm encountering an edge case that isn't covered yet. It runs fine on 2 clusters with 40+ services and only causes this issue when the image is not available "in time" (around 5-10 minutes). Once the "context deadline" is exceeded, that's it: it's stuck until I remove the service from the repository and re-add it, which causes a downtime of 1-2 minutes for our clients.

In hindsight, I have to apologize for the wording above: I have since noticed that I can still update and deploy other things, and Flux still picks up new releases of the "broken" service. However, it never notices that the service is running fine in the cluster, no matter how long I wait.

I have followed the guide on how to set up the flux config repository. Directory overview:

├── infrastructure
│   └── controllers 
│       ├── cluster-a 
│       └── cluster-b 
├── clusters
│   ├── cluster-a
│   │   └── flux-system
│   └── cluster-b
│       └── flux-system
└── apps
    ├── base
    │   ├── service-a
    │   └── service-b
    ├── cluster-a
    └── cluster-b

apps/base

apps/base/service/kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - repository.yaml
  - release.yaml
  - git-deploy-key.yaml

apps/base/service/repository.yaml

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: service
  namespace: default
spec:
  interval: 5m0s
  url: ssh://git@github.com:22/company/service
  ref:
    branch: master
  secretRef:
    name: service-git-deploy-key

apps/base/service/release.yaml

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: service
  namespace: default
spec:
  releaseName: service
  chart:
    spec:
      chart: chart
      sourceRef:
        kind: GitRepository
        name: service
        namespace: default
  interval: 5m
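
For comparison, a variant of this HelmRelease with a longer timeout and remediation retries might look like the following. This is a minimal sketch, not the configuration from this report: the `timeout` and `retries` values are illustrative assumptions, and whether they avoid the stall described above depends on how long the image actually takes to appear.

```yaml
# Sketch only: the timeout and retry values below are illustrative assumptions,
# not taken from the reported configuration.
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: service
  namespace: default
spec:
  releaseName: service
  interval: 5m
  # Time to wait for any individual Kubernetes operation during a Helm action.
  timeout: 15m
  chart:
    spec:
      chart: chart
      sourceRef:
        kind: GitRepository
        name: service
        namespace: default
  install:
    remediation:
      # Number of retries after a failed install before giving up.
      retries: 3
  upgrade:
    remediation:
      # Number of retries after a failed upgrade before giving up.
      retries: 3
```

With no `remediation` block, the retry count defaults to 0, so a single failed upgrade is not retried automatically.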

apps/base/service/git-deploy-key.yaml

---
apiVersion: v1
kind: Secret
metadata:
  name: service-git-deploy-key
  namespace: default
type: Opaque
stringData:
  identity: |
    ...

apps/production

apps/production/kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base/service
patchesStrategicMerge:
  - patches.yaml

apps/production/patches.yaml

---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: service
  namespace: default
spec:
  chart:
    spec:
      chart: charts/production
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: service
  namespace: default
spec:
  ref:
    branch: master
---

clusters/production

clusters/production/flux-system contains the 2 standard files autogenerated by Flux.

clusters/production/apps.yaml

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m0s
  dependsOn:
    - name: infra-controllers
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./apps/production
  prune: true
  wait: true
  timeout: 5m0s
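
Because `wait: true` makes this Kustomization health-check everything it applies, a single HelmRelease that is not ready within `timeout` is likely enough to fail the whole reconciliation with "context deadline exceeded". A minimal sketch of the same object with a wider window and an explicit retry interval follows; the `15m0s` and `2m0s` values are assumptions chosen for illustration, not recommendations from the Flux maintainers.

```yaml
# Sketch only: timeout and retryInterval values are illustrative assumptions.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m0s
  # Interval at which a previously failed reconciliation is retried.
  retryInterval: 2m0s
  # Upper bound for validation, apply, and health-check operations.
  timeout: 15m0s
  dependsOn:
    - name: infra-controllers
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./apps/production
  prune: true
  wait: true
```

When `retryInterval` is omitted, as in the original file, failed reconciliations are retried at the regular `interval`.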

clusters/production/infrastructure.yaml

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-controllers
  namespace: flux-system
spec:
  interval: 1h
  retryInterval: 1m
  timeout: 5m
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./infrastructure/controllers/production
  prune: true
  wait: true

infrastructure/controllers/production

infrastructure/controllers/production/kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - weave-gitops.yaml

infrastructure/controllers/production/weave-gitops.yaml

---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: weave-gitops
  namespace: flux-system
spec:
  type: oci
  interval: 24h
  url: oci://ghcr.io/weaveworks/charts
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: weave-gitops
  namespace: flux-system
spec:
  interval: 60m
  chart:
    spec:
      chart: weave-gitops
      version: "*"
      sourceRef:
        kind: HelmRepository
        name: weave-gitops
      interval: 12h
  values:
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
      limits:
        cpu: 1
        memory: 512Mi
    securityContext:
      capabilities:
        drop:
          - ALL
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000
    adminUser:
      create: true
      username: ...
      passwordHash: ...

stefanprodan commented 6 months ago

> Once the "context deadline" is exceeded that's it. It's stuck until I remove the service from the repository and re-add it.

What is a service in this context? Is it a Deployment inside a Helm chart?