crowdsecurity / helm-charts

CrowdSec community kubernetes helm charts
MIT License

Potential issues with v1.5.2 on helm #102

Open · neilmfrench opened this issue 1 year ago

neilmfrench commented 1 year ago

A clean install of the chart does not work with 1.5.2: it produces errors while trying to chown the SQLite database, which does not exist yet.

Upgrading from 1.4.x lets the app start, but any cscli command crashes the lapi pod.

LaurenceJJones commented 1 year ago

So if I merge #100, it will cause issues? Can anyone confirm?

LaurenceJJones commented 1 year ago

Are you using persistent volumes? This issue normally happens when you are mounting a config.yaml.

jahanson commented 1 year ago

@LaurenceJJones These values work for me with the chart as-is on 1.4.5. I have not tested anything newer or tried manually setting the chart image.

values:
    container_runtime: containerd
    agent:
      acquisition:
        - namespace: network
          podName: ingress-nginx-controller-*
          program: nginx
      env:
        - name: COLLECTIONS
          value: "crowdsecurity/nginx"
        - name: PARSERS
          value: "crowdsecurity/cri-logs"
jahanson commented 1 year ago

More to the point of the original issue: with these near-default settings, I don't have any problems with cscli commands crashing the lapi.

neilmfrench commented 1 year ago

@LaurenceJJones yes, I am.

Here are my values:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: crowdsec
spec:
  values:
    container_runtime: containerd
    image:
      tag: "v1.4.6"
    tls:
      enabled: true
    secrets:
      username: "${SECRET_CROWDSEC_AGENT_USERNAME}"
      password: "${SECRET_CROWDSEC_AGENT_PASSWORD}"
    config:
      parsers:
        s02-enrich:
          whitelist-external-ip.yaml: |
            name: crowdsecurity/whitelists
            description: "Whitelist my external ip"
            whitelist:
              reason: "My external ip"
              ip:
                - "${SECRET_EXTERNAL_IP}"
                - "${SECRET_GCP_IP}"
    lapi:
      env:
        - name: ENROLL_KEY
          value: "${SECRET_CROWDSEC_ENROLL_KEY}"
        - name: ENROLL_INSTANCE_NAME
          value: "homelab-k8s-cluster"
        - name: ENROLL_TAGS
          value: "k8s linux homelab"
        - name: BOUNCER_KEY_traefik
          value: "${SECRET_CROWDSEC_TRAEFIK_BOUNCER_KEY}"
        - name: BOUNCER_KEY_cloudflare
          value: "${SECRET_CROWDSEC_CLOUDFLARE_BOUNCER_KEY}"
        - name: LEVEL_DEBUG
          value: "true"
      dashboard:
        enabled: true
        assetURL: https://crowdsec-statics-assets.s3-eu-west-1.amazonaws.com/metabase_sqlite.zip
        ingress:
          enabled: true
          annotations:
            cert-manager.io/cluster-issuer: "letsencrypt-production"
            traefik.ingress.kubernetes.io/router.entrypoints: "websecure"
          ingressClassName: "traefik"
          host: &host "crowdsec-dash.${SECRET_DOMAIN}"
          tls:
            - hosts:
                - *host
              secretName: "crowdsec-dash-tls"
      persistentVolume:
        data:
          enabled: true
          storageClassName: ceph-block
          size: 1Gi
        config:
          enabled: true
          storageClassName: ceph-block
          size: 500Mi
    agent:
      persistentVolume:
        config:
          enabled: true
          accessModes:
            - ReadWriteMany
          storageClassName: ceph-filesystem
          size: 400Mi
      env:
        # - name: DISABLE_ONLINE_API
        #   value: "true"
        # - name: USE_FORWARDED_FOR_HEADERS
        #   value: "true"
        - name: COLLECTIONS
          value: >-
            crowdsecurity/linux
            crowdsecurity/sshd
            crowdsecurity/traefik
            crowdsecurity/base-http-scenarios
            crowdsecurity/http-cve
            crowdsecurity/whitelist-good-actors
        - name: PARSERS
          value: "crowdsecurity/cri-logs"
      acquisition:
        - namespace: networking
          podName: traefik-*
          program: traefik
      metrics:
        enabled: true
        serviceMonitor:
          enabled: true

On a completely fresh install (no existing persistent volumes), this fails if I set the tag to v1.5.2, but works with v1.4.6.

While troubleshooting I did a little digging; it seemed to be related to https://github.com/crowdsecurity/crowdsec/blob/master/docker/docker_start.sh#L276, so I disabled the config PVC. That got me a little further into startup, but it still eventually failed with the same chown error.
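For reference, a minimal sketch of what turning off the config PVC looks like in the HelmRelease values posted above. This assumes it was the LAPI config volume that was disabled (the comment does not say which one) and reuses the `lapi.persistentVolume.config.enabled` key from those values. It is only an illustration of the change described, not a fix: the chown error still occurred with it.

```yaml
# Sketch only: assumes the LAPI config PVC is the one that was disabled.
spec:
  values:
    image:
      tag: "v1.5.2"
    lapi:
      persistentVolume:
        data:
          enabled: true          # data volume left as in the original values
          storageClassName: ceph-block
          size: 1Gi
        config:
          enabled: false         # no PVC mounted for the LAPI config
```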

My next approach was a clean install with v1.4.6 followed by an upgrade to v1.5.2. This got me past the chown errors, but whenever I ran any cscli command, even cscli --help, the lapi pod crashed. I do not have any cscli issues on v1.4.6.
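A sketch of that two-step path in terms of the same HelmRelease values, assuming everything other than `image.tag` stays exactly as posted above. Each YAML document is a partial `spec.values` fragment, applied one after the other rather than together.

```yaml
# Step 1: fresh install on v1.4.6, which starts cleanly here
spec:
  values:
    image:
      tag: "v1.4.6"
---
# Step 2: once the pods are running, bump only the tag so the release upgrades
spec:
  values:
    image:
      tag: "v1.5.2"
```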

jahanson commented 1 year ago

I did observe the same issues after uninstalling CrowdSec and forcing image v1.5.2, so I investigated a little and noticed that Kubernetes was OOM-killing the agents. I raised the default 100Mi memory limit, and I'm no longer getting crashing lapi or agent pods when using cscli.

Configuration used:

---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: crowdsec
spec:
  interval: 30m
  chart:
    spec:
      chart: crowdsec
      version: 0.9.6
      sourceRef:
        kind: HelmRepository
        name: crowdsec
        namespace: flux-system
      interval: 30m
  values:
    container_runtime: containerd
    image:
      tag: "v1.5.2"
    agent:
      acquisition:
        - namespace: network
          podName: ingress-nginx-controller-*
          program: nginx

      env:
        - name: COLLECTIONS
          value: "crowdsecurity/nginx"
        - name: PARSERS
          value: "crowdsecurity/cri-logs"
      resources:
        limits:
          memory: 512Mi
        requests:
          cpu: 150m
          memory: 256Mi
LaurenceJJones commented 10 months ago

Renaming the issue due to the release of 1.5.3.