crowdsecurity / helm-charts

CrowdSec community Kubernetes Helm charts
MIT License

[Question] Trouble with HA for LAPI Pod #181

Closed ImranR98 closed 1 week ago

ImranR98 commented 2 weeks ago

I've been trying to get this to work in a small testing environment with Traefik. My current config seems to work fine with a single LAPI pod backed by a Postgres DB and connected to 2 agents on 2 nodes.

But if I try setting the lapi.replicas value to 2, I get the following error in one of the two pods when I try to run a cscli command (like cscli decisions list):

level=fatal msg="unable to retrieve decisions: performing request: Get \"http://localhost:8080/v1/alerts?has_active_decision=true&include_capi=false&limit=100\": API error: incorrect Username or Password"
command terminated with exit code 1

This is my values.yaml:

config:
  config.yaml.local: |
    db_config:
      type:     postgresql
      user:     ${DB_USERNAME}
      password: ${DB_PASSWORD}
      db_name:  ${DB_NAME}
      host:     crowdsec-db.production.svc.cluster.local
      port:     5432
      sslmode:  disable

container_runtime: containerd

agent:
  acquisition:
    - namespace: production
      podName: traefik-*
      program: traefik
  env:
    - name: COLLECTIONS
      value: "crowdsecurity/traefik"
    - name: LEVEL_DEBUG
      value: "false"

lapi:
  replicas: 2 # Seems to not work with multiple replicas
  dashboard:
    enabled: true
  env:
    - name: BOUNCER_KEY_traefik
      value: "<some long value>"
    - name: DB_NAME
      valueFrom:
        secretKeyRef:
          name: crowdsec-db-secret
          key: POSTGRES_DB
    - name: DB_USERNAME
      valueFrom:
        secretKeyRef:
          name: crowdsec-db-secret
          key: POSTGRES_USER
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: crowdsec-db-secret
          key: POSTGRES_PASSWORD
  persistentVolume:
    config:
      enabled: false
    data:
      enabled: false
  secrets:
    csLapiSecret: "<some long value>" # I set this to try and fix the issue (it didn't)

My assumption was that since I have disabled persistent volumes and configured a DB instead, both LAPI instances would connect to the same DB and have no issues. But I've clearly misunderstood how everything fits together. Would appreciate anyone pointing me in the right direction!

github-actions[bot] commented 2 weeks ago

@ImranR98: Thanks for opening an issue, it is currently awaiting triage.

If you haven't already, please provide the following information:

In the meantime, you can:

  1. Check Crowdsec Documentation to see if your issue can be self-resolved.
  2. You can also join our Discord.
  3. Check Releases to make sure your agent is on the latest version.
I am a bot created to help the [crowdsecurity](https://github.com/crowdsecurity) developers manage community feedback and contributions. You can check out my [manifest file](https://github.com/crowdsecurity/helm-charts/blob/main/.github/governance.yaml) to understand my behavior and what I can do. If you want to use this for your project, you can check out the forked project [rr404/oss-governance-bot](https://github.com/rr404/oss-governance-bot) repository.

github-actions[bot] commented 2 weeks ago

@ImranR98: There is no 'kind' label on this issue. You need a 'kind' label to start the triage process.

ImranR98 commented 2 weeks ago

/kind documentation
/area local-api

he2ss commented 2 weeks ago

Hi, the solution is to check in the chart whether replicas are enabled (more than 1) and, if so, suffix the CUSTOM_HOSTNAME env var with an index.

Discussed with @blotus.
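
For anyone following along, a minimal sketch of that idea in the chart's pod template (hypothetical, not the actual PR; a Deployment hands every replica the same pod template, so in practice the unique suffix has to come from something per-pod, such as the pod name):

# sketch only: override the machine name when more than one LAPI replica is requested
{{- if gt (int .Values.lapi.replicas) 1 }}
- name: CUSTOM_HOSTNAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name  # unique per pod, standing in for an index
{{- end }}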

ImranR98 commented 2 weeks ago

I'm not sure I understand, but glad to see there's a PR to fix it 🚀 Just to clarify, does this mean that - even without the PR you made - Crowdsec is actually working as expected aside from cscli availability? I assumed the lack of cscli access meant there was something else wrong with the pod.

LaurenceJJones commented 2 weeks ago

> I'm not sure I understand, but glad to see there's a PR to fix it 🚀 Just to clarify, does this mean that - even without the PR you made - Crowdsec is actually working as expected aside from cscli availability? I assumed the lack of cscli access meant there was something else wrong with the pod.

So, a not-so-short tl;dr:

When the LAPI pods come up, each one runs a machine add command directly so that it has working credentials, and by default the container uses the name "localhost", the default value of CUSTOM_HOSTNAME. Since both LAPIs use the same name, the startup script deletes the credentials the other LAPI just registered (each pod believes its name is unique, so if the name already exists it assumes the previous LAPI pod was deleted and its credentials lost). Hence you have one LAPI that works with cscli and another that does not.
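
In other words, the default env gives every replica the same machine name; a minimal sketch of the colliding default (the exact default lives in the chart's templates):

# default behaviour (sketch): every LAPI replica registers itself under the
# same machine name, so each new pod wipes the previous pod's credentials
env:
  - name: CUSTOM_HOSTNAME
    value: "localhost"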

The side effect is that one of the LAPIs will keep working for a couple of hours because its JWT token is still valid; once the token expires, that LAPI starts getting authentication errors, since the username and password it previously registered no longer exist in the database.

The fix: we now force each LAPI to have a unique name by using the pod's randomly generated name from its metadata, which stops the name collision.
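
Concretely, the rendered pod spec would then carry something like this (a sketch, assuming the standard Kubernetes Downward API is what exposes the pod name):

# fix (sketch): derive the machine name from the pod's own generated name,
# which is unique per pod, so machine names can no longer collide
env:
  - name: CUSTOM_HOSTNAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name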

ImranR98 commented 1 week ago

Okay that makes sense, thanks for the explanation!