crowdsecurity / crowdsec

CrowdSec - the open-source and participative security solution offering crowdsourced protection against malicious IPs and access to the most advanced real-world CTI.
https://crowdsec.net
MIT License

Loki on kubernetes as datasource timeout #3133

Open fgionghi opened 2 months ago

fgionghi commented 2 months ago

What happened?

This is not actually a bug, but due to how Loki operates on Kubernetes, this problem took me several hours to debug. I would like to share my solution to help others avoid the same issue.

As I discovered, the first thing CrowdSec does when connecting to Loki is check whether it is ready via its /ready endpoint. Loki on Kubernetes does not expose /ready (the endpoint used by its readinessProbe) outside the cluster, so CrowdSec fails when it tries to reach Loki, even though Loki is actually functioning correctly.

The /ready endpoint is not exposed because, by default, Loki on Kubernetes sits behind loki-gateway, an NGINX server that handles API requests under /loki/api/v1 and basically suppresses everything else. In fact, pointing CrowdSec directly at Loki's pod works.
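To confirm this from the machine running CrowdSec, it can help to probe /ready by hand. The following is a minimal sketch of such a probe, not CrowdSec's actual code; the helper names `ready_url` and `probe_ready` are made up for illustration, and authentication is left out.

```python
import urllib.error
import urllib.request


def ready_url(base_url: str) -> str:
    """Build the readiness URL for a Loki base URL (a trailing slash is tolerated)."""
    return base_url.rstrip("/") + "/ready"


def probe_ready(base_url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> str:
    """Send GET <base_url>/ready and return a short, human-readable diagnosis.

    `opener` is injectable so the function can be exercised without a network.
    """
    url = ready_url(base_url)
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; keep the status code.
        status = err.code
    except (urllib.error.URLError, TimeoutError) as err:
        return f"{url}: unreachable ({err})"
    if status == 200:
        return f"{url}: ready"
    # Any other status means the endpoint is not ready or not routed at all.
    return f"{url}: not ready or not routed (HTTP {status})"
```

Calling `probe_ready("https://loki.mydomain")` against the ingress and then against the Loki pod or service directly should show whether only the gateway path is failing.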

To resolve this issue while using the official Helm chart, add the following snippet:

gateway:
  nginxConfig:
    serverSnippet: |-
      location = /ready {
        proxy_pass http://loki.monitoring.svc.cluster.local:3100$request_uri;
      }
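The upstream in `proxy_pass` above is specific to this setup: a Service named `loki` in the `monitoring` namespace listening on port 3100. As a rough sketch, the same fragment can be rendered for another Service using the standard `<service>.<namespace>.svc.cluster.local` naming; the function name and its defaults are illustrative assumptions, not part of the Helm chart.

```python
def gateway_ready_snippet(service: str = "loki", namespace: str = "monitoring", port: int = 3100) -> str:
    """Render the Helm values fragment that makes loki-gateway forward /ready.

    The defaults mirror the values used in this issue; adjust them to match
    the Loki Service in your own cluster.
    """
    upstream = f"http://{service}.{namespace}.svc.cluster.local:{port}"
    return (
        "gateway:\n"
        "  nginxConfig:\n"
        "    serverSnippet: |-\n"
        "      location = /ready {\n"
        f"        proxy_pass {upstream}$request_uri;\n"
        "      }\n"
    )
```

For example, `print(gateway_ready_snippet(namespace="logging"))` emits the fragment for a Loki Service living in a `logging` namespace.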

What did you expect to happen?

Since Loki was working, I expected CrowdSec to be able to reach it without problems. I only found out that CrowdSec tries /ready before anything else by checking the NGINX logs. It would be helpful if CrowdSec provided more information about this in its own logs.

How can we reproduce it (as minimally and precisely as possible)?

Try to reach a Loki instance hosted on Kubernetes via an ingress.

Anything else we need to know?

No response

Crowdsec version

```console
$ cscli version
version: v1.6.2-debian-pragmatic-amd64-16bfab86
Codename: alphaga
BuildDate: 2024-05-31_09:18:01
GoVersion: 1.22.2
Platform: linux
libre2: C++
User-Agent: crowdsec/v1.6.2-debian-pragmatic-amd64-16bfab86-linux
Constraint_parser: >= 1.0, <= 3.0
Constraint_scenario: >= 1.0, <= 3.0
Constraint_api: v1
Constraint_acquis: >= 1.0, < 2.0
```

OS version

No response

Enabled collections and parsers

No response

Acquisition config

```console
$ cat /etc/crowdsec/acquis.yaml
/etc/crowdsec/acquis.yaml
---
source: loki
log_level: info
url: https://loki.mydomain
limit: 1000
query: |
  {host="reverse-proxy"}
auth:
  username: x
  password: y
labels:
  type: gelf-nginx
---
source: loki
log_level: debug
url: https://loki.mydomain
limit: 1000
query: |
  {unit="ssh.service", instance="ssh-jump"} | json | line_format `{{.SYSLOG_TIMESTAMP}}{{._HOSTNAME}} {{.SYSLOG_IDENTIFIER}}[{{._PID}}]: {{.MESSAGE}}`
auth:
  username: x
  password: y
labels:
  type: syslog
```

Config show

No response

Prometheus metrics

No response

Related custom configs versions (if applicable) : notification plugins, custom scenarios, parsers etc.

No response

github-actions[bot] commented 2 months ago

@fgionghi: Thanks for opening an issue, it is currently awaiting triage.

In the meantime, you can:

  1. Check Crowdsec Documentation to see if your issue can be self resolved.
  2. You can also join our Discord.
  3. Check Releases to make sure your agent is on the latest version.
I am a bot created to help the [crowdsecurity](https://github.com/crowdsecurity) developers manage community feedback and contributions. You can check out my [manifest file](https://github.com/crowdsecurity/crowdsec/blob/master/.github/governance.yml) to understand my behavior and what I can do. If you want to use this for your project, you can check out the [BirthdayResearch/oss-governance-bot](https://github.com/BirthdayResearch/oss-governance-bot) repository.

LaurenceJJones commented 2 months ago

Linking #2828 for this comment:

> I discovered the issue and that CrowdSec tries /ready before anything else by checking the NGINX logs. It would be helpful if CrowdSec provided more information about this in its logs.

Maybe we could add a configuration option to disable the /ready endpoint check, though that would be tailored only to the k8s ingress case. Let us dwell on this, but thank you for your report and for the guide on how others can overcome this.