CrowdSec stuck/leaking buckets when scope value is invalid (alert rejected by LAPI)

laurentgoudet commented 8 months ago

What happened?

Our edge load balancers ran out of memory, due to the CrowdSec process memory usage growing unbounded:

Upon looking into why, it seems that the crowdsecurity/http-crawl-non_statics buckets count started to creep up:

The bucket creation rate increased a bit at the time due to an ongoing attack:

The attack was a brute force attack against our login endpoint, so the IPs were not being reused and crowdsecurity/http-crawl-non_statics has a short leak speed (0.5s), so the underflow rate closely tracks the creation rate:

Despite that, it looks as if the underflowed buckets were not being cleaned up, as the total count just keeps increasing through the roof.

The issue was not limited to that specific scenario: buckets from all scenarios stopped being cleaned up, including the ones that were not triggered by the attack:

In short, it feels like CrowdSec was "stuck" cleaning up underflowed buckets. The weird thing is that cs_bucket_underflowed_total reports those are underflowed, but cs_buckets keeps increasing.

There was nothing in the CrowdSec logs during the event, i.e. when expired buckets stopped being cleaned up. Also, the attack triggered ~2k alerts, which does not seem that many to me.

The same issue happened 3 times during the last 2 days so we had to turn off CrowdSec for now. Any idea of what might be going on/how to debug it further?

What did you expect to happen?

CrowdSec to survive attacks

How can we reproduce it (as minimally and precisely as possible)?

Not sure what triggers the behavior

Anything else we need to know?

No response

Crowdsec version

```console
$ cscli version
2023/12/18 10:49:06 version: v1.5.5-d2d788c5dc0a9e387635276623c6781774a9dfd4
2023/12/18 10:49:06 Codename: alphaga
2023/12/18 10:49:06 BuildDate: 2023-10-24_08:13:35
2023/12/18 10:49:06 GoVersion: 1.21.3
2023/12/18 10:49:06 Platform: docker
2023/12/18 10:49:06 libre2: C++
2023/12/18 10:49:06 Constraint_parser: >= 1.0, <= 2.0
2023/12/18 10:49:06 Constraint_scenario: >= 1.0, < 3.0
2023/12/18 10:49:06 Constraint_api: v1
2023/12/18 10:49:06 Constraint_acquis: >= 1.0, < 2.0
```

OS version

```console
$ sudo docker exec -it crowdsec_agent cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.18.4
PRETTY_NAME="Alpine Linux v3.18"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://gitlab.alpinelinux.org/alpine/aports/-/issues"
$ sudo docker exec -it crowdsec_agent uname -a
Linux ip-172-30-18-14 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 Linux
```

Enabled collections and parsers

```console
$ cscli hub list -o raw
crowdsecurity/base-http-scenarios,"enabled,tainted",0.6,http common : scanners detection,collections
crowdsecurity/http-cve,enabled,2.5,Detect CVE exploitation in http logs,collections
crowdsecurity/linux,enabled,0.2,core linux support : syslog+geoip+ssh,collections
crowdsecurity/nginx,"enabled,tainted",0.2,nginx support : parser and generic http scenarios,collections
crowdsecurity/sshd,"enabled,update-available",0.2,sshd support : parser and brute-force detection,collections
crowdsecurity/whitelist-good-actors,enabled,0.1,Good actors whitelists,collections
crowdsecurity/dateparse-enrich,enabled,0.2,,parsers
crowdsecurity/geoip-enrich,enabled,0.2,"Populate event with geoloc info : as, country, coords, source range.",parsers
crowdsecurity/http-logs,enabled,1.2,"Parse more Specifically HTTP logs, such as HTTP Code, HTTP path, HTTP args and if its a static ressource",parsers
crowdsecurity/nginx-logs,enabled,1.5,Parse nginx access and error logs,parsers
crowdsecurity/sshd-logs,enabled,2.2,Parse openSSH logs,parsers
crowdsecurity/syslog-logs,enabled,0.8,,parsers
crowdsecurity/whitelists,enabled,0.2,Whitelist events from private ipv4 addresses,parsers
flnltd-gptbot-whitelist.yaml,"enabled,local",n/a,,parsers
flnltd-ignore-403s.yaml,"enabled,local",n/a,,parsers
flnltd-internal-ips-whitelist.yaml,"enabled,local",n/a,,parsers
crowdsecurity/CVE-2019-18935,enabled,0.2,Detect Telerik CVE-2019-18935 exploitation attempts,scenarios
crowdsecurity/CVE-2022-26134,enabled,0.2,Detect CVE-2022-26134 exploits,scenarios
crowdsecurity/CVE-2022-35914,enabled,0.2,Detect CVE-2022-35914 exploits,scenarios
crowdsecurity/CVE-2022-37042,enabled,0.2,Detect CVE-2022-37042 exploits,scenarios
crowdsecurity/CVE-2022-40684,enabled,0.3,Detect cve-2022-40684 exploitation attempts,scenarios
crowdsecurity/CVE-2022-41082,enabled,0.4,Detect CVE-2022-41082 exploits,scenarios
crowdsecurity/CVE-2022-41697,enabled,0.2,Detect CVE-2022-41697 enumeration,scenarios
crowdsecurity/CVE-2022-42889,enabled,0.3,Detect CVE-2022-42889 exploits (Text4Shell),scenarios
crowdsecurity/CVE-2022-44877,enabled,0.3,Detect CVE-2022-44877 exploits,scenarios
crowdsecurity/CVE-2022-46169,enabled,0.2,Detect CVE-2022-46169 brute forcing,scenarios
crowdsecurity/CVE-2023-22515,enabled,0.1,Detect CVE-2023-22515 exploitation,scenarios
crowdsecurity/CVE-2023-22518,enabled,0.2,Detect CVE-2023-22518 exploits,scenarios
crowdsecurity/CVE-2023-49103,enabled,0.2,Detect owncloud CVE-2023-49103 exploitation attempts,scenarios
crowdsecurity/apache_log4j2_cve-2021-44228,enabled,0.5,Detect cve-2021-44228 exploitation attemps,scenarios
crowdsecurity/f5-big-ip-cve-2020-5902,enabled,0.2,Detect cve-2020-5902 exploitation attemps,scenarios
crowdsecurity/fortinet-cve-2018-13379,enabled,0.3,Detect cve-2018-13379 exploitation attemps,scenarios
crowdsecurity/grafana-cve-2021-43798,enabled,0.2,Detect cve-2021-43798 exploitation attemps,scenarios
crowdsecurity/http-backdoors-attempts,enabled,0.4,Detect attempt to common backdoors,scenarios
crowdsecurity/http-crawl-non_statics,enabled,0.4,Detect aggressive crawl from single ip,scenarios
crowdsecurity/http-cve-2021-41773,enabled,0.2,cve-2021-41773,scenarios
crowdsecurity/http-cve-2021-42013,enabled,0.2,cve-2021-42013,scenarios
crowdsecurity/http-generic-bf,enabled,0.5,Detect generic http brute force,scenarios
crowdsecurity/http-open-proxy,enabled,0.4,Detect scan for open proxy,scenarios
crowdsecurity/http-path-traversal-probing,enabled,0.3,Detect path traversal attempt,scenarios
crowdsecurity/http-probing,enabled,0.3,Detect site scanning/probing from a single ip,scenarios
crowdsecurity/http-sensitive-files,enabled,0.3,"Detect attempt to access to sensitive files (.log, .db ..) or folders (.git)",scenarios
crowdsecurity/http-sqli-probing,enabled,0.3,A scenario that detects SQL injection probing with minimal false positives,scenarios
crowdsecurity/http-xss-probing,enabled,0.3,A scenario that detects XSS probing with minimal false positives,scenarios
crowdsecurity/jira_cve-2021-26086,enabled,0.2,Detect Atlassian Jira CVE-2021-26086 exploitation attemps,scenarios
crowdsecurity/netgear_rce,enabled,0.3,Detect Netgear RCE DGN1000/DGN220 exploitation attempts,scenarios
crowdsecurity/nginx-req-limit-exceeded,enabled,0.3,Detects IPs which violate nginx's user set request limit.,scenarios
crowdsecurity/pulse-secure-sslvpn-cve-2019-11510,enabled,0.3,Detect cve-2019-11510 exploitation attemps,scenarios
crowdsecurity/spring4shell_cve-2022-22965,enabled,0.3,Detect cve-2022-22965 probing,scenarios
crowdsecurity/ssh-bf,"enabled,update-available",0.2,Detect ssh bruteforce,scenarios
crowdsecurity/ssh-slow-bf,"enabled,update-available",0.3,Detect slow ssh bruteforce,scenarios
crowdsecurity/thinkphp-cve-2018-20062,enabled,0.4,Detect ThinkPHP CVE-2018-20062 exploitation attemps,scenarios
crowdsecurity/vmware-cve-2022-22954,enabled,0.3,Detect Vmware CVE-2022-22954 exploitation attempts,scenarios
crowdsecurity/vmware-vcenter-vmsa-2021-0027,enabled,0.2,Detect VMSA-2021-0027 exploitation attemps,scenarios
flnltd-http-bruteforce-login.yaml,"enabled,local",n/a,,scenarios
ltsich/http-w00tw00t,enabled,0.2,detect w00tw00t,scenarios
crowdsecurity/cdn-whitelist,enabled,0.4,Whitelist CDN providers,postoverflows
crowdsecurity/rdns,enabled,0.3,Lookup the DNS associated to the source IP only for overflows,postoverflows
crowdsecurity/seo-bots-whitelist,enabled,0.4,Whitelist good search engine crawlers,postoverflows
```

Acquisition config

```console
$ sudo docker exec -it crowdsec_agent cat /etc/crowdsec/acquis.yaml /etc/crowdsec/acquis.d/*
---
source: file
filenames:
  - "/logs/nginx/access.log"
labels:
  type: nginx
cat: can't open '/etc/crowdsec/acquis.d/*': No such file or directory
```

Config show

```console
$ cscli metrics
Acquisition Metrics:
╭─────────────────────────────┬────────────┬──────────────┬────────────────┬────────────────────────╮
│ Source                      │ Lines read │ Lines parsed │ Lines unparsed │ Lines poured to bucket │
├─────────────────────────────┼────────────┼──────────────┼────────────────┼────────────────────────┤
│ file:/logs/nginx/access.log │ 239.52M    │ 239.52M      │ 40             │ 129.73M                │
╰─────────────────────────────┴────────────┴──────────────┴────────────────┴────────────────────────╯

Bucket Metrics:
╭─────────────────────────────────────────────┬───────────────┬───────────┬──────────────┬─────────┬─────────╮
│ Bucket                                      │ Current Count │ Overflows │ Instantiated │ Poured  │ Expired │
├─────────────────────────────────────────────┼───────────────┼───────────┼──────────────┼─────────┼─────────┤
│ LePresidente/http-generic-401-bf            │ 1             │ 85        │ 35.32k       │ 50.69k  │ 35.23k  │
│ crowdsecurity/CVE-2019-18935                │ -             │ 15        │ 15           │ -       │ -       │
│ crowdsecurity/CVE-2022-26134                │ -             │ 133       │ 133          │ -       │ -       │
│ crowdsecurity/CVE-2022-35914                │ -             │ 3         │ 3            │ -       │ -       │
│ crowdsecurity/CVE-2022-41082                │ -             │ 6         │ 6            │ -       │ -       │
│ crowdsecurity/CVE-2023-22515                │ -             │ 36        │ 36           │ -       │ -       │
│ crowdsecurity/CVE-2023-22518                │ -             │ 3         │ 3            │ -       │ -       │
│ crowdsecurity/CVE-2023-49103                │ -             │ 12        │ 12           │ -       │ -       │
│ crowdsecurity/apache_log4j2_cve-2021-44228  │ -             │ 63        │ 63           │ -       │ -       │
│ crowdsecurity/f5-big-ip-cve-2020-5902       │ -             │ 26        │ 26           │ -       │ -       │
│ crowdsecurity/fortinet-cve-2018-13379       │ -             │ 9         │ 9            │ -       │ -       │
│ crowdsecurity/fortinet-cve-2022-40684       │ -             │ 4         │ 4            │ -       │ -       │
│ crowdsecurity/http-backdoors-attempts       │ -             │ 3         │ 400          │ 403     │ 397     │
│ crowdsecurity/http-crawl-non_statics        │ 1.66k         │ 271       │ 25.29M       │ 129.08M │ 25.28M  │
│ crowdsecurity/http-cve-2021-41773           │ -             │ 7         │ 7            │ -       │ -       │
│ crowdsecurity/http-path-traversal-probing   │ -             │ 2.34k     │ 3.51k        │ 12.01k  │ 1.17k   │
│ crowdsecurity/http-probing                  │ 82            │ 1.04k     │ 448.26k      │ 542.12k │ 447.13k │
│ crowdsecurity/http-sensitive-files          │ -             │ 83        │ 1.29k        │ 2.17k   │ 1.21k   │
│ crowdsecurity/http-sqli-probbing-detection  │ -             │ -         │ 3.06k        │ 4.13k   │ 3.06k   │
│ crowdsecurity/http-xss-probbing             │ -             │ 1         │ 1.31k        │ 2.43k   │ 1.31k   │
│ crowdsecurity/jira_cve-2021-26086           │ -             │ 20        │ 20           │ -       │ -       │
│ crowdsecurity/spring4shell_cve-2022-22965   │ -             │ 10        │ 10           │ -       │ -       │
│ crowdsecurity/vmware-vcenter-vmsa-2021-0027 │ -             │ 1         │ 1            │ -       │ -       │
│ flnltd/http-bruteforce-login                │ 26            │ 13        │ 18.92k       │ 37.10k  │ 18.88k  │
╰─────────────────────────────────────────────┴───────────────┴───────────┴──────────────┴─────────┴─────────╯

Parser Metrics:
╭──────────────────────────────────┬─────────┬─────────┬──────────╮
│ Parsers                          │ Hits    │ Parsed  │ Unparsed │
├──────────────────────────────────┼─────────┼─────────┼──────────┤
│ child-crowdsecurity/http-logs    │ 718.57M │ 679.43M │ 39.14M   │
│ child-crowdsecurity/nginx-logs   │ 239.52M │ 239.52M │ 120      │
│ crowdsecurity/cdn-whitelist      │ 844     │ 844     │ -        │
│ crowdsecurity/dateparse-enrich   │ 239.52M │ 239.52M │ -        │
│ crowdsecurity/geoip-enrich       │ 239.52M │ 239.52M │ -        │
│ crowdsecurity/http-logs          │ 239.52M │ 239.52M │ -        │
│ crowdsecurity/nginx-logs         │ 239.52M │ 239.52M │ 40       │
│ crowdsecurity/non-syslog         │ 239.52M │ 239.52M │ -        │
│ crowdsecurity/rdns               │ 844     │ 844     │ -        │
│ crowdsecurity/seo-bots-whitelist │ 844     │ 844     │ -        │
│ crowdsecurity/whitelists         │ 958.09M │ 958.09M │ -        │
╰──────────────────────────────────┴─────────┴─────────┴──────────╯
```

Prometheus metrics

Can't do since it crashed, but see Prometheus graphs above.

Related custom configs versions (if applicable) : notification plugins, custom scenarios, parsers etc.

No response


LaurenceJJones commented 8 months ago

Could you provide the configuration file? I know you had a previous issue, so it would be interesting to know whether you increased the number of routines, including the bucket routines.

Edit: It would also help to know what type of applications you are proxying via nginx, because some applications generate a lot of log lines (WebSocket, long/short polling), and these log lines will keep short-lived buckets alive.

buixor commented 8 months ago

Hello @laurentgoudet !

(thanks again for the very detailed issue)

Looking at what you said, specifically:

> The weird thing is that cs_bucket_underflowed_total reports those are underflowed, but cs_buckets keeps increasing.

And looking at the code, the cs_buckets metric is decremented when the routine exits, while the underflow counter is incremented a bit before that. A plausible explanation is that the routine gets stuck trying to send the underflow information back (i.e. leaky.AllOut is congested); however, this sounds quite unusual.
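To make the suspected mechanism concrete, here is a minimal Python simulation. It is not CrowdSec's Go code: `Metrics`, `bucket_routine`, `output_routine` and `simulate` are hypothetical names, and a bounded `queue.Queue` stands in for the `leaky.AllOut` channel. It assumes, as described above, that the underflow counter is bumped before the send and that the gauge is only decremented once the routine gets past the send.

```python
import queue
import threading
import time


class Metrics:
    """Two toy metrics, named after the real ones only for readability."""

    def __init__(self):
        self.lock = threading.Lock()
        self.buckets = 0            # gauge, stands in for cs_buckets
        self.underflowed_total = 0  # counter, stands in for cs_bucket_underflowed_total


def bucket_routine(metrics, all_out):
    """Hypothetical bucket lifecycle: count the underflow, report it, then exit."""
    with metrics.lock:
        metrics.buckets += 1
    with metrics.lock:
        metrics.underflowed_total += 1  # counted before the send
    all_out.put("underflow")            # blocks while the queue is full
    with metrics.lock:
        metrics.buckets -= 1            # only reached once the send went through


def output_routine(all_out):
    """Consumer that drains the output queue forever."""
    while True:
        all_out.get()


def simulate(n_buckets, queue_size, with_consumer, settle=0.3):
    """Start n_buckets routines and return (buckets gauge, underflow counter)."""
    metrics = Metrics()
    all_out = queue.Queue(maxsize=queue_size)
    if with_consumer:
        threading.Thread(target=output_routine, args=(all_out,), daemon=True).start()
    for _ in range(n_buckets):
        threading.Thread(
            target=bucket_routine, args=(metrics, all_out), daemon=True
        ).start()
    time.sleep(settle)  # give every routine time to finish or to block
    with metrics.lock:
        return metrics.buckets, metrics.underflowed_total


if __name__ == "__main__":
    print("with consumer:   ", simulate(10, 2, with_consumer=True))
    print("without consumer:", simulate(10, 2, with_consumer=False))
```

With a consumer the gauge drops back to 0. Without one, only the two routines that fit in the queue get past the send, so the gauge stays at 8 while the counter reads 10 in both runs, which is the same divergence as in the graphs.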

How many parser/buckets/output routines do you have? (output_routines would be the one in charge of consuming the leaky.AllOut channel). I suppose for this to happen, you might either need to produce a massive amount of alerts, and/or have postoverflows that do DNS resolutions or something expensive (that would slow down the ability to empty this queue).

edit: @LaurenceJJones told me that it seems from your other discussion that you are using reverse dns postoverflows, and unfortunately this mechanism doesn't have any cache and will be super duper slow - especially if you generate a lot of events. Can you try to disable it for now? It sounds like a valid suspect.

laurentgoudet commented 8 months ago

> Could you provide the configuration file? I know you had a previous issue, so it would be interesting to know whether you increased the number of routines, including the bucket routines.
>
> Edit: It would also help to know what type of applications you are proxying via nginx, because some applications generate a lot of log lines (WebSocket, long/short polling), and these log lines will keep short-lived buckets alive.

Sure. I haven't done a lot of customization yet, so the whole config is basically (environment-specific certs are omitted):

```yaml
crowdsec_agent:
  image: 'crowdsecurity/crowdsec'
  image_tag: 'v1.5.5'
  environment:
    LEVEL_INFO: true
    COLLECTIONS: 'crowdsecurity/nginx crowdsecurity/whitelist-good-actors'
    PARSERS: 'crowdsecurity/geoip-enrich crowdsecurity/dateparse-enrich'
    DISABLE_PARSERS: 'crowdsecurity/cri-logs crowdsecurity/docker-logs'
    DISABLE_SCENARIOS: 'crowdsecurity/http-bad-user-agent'
    DISABLE_LOCAL_API: true
    LOCAL_API_URL: "https://127.0.0.1:7671"
    USE_TLS: true
    CACERT_FILE: /etc/crowdsec/ca_bundle.pem
    CLIENT_CERT_FILE: /etc/crowdsec/agent.pem
    CLIENT_KEY_FILE: /etc/crowdsec/agent-key.pem
  files:
    /etc/crowdsec/config.yaml.local:
      format: 'yaml'
      content:
        common:
          log_media: 'file'
          log_dir: /var/log/
          log_max_size: 10
          log_max_age: 7
          log_max_files: 3
          compress_logs: true
    /etc/crowdsec/acquis.yaml:
      format: 'yaml'
      content:
        source: 'file'
        filenames:
          - '/logs/nginx/access.log'
        labels:
          type: 'nginx'
    /staging/etc/crowdsec/parsers/s02-enrich/flnltd-internal-ips-whitelist.yaml:
      format: 'yaml'
      content:
        name: crowdsecurity/whitelists
        description: Whitelist events from internal IP addresses
        whitelist:
          reason: internal ip ranges
          expression:
          - |
            evt.Meta.source_ip in [
              <redacted, list in internal NAT gateways>,
            ] && evt.Parsed.http_user_agent != 'crowdsec'
    /staging/etc/crowdsec/parsers/s02-enrich/flnltd-ignore-403s.yaml:
      format: 'yaml'
      content:
        name: crowdsecurity/whitelists
        description: Ignore 403s
        whitelist:
          reason: ignore 403s as we use that status code for blocked sources
          expression:
          - evt.Parsed.status == '403'
    /staging/etc/crowdsec/scenarios/flnltd-http-bruteforce-login.yaml:
      format: 'yaml'
      content:
        type: leaky
        name: flnltd/http-bruteforce-login
        description: "Detect advanced brute-force patterns on the login endpoint"
        filter: "evt.Parsed.request == '<redacted>' && evt.Parsed.status == '401'"
        groupby: evt.Meta.SourceRange
        capacity: 10
        leakspeed: "1m"
        blackhole: 10m
        labels:
          service: http
          type: bruteforce
          remediation: true
        scope:
          type: Range
          expression: evt.Meta.SourceRange
    /staging/etc/crowdsec/parsers/s02-enrich/flnltd-gptbot-whitelist.yaml:
      format: 'yaml'
      content:
        name: crowdsecurity/whitelists
        description: "Whitelist OpenAI GPTBot"
        whitelist:
          reason: "OpenAI GPTbot triggers crowdsecurity/http-probing when crawling the site"
          expression:
            - "any(File('gptbot-ranges.txt'), { len(#) > 0 && IpInRange(evt.Meta.source_ip ,#)})"
        data:
          - source_url: https://openai.com/gptbot-ranges.txt
            dest_file: gptbot-ranges.txt
            type: string
    # This is needed as for now the data external source above is only used when the parser is installed from the hub
    # https://docs.crowdsec.net/docs/next/parsers/format/#data
    /var/lib/crowdsec/data/gptbot-ranges.txt:
      format: 'raw'
      content: |
        52.230.152.0/24
        52.233.106.0/24
  volumes:
    - guest: '/logs/nginx:ro'
      host: '/mnt/logs/nginx'
    - guest: '/var/log'
      host: '/mnt/logs'
  ports:
    # Prometheus metrics
    - host: 6060
      guest: 6060
```

> How many parser/buckets/output routines do you have? (output_routines would be the one in charge of consuming the leaky.AllOut channel). I suppose for this to happen, you might either need to produce a massive amount of alerts, and/or have postoverflows that do DNS resolutions or something expensive (that would slow down the ability to empty this queue).

I have not customized output_routines, so it'd be the default value.

> edit: @LaurenceJJones told me that it seems from your other discussion that you are using reverse dns postoverflows, and unfortunately this mechanism doesn't have any cache and will be super duper slow - especially if you generate a lot of events. Can you try to disable it for now? It sounds like a valid suspect.

Yes, I'm using crowdsecurity/whitelist-good-actors, which uses rdns. I could switch to whitelisting public IP ranges though (e.g. https://developers.google.com/static/search/apis/ipranges/googlebot.json, https://www.bing.com/toolbox/bingbot.json..), since I can understand how running RDNS queries at the edge is not ideal.

Having said that, looking at the overflows during the incident, their number was still very low, i.e. if I am reading the graph right it peaked at 1 overflow every 10s (per host):

If I assume that cs_bucket_overflowed_total contains the overflows later discarded by crowdsecurity/whitelist-good-actors (since it's called "post" overflow), then the number of reverse DNS queries seems really minimal, so I am not sure it's the right suspect?
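For reference, a range-based whitelist check could look roughly like the Python sketch below. It only illustrates the idea and is not how CrowdSec evaluates whitelists. The JSON shape (a `prefixes` list with `ipv4Prefix`/`ipv6Prefix` keys) is an assumption about the published crawler range files, the sample prefixes are documentation ranges rather than real crawler ranges, and `load_networks` and `is_whitelisted` are hypothetical helpers.

```python
import ipaddress
import json

# Placeholder document; the key names are an assumed file layout and the
# prefixes are reserved documentation ranges, not real crawler ranges.
SAMPLE_RANGES = """
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
"""


def load_networks(document):
    """Parse a ranges document into a list of ip_network objects."""
    data = json.loads(document)
    networks = []
    for entry in data.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks


def is_whitelisted(ip, networks):
    """Return True when ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    # Membership across IP versions is simply False, so mixed lists are fine.
    return any(addr in net for net in networks)


if __name__ == "__main__":
    nets = load_networks(SAMPLE_RANGES)
    for candidate in ("192.0.2.10", "198.51.100.1", "2001:db8::1"):
        print(candidate, is_whitelisted(candidate, nets))
```

Unlike a reverse DNS lookup, this is a pure in-memory check, so it does not add network latency to the post-overflow stage.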

The other issue I am facing with the deployment is https://github.com/crowdsecurity/crowdsec/issues/2669 so possibly both are somehow related, although not sure how.

laurentgoudet commented 8 months ago

> Nothing in the CrowdSec logs during the event/when expired buckets started not to be cleaned up anymore. Also the attack triggered ~2k alerts, which does not seem that many to me.

Actually, I spoke too soon, my bad. Looking back at the CrowdSec logs, it seems that the issue comes from my custom flnltd/http-bruteforce-login scenario, which does:

```yaml
scope:
  type: Range
  expression: evt.Meta.SourceRange
```

For some reason (probably when the entry is missing from the GeoLite2 DB) the source range can be empty, as shown by the `Range performed 'flnltd/http-bruteforce-login' (11 events over 40.366500614s) at 2023-12-21 23:29:25.593189181 +0000 UTC` logs below, which align with our external LB going out of memory:

When that happens, the agent seems to be retrying to send the alert to the LAPI a large number of times (100000?), which I am guessing causes the output goroutines to get stuck, as per the `stuck for 148.510521ms sending event to 2355c624f9a2271765356bf0b9c1726d81215572 (sigclosed:1 failed_sent:99998 attempts:100000)` logs below:

I guess the 100000 retry figure might be a bit extreme? In the meantime I can probably work around the issue with filter: evt.Meta.SourceRange - I couldn't pin down which IP triggers that though.
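The workaround boils down to refusing to build a `Range` alert without a usable range. As a minimal sketch of that check (in Python, with a hypothetical `is_valid_scope_value` helper; this is not the validation LAPI actually performs), an empty or unparsable scope value can be detected up front:

```python
import ipaddress


def is_valid_scope_value(scope_type, value):
    """Hypothetical pre-check: is value usable for the given alert scope?"""
    if not value:
        # Covers the empty SourceRange seen in the logs.
        return False
    kind = scope_type.lower()
    try:
        if kind == "ip":
            ipaddress.ip_address(value)
        elif kind == "range":
            # strict=False accepts CIDRs written with host bits set.
            ipaddress.ip_network(value, strict=False)
    except ValueError:
        return False
    # Other scope types are only checked for being non-empty.
    return True


if __name__ == "__main__":
    for scope, value in [
        ("Range", ""),
        ("Range", "203.0.113.0/24"),
        ("Range", "not-a-range"),
        ("Ip", "203.0.113.7"),
    ]:
        print(scope, repr(value), is_valid_scope_value(scope, value))
```

Here the empty string and `not-a-range` are rejected, while the CIDR and the plain IP pass. Filtering on `evt.Meta.SourceRange` in the scenario has the same effect for the empty case: events without a range never reach the bucket.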

buixor commented 8 months ago

Seems to be a good suspect. Will try to reproduce on our side and keep you posted :)

laurentgoudet commented 8 months ago

> In the meantime I can probably work around the issue with filter: evt.Meta.SourceRange

I've deployed that workaround on a canary host & CrowdSec has been running fine so far, although most of the bad actors still seem to be on holiday 🏖️.

I'll keep monitoring it but that GH issue can probably be closed now, as the root cause is tracked in https://github.com/crowdsecurity/crowdsec/issues/2687.
