crowdsecurity / crowdsec

CrowdSec - the open-source and participative security solution offering crowdsourced protection against malicious IPs and access to the most advanced real-world CTI.
https://crowdsec.net
MIT License

Slow log ingestion/alerts (up to 1h late) with 1k/s log ingestion #2669

Open laurentgoudet opened 6 months ago

laurentgoudet commented 6 months ago

What happened?

I have a fairly large CrowdSec installation with the agent running on 3 external LBs (nginx hosts), receiving 10M req/hour in total, i.e. around ~1,000 req/s each (not extreme either).

Upon deploying CrowdSec, I had to scale up those hosts to c6in.2xlarge (8 vCPUs) due to the performance hit, but those have been running just fine at ~30% CPU utilization since then.

While I was looking into why CrowdSec did not seem to work as expected, I realized that the alerts fired are late by up to an hour compared to the event times themselves. For instance, I have the following log entry at 00:51 (NY time):

66.70.188.230 - - [18/Dec/2023:00:51:38 -0500] "POST /pcidss/report?set=1&sid=loginchallengeresponse1requestbody&type=allprofiles&username=nsroot HTTP/1.1" 404 23809 "-" "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)" 66.70.188.230, 66.70.188.230, 167.82.166.26, 104.156.83.90 fastly.freelancer.com http server_name="fastly.freelancer.com" server_addr=172.30.13.216 server_port=80 request_id=871a1b689385d79e1ced1e71dbb87a44 request_length=1458 request_time=0.080 proxy_host=varnish_gaf_pool proxy_protocol_addr="-" sent_http_cache_control=(s-maxage=3600) ssl_cipher=- ssl_protocol=- ssl_server_name=- upstream_addr=[172.30.15.203:80] upstream_connect_time=[0.000] upstream_header_time=[0.080] upstream_response_time=[0.080] upstream_status=[404] upstream_http_x_cache=MISS user_id_cookie="" user_id_header="" fastly_request_id="ee3db5a2-653a-42a6-9381-aa088fbd351c" amzn_trace_id="Self=1-657fddea-7f92cee256409bd126915440;Root=ee3db5a2-653a-42a6-9381-aa088fbd351c;AddedBy=Fastly" realip_remote_addr="172.30.6.245" freelancer_app_build_timestamp="-" freelancer_app_locale="-" freelancer_app_name="-"

which shows up in the events list of the following decision:

Ip 66.70.188.230 performed 'crowdsecurity/http-probing' (14 events over 31.342108034s) at 2023-12-18 06:27:38.36585845 +0000 UTC

6:27 UTC is 1:27 NY time (UTC-5), so it took 36 minutes of abuse traffic for that rule to fire.

I initially thought it could be due to CrowdSec ingesting the wrong log file (due to log rotation; I am not sure what CrowdSec's behavior is when the underlying log file gets rotated), but cscli metrics shows the lines read increasing over time as expected:

$ sudo docker exec -it crowdsec_agent cscli metrics

Acquisition Metrics:
╭─────────────────────────────┬────────────┬──────────────┬────────────────┬────────────────────────╮
│           Source            │ Lines read │ Lines parsed │ Lines unparsed │ Lines poured to bucket │
├─────────────────────────────┼────────────┼──────────────┼────────────────┼────────────────────────┤
│ file:/logs/nginx/access.log │ 239.52M    │ 239.52M      │ 40             │ 129.73M                │
╰─────────────────────────────┴────────────┴──────────────┴────────────────┴────────────────────────╯

Bucket Metrics:
╭─────────────────────────────────────────────┬───────────────┬───────────┬──────────────┬─────────┬─────────╮
│                   Bucket                    │ Current Count │ Overflows │ Instantiated │ Poured  │ Expired │
├─────────────────────────────────────────────┼───────────────┼───────────┼──────────────┼─────────┼─────────┤
│ LePresidente/http-generic-401-bf            │ 1             │ 85        │ 35.32k       │ 50.69k  │ 35.23k  │
│ crowdsecurity/CVE-2019-18935                │ -             │ 15        │ 15           │ -       │ -       │
│ crowdsecurity/CVE-2022-26134                │ -             │ 133       │ 133          │ -       │ -       │
│ crowdsecurity/CVE-2022-35914                │ -             │ 3         │ 3            │ -       │ -       │
│ crowdsecurity/CVE-2022-41082                │ -             │ 6         │ 6            │ -       │ -       │
│ crowdsecurity/CVE-2023-22515                │ -             │ 36        │ 36           │ -       │ -       │
│ crowdsecurity/CVE-2023-22518                │ -             │ 3         │ 3            │ -       │ -       │
│ crowdsecurity/CVE-2023-49103                │ -             │ 12        │ 12           │ -       │ -       │
│ crowdsecurity/apache_log4j2_cve-2021-44228  │ -             │ 63        │ 63           │ -       │ -       │
│ crowdsecurity/f5-big-ip-cve-2020-5902       │ -             │ 26        │ 26           │ -       │ -       │
│ crowdsecurity/fortinet-cve-2018-13379       │ -             │ 9         │ 9            │ -       │ -       │
│ crowdsecurity/fortinet-cve-2022-40684       │ -             │ 4         │ 4            │ -       │ -       │
│ crowdsecurity/http-backdoors-attempts       │ -             │ 3         │ 400          │ 403     │ 397     │
│ crowdsecurity/http-crawl-non_statics        │ 1.66k         │ 271       │ 25.29M       │ 129.08M │ 25.28M  │
│ crowdsecurity/http-cve-2021-41773           │ -             │ 7         │ 7            │ -       │ -       │
│ crowdsecurity/http-path-traversal-probing   │ -             │ 2.34k     │ 3.51k        │ 12.01k  │ 1.17k   │
│ crowdsecurity/http-probing                  │ 82            │ 1.04k     │ 448.26k      │ 542.12k │ 447.13k │
│ crowdsecurity/http-sensitive-files          │ -             │ 83        │ 1.29k        │ 2.17k   │ 1.21k   │
│ crowdsecurity/http-sqli-probbing-detection  │ -             │ -         │ 3.06k        │ 4.13k   │ 3.06k   │
│ crowdsecurity/http-xss-probbing             │ -             │ 1         │ 1.31k        │ 2.43k   │ 1.31k   │
│ crowdsecurity/jira_cve-2021-26086           │ -             │ 20        │ 20           │ -       │ -       │
│ crowdsecurity/spring4shell_cve-2022-22965   │ -             │ 10        │ 10           │ -       │ -       │
│ crowdsecurity/vmware-vcenter-vmsa-2021-0027 │ -             │ 1         │ 1            │ -       │ -       │
│ flnltd/http-bruteforce-login                │ 26            │ 13        │ 18.92k       │ 37.10k  │ 18.88k  │
╰─────────────────────────────────────────────┴───────────────┴───────────┴──────────────┴─────────┴─────────╯

Parser Metrics:
╭──────────────────────────────────┬─────────┬─────────┬──────────╮
│             Parsers              │  Hits   │ Parsed  │ Unparsed │
├──────────────────────────────────┼─────────┼─────────┼──────────┤
│ child-crowdsecurity/http-logs    │ 718.57M │ 679.43M │ 39.14M   │
│ child-crowdsecurity/nginx-logs   │ 239.52M │ 239.52M │ 120      │
│ crowdsecurity/cdn-whitelist      │ 844     │ 844     │ -        │
│ crowdsecurity/dateparse-enrich   │ 239.52M │ 239.52M │ -        │
│ crowdsecurity/geoip-enrich       │ 239.52M │ 239.52M │ -        │
│ crowdsecurity/http-logs          │ 239.52M │ 239.52M │ -        │
│ crowdsecurity/nginx-logs         │ 239.52M │ 239.52M │ 40       │
│ crowdsecurity/non-syslog         │ 239.52M │ 239.52M │ -        │
│ crowdsecurity/rdns               │ 844     │ 844     │ -        │
│ crowdsecurity/seo-bots-whitelist │ 844     │ 844     │ -        │
│ crowdsecurity/whitelists         │ 958.09M │ 958.09M │ -        │
╰──────────────────────────────────┴─────────┴─────────┴──────────╯

Also, logrotate is configured for hourly rotation, so if that were indeed the issue, all the alerts would be late by at least an hour. For reference, those log files grow to ~5 GB before being rotated/compressed:

$ ls -lha access.log*
-rw-r----- 1 www-data www-data 1.3G Dec 18 05:35 access.log
-rw-r----- 1 www-data www-data 4.5G Dec 18 05:18 access.log.1
-rw-r----- 1 www-data www-data 585M Dec 18 04:18 access.log.2.gz
-rw-r----- 1 www-data www-data 614M Dec 18 03:18 access.log.3.gz
-rw-r----- 1 www-data www-data 640M Dec 18 02:18 access.log.4.gz
-rw-r----- 1 www-data www-data 592M Dec 18 01:18 access.log.5.gz
-rw-r----- 1 www-data www-data 528M Dec 18 00:17 access.log.6.gz
-rw-r----- 1 www-data www-data 436M Dec 17 23:17 access.log.7.gz
-rw-r----- 1 www-data www-data 391M Dec 17 22:17 access.log.8.gz

Note that I am using a distributed CrowdSec infrastructure here, i.e. the agents ingest the logs before they are shipped to ELK, which essentially means that all the leaky-bucket rules have 3x the capacity / 1/3 the leak speed, since each agent only sees a third of the traffic. I am fine with that compromise (although having a global way to adjust the rules based on the agent count would be awesome), but despite that the agents seem to have trouble scaling.
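For what it's worth, the workaround I'm considering is to shadow a hub scenario with a local copy whose thresholds are divided by the number of agents. This is only a sketch of the idea: the file name is made up, and the capacity/leakspeed/filter values below are illustrative, not the upstream defaults.

```yaml
# /etc/crowdsec/scenarios/http-probing-3agents.yaml -- illustrative local scenario.
# Copy the installed crowdsecurity/http-probing definition, divide the thresholds
# by the number of load-balanced agents (3 here), and remove the hub scenario
# (e.g. cscli scenarios remove crowdsecurity/http-probing) so they don't both fire.
type: leaky
name: custom/http-probing-3agents
description: "http-probing with thresholds compensated for 3 agents"
filter: "evt.Meta.service == 'http'"   # placeholder -- reuse the real filter from the hub scenario
groupby: "evt.Meta.source_ip"
capacity: 3          # e.g. if the upstream capacity were 10, use ~10/3
leakspeed: "10s"     # keep (or slightly shorten) the upstream leakspeed
blackhole: 5m
labels:
  service: http
  type: scan
  remediation: true
```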

I have also disabled crowdsecurity/http-bad-user-agent as I plan to handle those differently, and it was triggering way too frequently for my custom bouncer to scale.

From https://www.crowdsec.net/blog/how-to-process-billions-daily-events-with-crowdsec I have seen that bucket_routines & parser_routines can be tweaked, but the documentation in https://doc.crowdsec.net/docs/next/configuration/crowdsec_configuration is not very helpful on how to set those. Looking at the CPU consumption, it seems that there are ~13 worker processes, none of which looks overloaded:

Screenshot 2023-12-18 at 12 05 27
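For reference, those two knobs live under the `crowdsec_service` section of `config.yaml` as far as I can tell; the snippet below is just a sketch of my understanding of the layout (the values are examples, not recommendations):

```yaml
# /etc/crowdsec/config.yaml (excerpt)
crowdsec_service:
  acquisition_path: /etc/crowdsec/acquis.yaml
  parser_routines: 4     # goroutines dedicated to parsing log lines (default: 1)
  buckets_routines: 4    # goroutines pouring parsed events into scenarios (default: 1)
```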

In short:

What did you expect to happen?

I would expect CrowdSec to ingest logs timely in order to quickly react to abusive traffic.

How can we reproduce it (as minimally and precisely as possible)?

Anything else we need to know?

No response

Crowdsec version

```console
$ cscli version
2023/12/18 10:49:06 version: v1.5.5-d2d788c5dc0a9e387635276623c6781774a9dfd4
2023/12/18 10:49:06 Codename: alphaga
2023/12/18 10:49:06 BuildDate: 2023-10-24_08:13:35
2023/12/18 10:49:06 GoVersion: 1.21.3
2023/12/18 10:49:06 Platform: docker
2023/12/18 10:49:06 libre2: C++
2023/12/18 10:49:06 Constraint_parser: >= 1.0, <= 2.0
2023/12/18 10:49:06 Constraint_scenario: >= 1.0, < 3.0
2023/12/18 10:49:06 Constraint_api: v1
2023/12/18 10:49:06 Constraint_acquis: >= 1.0, < 2.0
```

OS version

```console
$ sudo docker exec -it crowdsec_agent cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.18.4
PRETTY_NAME="Alpine Linux v3.18"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://gitlab.alpinelinux.org/alpine/aports/-/issues"
$ sudo docker exec -it crowdsec_agent uname -a
Linux ip-172-30-18-14 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 Linux
```

Enabled collections and parsers

```console
$ cscli hub list -o raw
crowdsecurity/base-http-scenarios,"enabled,tainted",0.6,http common : scanners detection,collections
crowdsecurity/http-cve,enabled,2.5,Detect CVE exploitation in http logs,collections
crowdsecurity/linux,enabled,0.2,core linux support : syslog+geoip+ssh,collections
crowdsecurity/nginx,"enabled,tainted",0.2,nginx support : parser and generic http scenarios,collections
crowdsecurity/sshd,"enabled,update-available",0.2,sshd support : parser and brute-force detection,collections
crowdsecurity/whitelist-good-actors,enabled,0.1,Good actors whitelists,collections
crowdsecurity/dateparse-enrich,enabled,0.2,,parsers
crowdsecurity/geoip-enrich,enabled,0.2,"Populate event with geoloc info : as, country, coords, source range.",parsers
crowdsecurity/http-logs,enabled,1.2,"Parse more Specifically HTTP logs, such as HTTP Code, HTTP path, HTTP args and if its a static ressource",parsers
crowdsecurity/nginx-logs,enabled,1.5,Parse nginx access and error logs,parsers
crowdsecurity/sshd-logs,enabled,2.2,Parse openSSH logs,parsers
crowdsecurity/syslog-logs,enabled,0.8,,parsers
crowdsecurity/whitelists,enabled,0.2,Whitelist events from private ipv4 addresses,parsers
flnltd-gptbot-whitelist.yaml,"enabled,local",n/a,,parsers
flnltd-ignore-403s.yaml,"enabled,local",n/a,,parsers
flnltd-internal-ips-whitelist.yaml,"enabled,local",n/a,,parsers
crowdsecurity/CVE-2019-18935,enabled,0.2,Detect Telerik CVE-2019-18935 exploitation attempts,scenarios
crowdsecurity/CVE-2022-26134,enabled,0.2,Detect CVE-2022-26134 exploits,scenarios
crowdsecurity/CVE-2022-35914,enabled,0.2,Detect CVE-2022-35914 exploits,scenarios
crowdsecurity/CVE-2022-37042,enabled,0.2,Detect CVE-2022-37042 exploits,scenarios
crowdsecurity/CVE-2022-40684,enabled,0.3,Detect cve-2022-40684 exploitation attempts,scenarios
crowdsecurity/CVE-2022-41082,enabled,0.4,Detect CVE-2022-41082 exploits,scenarios
crowdsecurity/CVE-2022-41697,enabled,0.2,Detect CVE-2022-41697 enumeration,scenarios
crowdsecurity/CVE-2022-42889,enabled,0.3,Detect CVE-2022-42889 exploits (Text4Shell),scenarios
crowdsecurity/CVE-2022-44877,enabled,0.3,Detect CVE-2022-44877 exploits,scenarios
crowdsecurity/CVE-2022-46169,enabled,0.2,Detect CVE-2022-46169 brute forcing,scenarios
crowdsecurity/CVE-2023-22515,enabled,0.1,Detect CVE-2023-22515 exploitation,scenarios
crowdsecurity/CVE-2023-22518,enabled,0.2,Detect CVE-2023-22518 exploits,scenarios
crowdsecurity/CVE-2023-49103,enabled,0.2,Detect owncloud CVE-2023-49103 exploitation attempts,scenarios
crowdsecurity/apache_log4j2_cve-2021-44228,enabled,0.5,Detect cve-2021-44228 exploitation attemps,scenarios
crowdsecurity/f5-big-ip-cve-2020-5902,enabled,0.2,Detect cve-2020-5902 exploitation attemps,scenarios
crowdsecurity/fortinet-cve-2018-13379,enabled,0.3,Detect cve-2018-13379 exploitation attemps,scenarios
crowdsecurity/grafana-cve-2021-43798,enabled,0.2,Detect cve-2021-43798 exploitation attemps,scenarios
crowdsecurity/http-backdoors-attempts,enabled,0.4,Detect attempt to common backdoors,scenarios
crowdsecurity/http-crawl-non_statics,enabled,0.4,Detect aggressive crawl from single ip,scenarios
crowdsecurity/http-cve-2021-41773,enabled,0.2,cve-2021-41773,scenarios
crowdsecurity/http-cve-2021-42013,enabled,0.2,cve-2021-42013,scenarios
crowdsecurity/http-generic-bf,enabled,0.5,Detect generic http brute force,scenarios
crowdsecurity/http-open-proxy,enabled,0.4,Detect scan for open proxy,scenarios
crowdsecurity/http-path-traversal-probing,enabled,0.3,Detect path traversal attempt,scenarios
crowdsecurity/http-probing,enabled,0.3,Detect site scanning/probing from a single ip,scenarios
crowdsecurity/http-sensitive-files,enabled,0.3,"Detect attempt to access to sensitive files (.log, .db ..) or folders (.git)",scenarios
crowdsecurity/http-sqli-probing,enabled,0.3,A scenario that detects SQL injection probing with minimal false positives,scenarios
crowdsecurity/http-xss-probing,enabled,0.3,A scenario that detects XSS probing with minimal false positives,scenarios
crowdsecurity/jira_cve-2021-26086,enabled,0.2,Detect Atlassian Jira CVE-2021-26086 exploitation attemps,scenarios
crowdsecurity/netgear_rce,enabled,0.3,Detect Netgear RCE DGN1000/DGN220 exploitation attempts,scenarios
crowdsecurity/nginx-req-limit-exceeded,enabled,0.3,Detects IPs which violate nginx's user set request limit.,scenarios
crowdsecurity/pulse-secure-sslvpn-cve-2019-11510,enabled,0.3,Detect cve-2019-11510 exploitation attemps,scenarios
crowdsecurity/spring4shell_cve-2022-22965,enabled,0.3,Detect cve-2022-22965 probing,scenarios
crowdsecurity/ssh-bf,"enabled,update-available",0.2,Detect ssh bruteforce,scenarios
crowdsecurity/ssh-slow-bf,"enabled,update-available",0.3,Detect slow ssh bruteforce,scenarios
crowdsecurity/thinkphp-cve-2018-20062,enabled,0.4,Detect ThinkPHP CVE-2018-20062 exploitation attemps,scenarios
crowdsecurity/vmware-cve-2022-22954,enabled,0.3,Detect Vmware CVE-2022-22954 exploitation attempts,scenarios
crowdsecurity/vmware-vcenter-vmsa-2021-0027,enabled,0.2,Detect VMSA-2021-0027 exploitation attemps,scenarios
flnltd-http-bruteforce-login.yaml,"enabled,local",n/a,,scenarios
ltsich/http-w00tw00t,enabled,0.2,detect w00tw00t,scenarios
crowdsecurity/cdn-whitelist,enabled,0.4,Whitelist CDN providers,postoverflows
crowdsecurity/rdns,enabled,0.3,Lookup the DNS associated to the source IP only for overflows,postoverflows
crowdsecurity/seo-bots-whitelist,enabled,0.4,Whitelist good search engine crawlers,postoverflows
```

Acquisition config

```console
$ sudo docker exec -it crowdsec_agent cat /etc/crowdsec/acquis.yaml /etc/crowdsec/acquis.d/*
---
source: file
filenames:
  - "/logs/nginx/access.log"
labels:
  type: nginx
cat: can't open '/etc/crowdsec/acquis.d/*': No such file or directory
```

Config show

```console
$ cscli config show
Global:
   - Configuration Folder : /etc/crowdsec
   - Data Folder : /var/lib/crowdsec/data
   - Hub Folder : /etc/crowdsec/hub
   - Simulation File : /etc/crowdsec/simulation.yaml
   - Log Folder : /var/log/
   - Log level : info
   - Log Media : file
Crowdsec:
  - Acquisition File : /etc/crowdsec/acquis.yaml
  - Parsers routines : 1
  - Acquisition Folder : /etc/crowdsec/acquis.d
cscli:
  - Output : human
  - Hub Branch :
  - Hub Folder : /etc/crowdsec/hub
API Client:
  - URL : https://127.0.0.1:7671/
  - Login :
  - Credentials File : /etc/crowdsec/local_api_credentials.yaml
Local API Server (disabled):
  - Listen URL : 0.0.0.0:8080
  - Profile File : /etc/crowdsec/profiles.yaml
  - Trusted IPs:
      - 127.0.0.1
      - ::1
  - Database:
      - Type : sqlite
      - Path : /var/lib/crowdsec/data/crowdsec.db
      - Flush age : 7d
      - Flush size : 5000
```

Prometheus metrics

(Same `cscli metrics` output as shown in the issue description above.)

Related custom configs versions (if applicable) : notification plugins, custom scenarios, parsers etc.

github-actions[bot] commented 6 months ago

@laurentgoudet: Thanks for opening an issue, it is currently awaiting triage.

In the meantime, you can:

  1. Check Crowdsec Documentation to see if your issue can be self resolved.
  2. You can also join our Discord.
  3. Check Releases to make sure your agent is on the latest version.
Details: I am a bot created to help the [crowdsecurity](https://github.com/crowdsecurity) developers manage community feedback and contributions. You can check out my [manifest file](https://github.com/crowdsecurity/crowdsec/blob/master/.github/governance.yml) to understand my behavior and what I can do. If you want to use this for your project, you can check out the [BirthdayResearch/oss-governance-bot](https://github.com/BirthdayResearch/oss-governance-bot) repository.
blotus commented 6 months ago

Hello,

Thanks for the detailed report!

We are going to investigate more on our side, but in theory 1k requests per second is something that should be quite easily handled without specific configuration.

Crowdsec already exposes timing metrics about the parsing stage and the scenarios evaluation (only in prometheus, as we didn't find a good way to display the values in the cli).

You should be able to see them with our grafana dashboards: https://github.com/crowdsecurity/grafana-dashboards/tree/master/dashboards_v5 (more specifically, the Crowdsec Details per Instance, in the Buckets and Parsing sections)

If you have a custom dashboard, the metrics are named:

  • cs_parsing_time_seconds_bucket: there is one time series per instance/datasource/log file
  • cs_bucket_pour_seconds_bucket: there is one time series per instance/datasource/log file
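If you just want to eyeball the raw histograms, you can also scrape the agent's Prometheus endpoint directly (a quick check, assuming the default prometheus listen address/port):

```console
$ curl -s http://localhost:6060/metrics | grep -E 'cs_(parsing_time|bucket_pour)_seconds_bucket' | head
```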

Was your reproduction also running in docker? I seem to remember some performance issues a long, long time ago with mounting logs in a container, but I cannot find the thread at the moment.

Regarding bucket_routines and parser_routines: the vast majority of the time, crowdsec is not the only service running on the machine, so defaulting to the number of cores might introduce a performance impact for the other services. I do agree that the documentation is lacking in that regard; we'll have a look at expanding it a bit on how/when to change the default values.

laurentgoudet commented 6 months ago

Crowdsec already exposes timing metrics about the parsing stage and the scenarios evaluation (only in prometheus, as we didn't find a good way to display the values in the cli).

You should be able to see them with our grafana dashboards: https://github.com/crowdsecurity/grafana-dashboards/tree/master/dashboards_v5 (more specifically, the Crowdsec Details per Instance, in the Buckets and Parsing sections)

Right thanks, yeah I've deployed those Grafana dashboards, although TBH I am not entirely sure how one is meant to read those heatmaps:

Screenshot 2023-12-19 at 16 50 33 Screenshot 2023-12-19 at 16 50 03

If you have a custom dashboard, the metrics are named:

  • cs_parsing_time_seconds_bucket: there is one time series per instance/datasource/log file
  • cs_bucket_pour_seconds_bucket: there is one time series per instance/datasource/log file

Looking at those underlying metrics, my understanding is that both the parsing and bucket-pouring times are < 5 ms for most buckets?

Screenshot 2023-12-19 at 16 52 04 Screenshot 2023-12-19 at 16 53 04

Was your reproduction also running in docker? I seem to remember some performance issues a long, long time ago with mounting logs in a container, but I cannot find the thread at the moment.

Yes I am running https://hub.docker.com/r/crowdsecurity/crowdsec on the nginx hosts while mounting the log folder into the container, in order to isolate CrowdSec from the rest of the host. The logs do seem to be up-to-date in the container though:

$ date && sudo tail -n1 /mnt/logs/nginx/access.log && sudo docker exec -it crowdsec_agent tail -n1 /logs/nginx/access.log
Tue Dec 19 11:01:21 EST 2023
115.167.50.171 - - [19/Dec/2023:11:01:21 -0500] "GET /api/memberships/0.1/benefits?benefit_names%5B%5D=client_engagement&compact=true&new_errors=true&new_pools=true&webapp=1 HTTP/1.1" 200 202 "https://www.freelancer.com/projects/wordpress/GEO-Directory-Plugin-Configuration/details" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" 115.167.50.171, 115.167.50.171, 104.156.91.22, 104.156.83.39 fastly.freelancer.com http server_name="fastly.freelancer.com" server_addr=172.30.18.14 server_port=80 request_id=f6881bf551ed3739bbe31e8d9f310767 request_length=5133 request_time=0.024 proxy_host=rest_app_common_pool proxy_protocol_addr="-" sent_http_cache_control=(no-store, no-cache) ssl_cipher=- ssl_protocol=- ssl_server_name=- upstream_addr=[172.30.13.101:80] upstream_connect_time=[0.000] upstream_header_time=[0.024] upstream_response_time=[0.024] upstream_status=[200] upstream_http_x_cache=- user_id_cookie="72610449" user_id_header="72610449" fastly_request_id="f559cb2d-cb73-4e4b-91d2-4f695f2837fe" amzn_trace_id="Self=1-6581be51-19f67dd322cda2407b73c2b7;Root=f559cb2d-cb73-4e4b-91d2-4f695f2837fe;AddedBy=Fastly" realip_remote_addr="172.30.3.177" freelancer_app_build_timestamp="1702979179" freelancer_app_locale="en" freelancer_app_name="main"
17.241.219.199 - - [19/Dec/2023:11:01:21 -0500] "GET /ajax-api/language/getLanguages.php?language_codes%5B%5D=zh&compact=true&new_errors=true&new_pools=true HTTP/1.1" 200 210 "-" "Mozilla/5.0 (Linux x64) node.js/16.17.0 v8/9.4.146.26-node.22" 17.241.219.199,17.241.219.199,167.82.171.21,104.156.87.146,17.241.219.199,172.30.18.14,172.30.12.220 www.freelancer.com http server_name="" server_addr=172.30.18.14 server_port=80 request_id=df1fb3350af2302b91d0de240191e44a request_length=578 request_time=2.040 proxy_host=varnish_gaf_pool proxy_protocol_addr="-" sent_http_cache_control=(no-cache, no-store, private) ssl_cipher=- ssl_protocol=- ssl_server_name=- upstream_addr=[172.30.12.220:80] upstream_connect_time=[0.004] upstream_header_time=[2.044] upstream_response_time=[2.044] upstream_status=[200] upstream_http_x_cache=MISS user_id_cookie="" user_id_header="" fastly_request_id="-" amzn_trace_id="-" realip_remote_addr="172.30.19.235" freelancer_app_build_timestamp="-" freelancer_app_locale="zh-CN" freelancer_app_name="main"

I've built a script to query the data that CrowdSec logs into Elasticsearch, in order to compute the time difference between the timestamp of the last event in the events array and the stop_at timestamp. This is what I get:

$ echo Time difference between stop_at and last events[] timestamp && curl -sS "http://localhost:9200/crowdsec-*/_search" -H 'Content-Type: application/json' -d'{ "size": 100, "sort": [{ "stop_at": { "order": "desc" } }] }' | jq -r '.hits.hits[] | [._source.stop_at, (._source.events | last | .timestamp)] | @tsv' | while IFS=$'\t' read -r stop_at last_event_timestamp; do stop_at_epoch=$(date -u -d "$stop_at" +%s); last_event_epoch=$(date -u -d "$last_event_timestamp" +%s); diff=$((stop_at_epoch - last_event_epoch)); hours=$((diff / 3600)); minutes=$(( (diff % 3600) / 60)); seconds=$((diff % 60)); echo "$hours hours, $minutes minutes, and $seconds seconds"; done
Time difference between stop_at and last events[] timestamp
0 hours, 24 minutes, and 16 seconds
0 hours, 54 minutes, and 31 seconds
1 hours, 15 minutes, and 6 seconds
1 hours, 4 minutes, and 46 seconds
1 hours, 4 minutes, and 46 seconds
0 hours, 36 minutes, and 49 seconds
1 hours, 1 minutes, and 16 seconds
0 hours, 52 minutes, and 49 seconds
0 hours, 20 minutes, and 15 seconds
1 hours, 12 minutes, and 45 seconds
1 hours, 8 minutes, and 37 seconds
1 hours, 3 minutes, and 30 seconds
1 hours, 3 minutes, and 1 seconds
0 hours, 53 minutes, and 27 seconds
0 hours, 22 minutes, and 16 seconds
0 hours, 22 minutes, and 14 seconds
0 hours, 20 minutes, and 50 seconds
0 hours, 18 minutes, and 48 seconds
0 hours, 19 minutes, and 32 seconds
0 hours, 16 minutes, and 43 seconds
0 hours, 18 minutes, and 30 seconds
0 hours, 15 minutes, and 45 seconds
0 hours, 16 minutes, and 25 seconds
0 hours, 12 minutes, and 31 seconds
0 hours, 12 minutes, and 24 seconds
1 hours, 9 minutes, and 59 seconds
1 hours, 5 minutes, and 56 seconds
0 hours, 31 minutes, and 50 seconds
0 hours, 55 minutes, and 55 seconds
0 hours, 41 minutes, and 50 seconds
2 hours, 9 minutes, and 4 seconds
0 hours, 34 minutes, and 37 seconds
2 hours, 6 minutes, and 18 seconds
0 hours, 32 minutes, and 9 seconds
0 hours, 32 minutes, and 55 seconds
2 hours, 2 minutes, and 37 seconds
0 hours, 29 minutes, and 7 seconds
1 hours, 59 minutes, and 12 seconds
1 hours, 55 minutes, and 15 seconds
0 hours, 24 minutes, and 21 seconds
1 hours, 49 minutes, and 7 seconds
1 hours, 45 minutes, and 23 seconds
1 hours, 42 minutes, and 21 seconds
1 hours, 39 minutes, and 35 seconds
0 hours, 8 minutes, and 4 seconds
0 hours, 4 minutes, and 17 seconds
1 hours, 35 minutes, and 17 seconds
1 hours, 32 minutes, and 4 seconds
1 hours, 30 minutes, and 2 seconds
1 hours, 28 minutes, and 48 seconds
1 hours, 39 minutes, and 48 seconds
1 hours, 29 minutes, and 23 seconds
1 hours, 25 minutes, and 40 seconds
0 hours, 56 minutes, and 19 seconds
1 hours, 54 minutes, and 49 seconds
1 hours, 18 minutes, and 48 seconds
1 hours, 15 minutes, and 36 seconds
1 hours, 16 minutes, and 2 seconds
1 hours, 14 minutes, and 0 seconds
1 hours, 13 minutes, and 41 seconds
1 hours, 14 minutes, and 16 seconds
1 hours, 12 minutes, and 6 seconds
1 hours, 10 minutes, and 10 seconds
1 hours, 11 minutes, and 38 seconds
1 hours, 11 minutes, and 17 seconds
1 hours, 11 minutes, and 3 seconds
1 hours, 10 minutes, and 17 seconds
1 hours, 9 minutes, and 50 seconds
1 hours, 8 minutes, and 43 seconds
1 hours, 6 minutes, and 54 seconds
1 hours, 6 minutes, and 44 seconds
1 hours, 8 minutes, and 8 seconds
1 hours, 6 minutes, and 2 seconds
1 hours, 7 minutes, and 27 seconds
1 hours, 5 minutes, and 43 seconds
1 hours, 5 minutes, and 16 seconds
1 hours, 4 minutes, and 36 seconds
1 hours, 4 minutes, and 19 seconds
1 hours, 3 minutes, and 57 seconds
1 hours, 33 minutes, and 54 seconds
1 hours, 16 minutes, and 15 seconds
1 hours, 3 minutes, and 57 seconds
1 hours, 3 minutes, and 18 seconds
1 hours, 23 minutes, and 13 seconds
1 hours, 20 minutes, and 56 seconds
1 hours, 19 minutes, and 31 seconds
0 hours, 49 minutes, and 39 seconds
1 hours, 17 minutes, and 32 seconds
1 hours, 16 minutes, and 5 seconds
1 hours, 43 minutes, and 20 seconds
1 hours, 10 minutes, and 51 seconds
1 hours, 10 minutes, and 39 seconds
1 hours, 8 minutes, and 31 seconds
1 hours, 7 minutes, and 20 seconds
1 hours, 7 minutes, and 19 seconds
1 hours, 3 minutes, and 58 seconds
1 hours, 1 minutes, and 28 seconds
1 hours, 0 minutes, and 45 seconds
0 hours, 57 minutes, and 22 seconds
0 hours, 55 minutes, and 19 seconds

In short the lag seems to grow over time until it is being "reset" (e.g. "only" 4 minutes, and 17 seconds for an event)?
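For reference, here is the same query as the one-liner above, just reformatted over multiple lines for readability (same index pattern, fields and output format):

```console
$ curl -sS "http://localhost:9200/crowdsec-*/_search" \
    -H 'Content-Type: application/json' \
    -d '{ "size": 100, "sort": [{ "stop_at": { "order": "desc" } }] }' \
  | jq -r '.hits.hits[] | [._source.stop_at, (._source.events | last | .timestamp)] | @tsv' \
  | while IFS=$'\t' read -r stop_at last_event_timestamp; do
      # lag = alert stop_at minus the timestamp of the last event attached to the alert
      diff=$(( $(date -u -d "$stop_at" +%s) - $(date -u -d "$last_event_timestamp" +%s) ))
      printf '%d hours, %d minutes, and %d seconds\n' \
        $((diff / 3600)) $(( (diff % 3600) / 60 )) $((diff % 60))
    done
```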

blotus commented 6 months ago

Did you modify the charts yourself? I'm a little confused by the Y axis: it's supposed to show the buckets we use for the histogram (for example, for the parsing time, it should go from 0.0 to 0.01 in roughly 0.001 increments), something like this:

image

(It might also be due to your grafana version, my screenshot has been made on 10.2.3)

The values on the left make me think that the option Calculate from data is set to Yes in the panel (in which case, grafana will try to compute the bucket for the histogram itself, but it's already done by crowdsec/prometheus). To check/change this, edit the panel, and on the menu on the right, go to Heatmap, you should see the option.

In short the lag seems to grow over time until it is being "reset" (e.g. "only" 4 minutes, and 17 seconds for an event)?

That's very odd. Crowdsec does not have a mechanism to drop events internally (i.e., once a log line has been read from a file, crowdsec will keep it until it has been sent through the parsers and scenarios), so I don't really see how the processing would be magically sped up at some point (if crowdsec somehow falls behind due to the volume of logs, it's quite unlikely it will ever catch up unless there's a huge drop in the number of logs per second).

laurentgoudet commented 6 months ago

Did you modify the charts yourself? I'm a little confused by the Y axis: it's supposed to show the buckets we use for the histogram (for example, for the parsing time, it should go from 0.0 to 0.01 in roughly 0.001 increments), something like this:

image

(It might also be due to your grafana version, my screenshot has been made on 10.2.3)

The values on the left make me think that the option Calculate from data is set to Yes in the panel (in which case, grafana will try to compute the bucket for the histogram itself, but it's already done by crowdsec/prometheus). To check/change this, edit the panel, and on the menu on the right, go to Heatmap, you should see the option.

Right, I am on v9.0.1, which is not particularly old, but there is no Calculate from data option, and looking at the release notes it seems to have been introduced in v9.1 (https://grafana.com/docs/grafana/latest/whatsnew/whats-new-in-v9-1/#heatmap-improvements) 🙃.

I was able to switch to the "new (beta)" Heatmap panel, and the heatmaps are now rendering fine - thanks! The parsing-time one suggests that some events have a parse time above 10 ms (i.e. they land in the +Inf bucket):

Screenshot 2023-12-20 at 18 49 12 Screenshot 2023-12-20 at 18 48 30

In short the lag seems to grow over time until it is being "reset" (e.g. "only" 4 minutes, and 17 seconds for an event)?

That's very odd. Crowdsec does not have a mechanism to drop events internally (i.e., once a log line has been read from a file, crowdsec will keep it until it has been sent through the parsers and scenarios), so I don't really see how the processing would be magically sped up at some point (if crowdsec somehow falls behind due to the volume of logs, it's quite unlikely it will ever catch up unless there's a huge drop in the number of logs per second).

Does the size of the log file matter? I.e. could it be that when those 5 GB log files are reset to 0 on log rotation, events are then processed much faster (not sure why that would be, if parsing time is fine)? I'll tweak my ES query script to see if I can figure out the interval of those "resets".

buixor commented 6 months ago

hopefully https://github.com/crowdsecurity/crowdsec/issues/2687 explains that too :)

laurentgoudet commented 6 months ago

hopefully #2687 explains that too :)

Unfortunately it did not 😞. Looking at one of the latest alerts, for instance: it fired at 15:16:29 UTC+1:

Screenshot 2024-01-05 at 17 08 41

while the corresponding request happened at 14:01:42 UTC+1:

access.log.4.gz:104.225.142.141 - - [05/Jan/2024:08:01:42 -0500] "GET /concrete/vendor/htmlawed/htmlawed/htmLawedTest.php HTTP/1.1" 404 24762 "-" "python-requests/2.31.0" 104.225.142.141, 104.225.142.141, 157.52.64.68, 167.82.233.86 fastly.freelancer.com http server_name="fastly.freelancer.com" server_addr=172.30.14.20 server_port=80 request_id=04e9a4fb7c3866fa2ef0ad3fd00a1b60 request_length=1316 request_time=2.284 proxy_host=varnish_gaf_pool proxy_protocol_addr="-" sent_http_cache_control=(s-maxage=3600) ssl_cipher=- ssl_protocol=- ssl_server_name=- upstream_addr=[172.30.18.46:80] upstream_connect_time=[0.000] upstream_header_time=[2.284] upstream_response_time=[2.284] upstream_status=[404] upstream_http_x_cache=MISS user_id_cookie="" user_id_header="" fastly_request_id="24820394-dd40-4b18-a691-09a6f933bad5" amzn_trace_id="Self=1-6597fdb4-25e088647b135d7a313bc11c;Root=24820394-dd40-4b18-a691-09a6f933bad5;AddedBy=Fastly" realip_remote_addr="172.30.11.25" freelancer_app_build_timestamp="-" freelancer_app_locale="-" freelancer_app_name="-"
{
  "meta": [
    {
      "key": "ASNNumber",
      "value": "54913"
    },
    {
      "key": "ASNOrg",
      "value": "CLOUDWEBMANAGE-CANADA"
    },
    {
      "key": "IsInEU",
      "value": "false"
    },
    {
      "key": "IsoCode",
      "value": "CA"
    },
    {
      "key": "SourceRange",
      "value": "104.225.142.0/23"
    },
    {
      "key": "datasource_path",
      "value": "/logs/nginx/access.log"
    },
    {
      "key": "datasource_type",
      "value": "file"
    },
    {
      "key": "http_args_len",
      "value": "0"
    },
    {
      "key": "http_path",
      "value": "/concrete/vendor/htmlawed/htmlawed/htmLawedTest.php"
    },
    {
      "key": "http_status",
      "value": "404"
    },
    {
      "key": "http_user_agent",
      "value": "python-requests/2.31.0"
    },
    {
      "key": "http_verb",
      "value": "GET"
    },
    {
      "key": "log_type",
      "value": "http_access-log"
    },
    {
      "key": "service",
      "value": "http"
    },
    {
      "key": "source_ip",
      "value": "104.225.142.141"
    },
    {
      "key": "timestamp",
      "value": "2024-01-05T08:01:42-05:00"
    }
  ],
  "timestamp": "2024-01-05T08:01:42-05:00"
}

The event timestamp on the CrowdSec alert logs is the correct one, i.e. 08:01:42-05:00 is 14:01:42 UTC+1. In that specific instance it took ~1h15 minutes between the event being logged & the alert firing.

The log appears on line 2772796 in the log file, while the log file was 3832897 lines (4.7Gb) before being rotated. I'm going to try to fiddle with a couple of things, but as all the CrowdSec metrics seem to be fine I am not really sure where this could be coming from.

buixor commented 6 months ago

Hello,

I wanted to follow up because I'm still puzzled :)

I've spun up a similar instance on AWS, and the default setup (1 go routine) took around 5 seconds to parse 5k lines of nginx access logs, down to a bit less than 3 seconds with re2_grok_support enabled.

default setup with 1 parser routine

INFO[09-01-2024 13:38:29] reading ./nginx5k.log at once                 type="file://./nginx5k.log"
...
WARN[09-01-2024 13:38:34] Acquisition is finished, shutting down

And increasing the number of goroutines dedicated to parsers (up to ~cores-1) should allow nearly linear scaling.

You can test it using the replay mode feature (crowdsec -dsn file:///tmp/nginx5k.log -type nginx -no-api): it's going to start a crowdsec without its own local API, parse the provided nginx access log (sending alerts to the already running local API) and shut down once done. Can you try a similar operation on your side to see the results you're getting? As well, providing pprof output can help if you get really abnormal results.
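For example, something along these lines (paths are just examples):

```console
# grab a 5k-line sample from one of the access logs
$ head -n 5000 /logs/nginx/access.log > /tmp/nginx5k.log
# replay it through the agent without starting a local API
$ time crowdsec -dsn file:///tmp/nginx5k.log -type nginx -no-api
# same run with the RE2 grok feature flag enabled, for comparison
$ time CROWDSEC_FEATURE_RE2_GROK_SUPPORT=true crowdsec -dsn file:///tmp/nginx5k.log -type nginx -no-api
```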

Once we solved the core issue, there are a few things we can look at to ensure it scales properly:

laurentgoudet commented 6 months ago

I've spun up a similar instance on AWS, and the default setup (1 go routine) took around 5 seconds to parse 5k lines of nginx access logs, down to a bit less than 3 seconds with re2_grok_support enabled.

default setup with 1 parser routine

INFO[09-01-2024 13:38:29] reading ./nginx5k.log at once                 type="file://./nginx5k.log"
...
WARN[09-01-2024 13:38:34] Acquisition is finished, shutting down

With the default setup I get:

INFO[09-01-2024 15:33:26] Loaded 44 scenarios
INFO[09-01-2024 15:33:26] Adding file /logs/nginx/access.log.1.5k to filelist  type="file:///logs/nginx/access.log.1.5k"
INFO[09-01-2024 15:33:26] Starting processing data
INFO[09-01-2024 15:33:26] reading /logs/nginx/access.log.1.5k at once   type="file:///logs/nginx/access.log.1.5k"
WARN[09-01-2024 15:33:26] prometheus: listen tcp 0.0.0.0:6060: bind: address already in use
WARN[09-01-2024 15:33:36] Acquisition is finished, shutting down
INFO[09-01-2024 15:33:37] Killing parser routines
INFO[09-01-2024 15:33:38] Bucket routine exiting
INFO[09-01-2024 15:33:39] crowdsec shutdown

i.e. it takes 10s on a c6in.2xlarge with a dockerized CrowdSec.

At the moment the access logs on the hosts are rotated hourly and grow to ~5 GB and 4M lines before rotation. If parsing scales linearly with the size of the log file, that's 4M lines * 10s / 5k lines = >2 hours of parsing time, although I would assume CrowdSec just tails the log and does not re-read everything from the start on each new line?

Using sudo docker exec -it crowdsec_agent bash -c 'CROWDSEC_FEATURE_RE2_GROK_SUPPORT=true crowdsec -dsn file:///logs/nginx/access.log.1.5k -type nginx -no-api' I can indeed see the acquisition time being cut down to ~3s:

INFO[09-01-2024 15:41:46] Loaded 44 scenarios
INFO[09-01-2024 15:41:46] Adding file /logs/nginx/access.log.1.5k to filelist  type="file:///logs/nginx/access.log.1.5k"
INFO[09-01-2024 15:41:46] Starting processing data
INFO[09-01-2024 15:41:46] reading /logs/nginx/access.log.1.5k at once   type="file:///logs/nginx/access.log.1.5k"
WARN[09-01-2024 15:41:46] prometheus: listen tcp 0.0.0.0:6060: bind: address already in use
WARN[09-01-2024 15:41:50] Acquisition is finished, shutting down
INFO[09-01-2024 15:41:50] Killing parser routines
INFO[09-01-2024 15:41:52] Bucket routine exiting
INFO[09-01-2024 15:41:53] crowdsec shutdown

On a 4M-line log file this would still mean ~40 minutes of parsing time.

And increasing the number of goroutines dedicated to parsers (up to ~cores-1) should allow nearly linear scaling.

On a 5k-line log file I don't see a noticeable difference with 7 parser routines vs. 1 (still ~3s of parsing time). Trying on a 100k-line log file, I get the exact same parsing time (~1 min) with 7 parser routines vs. 1.

With parser_routines: 1:

INFO[09-01-2024 15:49:35] Loaded 44 scenarios
INFO[09-01-2024 15:49:35] Adding file /logs/nginx/access.log.1.100k to filelist  type="file:///logs/nginx/access.log.1.100k"
INFO[09-01-2024 15:49:35] Starting processing data
INFO[09-01-2024 15:49:35] reading /logs/nginx/access.log.1.100k at once  type="file:///logs/nginx/access.log.1.100k"
WARN[09-01-2024 15:49:35] prometheus: listen tcp 0.0.0.0:6060: bind: address already in use
INFO[09-01-2024 15:49:38] 9 existing buckets
INFO[09-01-2024 15:49:41] 24 existing buckets
INFO[09-01-2024 15:49:44] 36 existing buckets
INFO[09-01-2024 15:49:48] 52 existing buckets
INFO[09-01-2024 15:49:51] 64 existing buckets
INFO[09-01-2024 15:49:54] 78 existing buckets
INFO[09-01-2024 15:49:57] 88 existing buckets
INFO[09-01-2024 15:50:00] 98 existing buckets
INFO[09-01-2024 15:50:03] 102 existing buckets
INFO[09-01-2024 15:50:06] 112 existing buckets
INFO[09-01-2024 15:50:09] 122 existing buckets
INFO[09-01-2024 15:50:12] 136 existing buckets
INFO[09-01-2024 15:50:16] 146 existing buckets
INFO[09-01-2024 15:50:19] 153 existing buckets
INFO[09-01-2024 15:50:22] 163 existing buckets
INFO[09-01-2024 15:50:25] 171 existing buckets
INFO[09-01-2024 15:50:28] 188 existing buckets
INFO[09-01-2024 15:50:31] 201 existing buckets
INFO[09-01-2024 15:50:34] 208 existing buckets
WARN[09-01-2024 15:50:36] Acquisition is finished, shutting down
INFO[09-01-2024 15:50:37] Killing parser routines
INFO[09-01-2024 15:50:38] Bucket routine exiting
INFO[09-01-2024 15:50:39] crowdsec shutdown

With parser_routines: 7:

INFO[09-01-2024 15:47:39] Adding file /logs/nginx/access.log.1.100k to filelist  type="file:///logs/nginx/access.log.1.100k"
INFO[09-01-2024 15:47:39] Starting processing data
INFO[09-01-2024 15:47:39] reading /logs/nginx/access.log.1.100k at once  type="file:///logs/nginx/access.log.1.100k"
WARN[09-01-2024 15:47:39] prometheus: listen tcp 0.0.0.0:6060: bind: address already in use
INFO[09-01-2024 15:47:43] 9 existing buckets
INFO[09-01-2024 15:47:46] 24 existing buckets
INFO[09-01-2024 15:47:49] 36 existing buckets
INFO[09-01-2024 15:47:52] 52 existing buckets
INFO[09-01-2024 15:47:55] 64 existing buckets
INFO[09-01-2024 15:47:58] 78 existing buckets
INFO[09-01-2024 15:48:01] 88 existing buckets
INFO[09-01-2024 15:48:04] 98 existing buckets
INFO[09-01-2024 15:48:07] 102 existing buckets
INFO[09-01-2024 15:48:10] 112 existing buckets
INFO[09-01-2024 15:48:13] 122 existing buckets
INFO[09-01-2024 15:48:16] 136 existing buckets
INFO[09-01-2024 15:48:20] 146 existing buckets
INFO[09-01-2024 15:48:23] 153 existing buckets
INFO[09-01-2024 15:48:26] 163 existing buckets
INFO[09-01-2024 15:48:29] 171 existing buckets
INFO[09-01-2024 15:48:32] 188 existing buckets
INFO[09-01-2024 15:48:35] 201 existing buckets
INFO[09-01-2024 15:48:38] 208 existing buckets
WARN[09-01-2024 15:48:40] Acquisition is finished, shutting down
INFO[09-01-2024 15:48:41] Killing parser routines
INFO[09-01-2024 15:48:41] Killing parser routines
INFO[09-01-2024 15:48:41] Killing parser routines
INFO[09-01-2024 15:48:41] Killing parser routines
INFO[09-01-2024 15:48:41] Killing parser routines
INFO[09-01-2024 15:48:41] Killing parser routines
INFO[09-01-2024 15:48:41] Killing parser routines
INFO[09-01-2024 15:48:42] Bucket routine exiting
INFO[09-01-2024 15:48:43] crowdsec shutdow

Looking at docker stats the container seems to be using a bit more CPU with 7 parser routines despite the parsing time being basically the same.

With parser_routines: 1:

Screenshot 2024-01-09 at 16 54 21

With parser_routines: 7:

Screenshot 2024-01-09 at 16 53 07

As well, providing pprof output can help if you get really abnormal results.

I couldn't find an easy way to profile the replays, as the service itself is already binding to port 6060 and there's no environment flag to override that. Here are the captures I took on the live traffic though:

pprof.crowdsec.samples.cpu.002.pb.gz

pprof.crowdsec.goroutine.003.pb.gz

In short, what is CrowdSec's behavior with large log files? Is it really re-reading everything? Or is the issue specific to those files being Docker bind mounts?

buixor commented 6 months ago

In short, what is CrowdSec's behavior with large log files? Is it really re-reading everything? Or is the issue specific to those files being Docker bind mounts?

Crowdsec tails the file, and file size shouldn't have an impact on performance (we're using https://github.com/nxadm/tail under the hood).

We've done a few benchmarks following your last feedback, and it seems that with the RE2 addition, buckets are now the bottleneck (it used to be the parsers), so increasing buckets_routines is more effective.

Note: tests are done on the same instance as previously, with a 100k-line nginx log file that triggers a lot of scenarios (with the full nginx collections), and RE2 enabled:

| config | duration (s) | EP/s |
| --- | --- | --- |
| 1 parser routine + 1 bucket routine | 47 | 2127 |
| 2 parser routines + 4 buckets routines | 25 | 4000 |

We have done similar tests within docker and didn't notice any significant performance impact. We're trying to do a few tests to see if it might be related to docker binds.

laurentgoudet commented 6 months ago

Thanks, your comment made me realize that I had 1k EP/s, which is likely higher than what the default config can handle, which could explain why the alerts are firing that late.

I've turned on the re2_grok_support feature flag on a canary host in production & I can indeed confirm that that was it, i.e. the behavior I was seeing was due to the parsers not being able to keep up with the event volume.
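For the record, I enabled the flag by passing the feature environment variable to the agent container; the compose excerpt below is an illustration of my setup rather than a copy of it (a feature.yaml file listing the flag in the config directory should also work, if I read the docs correctly):

```yaml
# docker-compose.yml (excerpt, illustrative)
services:
  crowdsec_agent:
    image: crowdsecurity/crowdsec
    environment:
      - CROWDSEC_FEATURE_RE2_GROK_SUPPORT=true   # same flag used in the replay tests above
    volumes:
      - /mnt/logs/nginx:/logs/nginx:ro           # nginx logs bind-mounted read-only
```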

On deploy, the acquisition rate doubled, now matching our actual EP/s (graph rate interval is 1 min, so ~60k matches ~1k/s):

Screenshot 2024-01-10 at 16 46 03

Interestingly, the unparsed lines increased from 0 to a couple, as if the graph was not working beforehand:

Screenshot 2024-01-10 at 16 46 17

The parsing heatmap is still a bit hard to read because a number of events fall into the +Inf bucket, but it now shows that most events are parsed in a timely manner. It feels like having larger histogram buckets would help here, although I guess if CrowdSec does not see/parse those events they won't appear anywhere?

Screenshot 2024-01-10 at 16 46 28

Bucket pour time is looking nice as well:

Screenshot 2024-01-10 at 16 58 50

And most importantly alerts are immediately raised on events, as one would expect:

$ curl -sS "http://localhost:9200/crowdsec-*/_search" -H 'Content-Type: application/json' -d'{ "size": 50, "sort": [{ "stop_at": { "order": "desc" } }] }' | jq -r '.hits.hits[] | [._source.stop_at, (._source.events | last | .timestamp)] | @tsv' | while IFS=$'\t' read -r stop_at last_event_timestamp; do stop_at_epoch=$(date -u -d "$stop_at" +%s); last_event_epoch=$(date -u -d "$last_event_timestamp" +%s); diff=$((stop_at_epoch - last_event_epoch)); hours=$((diff / 3600)); minutes=$(( (diff % 3600) / 60)); seconds=$((diff % 60)); echo "$stop_at: $hours hours, $minutes minutes, and $seconds seconds"; done
2024-01-10T15:37:34.597269052Z: 0 hours, 0 minutes, and 0 seconds
2024-01-10T15:35:29.012566824Z: 0 hours, 0 minutes, and 1 seconds
2024-01-10T15:32:14.357986823Z: 0 hours, 0 minutes, and 0 seconds
2024-01-10T15:15:40.646338896Z: 0 hours, 0 minutes, and 0 seconds
2024-01-10T15:13:18.135517384Z: 0 hours, 0 minutes, and 0 seconds
2024-01-10T15:13:13.931118796Z: 0 hours, 0 minutes, and 0 seconds
2024-01-10T15:09:50.436328392Z: 0 hours, 0 minutes, and 0 seconds
2024-01-10T14:58:05.06419677Z: 0 hours, 0 minutes, and 0 seconds
2024-01-10T14:52:38.733460913Z: 0 hours, 0 minutes, and 0 seconds
2024-01-10T14:50:43.979029111Z: 0 hours, 0 minutes, and 0 seconds
2024-01-10T14:30:03.199038588Z: 0 hours, 0 minutes, and 0 seconds
2024-01-10T13:38:45.44187652Z: 1 hours, 24 minutes, and 40 seconds
2024-01-10T13:34:54.6068261Z: 1 hours, 22 minutes, and 34 seconds
2024-01-10T12:45:57.846061249Z: 0 hours, 55 minutes, and 55 seconds
2024-01-10T12:45:08.101527956Z: 0 hours, 55 minutes, and 29 seconds
2024-01-10T11:34:48.859707498Z: 1 hours, 18 minutes, and 9 seconds
2024-01-10T11:02:17.555276131Z: 1 hours, 0 minutes, and 55 seconds
2024-01-10T10:38:39.076570509Z: 0 hours, 48 minutes, and 12 seconds
2024-01-10T09:42:34.953922781Z: 0 hours, 17 minutes, and 38 seconds
2024-01-10T09:35:56.205774173Z: 0 hours, 14 minutes, and 7 seconds
2024-01-10T08:57:48.185270026Z: 0 hours, 54 minutes, and 55 seconds
2024-01-10T08:54:52.643617507Z: 0 hours, 53 minutes, and 34 seconds
2024-01-10T08:41:58.934394767Z: 0 hours, 47 minutes, and 26 seconds

In the end it looks like CrowdSec was running in a degraded mode, which was a bit unintuitive as it was still firing some alerts (albeit very late). Having said that, I am not entirely sure what I would have expected the failure mode to be when the event rate is higher than what the agent can currently handle, as crashing/bailing out on an event spike/traffic increase is likely not desirable either.

As for why the alert lag seemed to randomly "reset" even though our event rate is fairly constant across the day, it feels like it could be due to an internal queue/ring buffer (managed by https://github.com/nxadm/tail or something else) that rolls over once in a while, which would also explain why CrowdSec was not leaking memory despite consuming only half of the logs?


As a remediation, is there a way for me to detect that failure mode through the Prometheus metrics so that I can set up an alert for it? It seems like cs_parsing_time_seconds_bucket cannot help much, apart from checking whether X events are being parsed under 0.0005s:

Screenshot 2024-01-10 at 17 05 46

I could build a similar alert with cs_filesource_hits_total but that would rely on a magic number as well.
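Something like the rule below is what I had in mind; it is only a sketch, and the 800 lines/s threshold is exactly the kind of magic number I'd rather avoid:

```yaml
# Prometheus alerting rule -- sketch only, the threshold is a per-host magic number
groups:
  - name: crowdsec
    rules:
      - alert: CrowdsecAcquisitionLagging
        # fire when the agent reads significantly fewer log lines than the
        # ~1k lines/s this host normally produces
        expr: sum by (instance) (rate(cs_filesource_hits_total[5m])) < 800
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "CrowdSec on {{ $labels.instance }} is reading fewer log lines than expected"
```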