High CPU on Multi-Server Setup

lleddewk commented 2 years ago

What happened?

I have a 4 node multi-server setup. All nodes are VPS linked by a wireguard VPN connection. After updating to version crowdsec 1.4.0 using the debian repo I am seeing consistently high CPU usage on the LAPI node. The 3 satellite nodes all remain at 1% to 5% cpu usage but the LAPI node ranges from 50% to 100%.

I have attached copies of config.yaml, log file starting approx 24 hours before I upgraded, cscli metrics and an extract from top.

I have prometheus collecting stats for grafana but don't know how to extract the data. If you can give me a pointer, I can provide these as well. Let me know if there is anything else

config.yaml.txt crowdsec.log metrics.txt top.txt

What did you expect to happen?

LAPI node cpu usage to remain at approx 5%.

How can we reproduce it (as minimally and precisely as possible)?

Install version 1.4.0 in a multi-server setup.

Anything else we need to know?

No response

Crowdsec version

2022/07/21 11:58:04 version: v1.4.0-debian-pragmatic-865ff5c88dd133eb81a1128f8d4765b4be0cbd22 2022/07/21 11:58:04 Codename: alphaga 2022/07/21 11:58:04 BuildDate: 2022-07-19_09:24:14 2022/07/21 11:58:04 GoVersion: 1.17.5 2022/07/21 11:58:04 Platform: linux 2022/07/21 11:58:04 Constraint_parser: >= 1.0, <= 2.0 2022/07/21 11:58:04 Constraint_scenario: >= 1.0, < 3.0 2022/07/21 11:58:04 Constraint_api: v1 2022/07/21 11:58:04 Constraint_acquis: >= 1.0, < 2.0

OS version

PRETTY_NAME="Debian GNU/Linux 11 (bullseye)" NAME="Debian GNU/Linux" VERSION_ID="11" VERSION="11 (bullseye)" VERSION_CODENAME=bullseye ID=debian HOME_URL="https://www.debian.org/" SUPPORT_URL="https://www.debian.org/support" BUG_REPORT_URL="https://bugs.debian.org/"

Enabled collections and parsers

crowdsecurity/base-http-scenarios,enabled,0.6,http common : scanners detection,collections crowdsecurity/http-cve,enabled,1.0,,collections crowdsecurity/linux,enabled,0.2,core linux support : syslog+geoip+ssh,collections crowdsecurity/nginx,enabled,0.2,nginx support : parser and generic http scenarios,collections crowdsecurity/sshd,enabled,0.2,sshd support : parser and brute-force detection,collections crowdsecurity/dateparse-enrich,enabled,0.2,,parsers crowdsecurity/geoip-enrich,enabled,0.2,"Populate event with geoloc info : as, country, coords, source range.",parsers crowdsecurity/http-logs,enabled,0.8,"Parse more Specifically HTTP logs, such as HTTP Code, HTTP path, HTTP args and if its a static ressource",parsers crowdsecurity/nginx-logs,enabled,1.2,Parse nginx access and error logs,parsers crowdsecurity/sshd-logs,enabled,1.9,Parse openSSH logs,parsers crowdsecurity/syslog-logs,enabled,0.8,,parsers crowdsecurity/whitelists,enabled,0.2,Whitelist events from private ipv4 addresses,parsers sshd-zlogs-extra.yaml,"enabled,local",n/a,,parsers crowdsecurity/apache_log4j2_cve-2021-44228,enabled,0.4,Detect cve-2021-44228 exploitation attemps,scenarios crowdsecurity/f5-big-ip-cve-2020-5902,enabled,0.1,Detect cve-2020-5902 exploitation attemps,scenarios crowdsecurity/fortinet-cve-2018-13379,enabled,0.2,Detect cve-2018-13379 exploitation attemps,scenarios crowdsecurity/grafana-cve-2021-43798,enabled,0.1,Detect cve-2021-43798 exploitation attemps,scenarios crowdsecurity/http-backdoors-attempts,enabled,0.2,Detect attempt to common backdoors,scenarios crowdsecurity/http-bad-user-agent,enabled,0.7,Detect bad user-agents,scenarios crowdsecurity/http-crawl-non_statics,enabled,0.3,Detect aggressive crawl from single ip,scenarios crowdsecurity/http-cve-2021-41773,enabled,0.1,cve-2021-41773,scenarios crowdsecurity/http-cve-2021-42013,enabled,0.1,cve-2021-42013,scenarios crowdsecurity/http-generic-bf,enabled,0.2,Detect generic http brute force,scenarios crowdsecurity/http-open-proxy,enabled,0.2,Detect scan for open proxy,scenarios crowdsecurity/http-path-traversal-probing,enabled,0.2,Detect path traversal attempt,scenarios crowdsecurity/http-probing,enabled,0.2,Detect site scanning/probing from a single ip,scenarios crowdsecurity/http-sensitive-files,enabled,0.2,"Detect attempt to access to sensitive files (.log, .db ..) or folders (.git)",scenarios crowdsecurity/http-sqli-probing,enabled,0.2,A scenario that detects SQL injection probing with minimal false positives,scenarios crowdsecurity/http-xss-probing,enabled,0.2,A scenario that detects XSS probing with minimal false positives,scenarios crowdsecurity/jira_cve-2021-26086,enabled,0.1,Detect Atlassian Jira CVE-2021-26086 exploitation attemps,scenarios crowdsecurity/nginx-req-limit-exceeded,enabled,0.1,Detects IPs which violate nginx's user set request limit.,scenarios crowdsecurity/pulse-secure-sslvpn-cve-2019-11510,enabled,0.2,Detect cve-2019-11510 exploitation attemps,scenarios crowdsecurity/spring4shell_cve-2022-22965,enabled,0.2,Detect cve-2022-22965 probing,scenarios crowdsecurity/ssh-bf,enabled,0.1,Detect ssh bruteforce,scenarios crowdsecurity/ssh-slow-bf,enabled,0.2,Detect slow ssh bruteforce,scenarios crowdsecurity/thinkphp-cve-2018-20062,enabled,0.3,Detect ThinkPHP CVE-2018-20062 exploitation attemps,scenarios crowdsecurity/vmware-cve-2022-22954,enabled,0.2,Detect Vmware CVE-2022-22954 exploitation attempts,scenarios crowdsecurity/vmware-vcenter-vmsa-2021-0027,enabled,0.1,Detect VMSA-2021-0027 exploitation attemps,scenarios http-strict-probing.yaml,"enabled,local",n/a,,scenarios ltsich/http-w00tw00t,enabled,0.1,detect w00tw00t,scenarios ssh-extra.yaml,"enabled,local",n/a,,scenarios ssh-strict-bf.yaml,"enabled,local",n/a,,scenarios whitelists-extra.yaml,"enabled,local",n/a,,postoverflows whitelists-monitors.yaml,"enabled,local",n/a,,postoverflows

Acquisition config

#Generated acquisition file - wizard.sh (service: nginx) / files : /var/log/nginx/sjs.access.log /var/log/nginx/access.log /var/log/nginx/error.log /var/log/nginx/php7.4-fpm.log /var/log/nginx/hamish.access.log /var/log/nginx/grafana.log /var/log/nginx/weddell.access.log filenames: - /var/log/nginx/sjs.access.log - /var/log/nginx/access.log - /var/log/nginx/error.log - /var/log/nginx/php7.4-fpm.log - /var/log/nginx/hamish.access.log - /var/log/nginx/grafana.log - /var/log/nginx/weddell.access.log labels: type: nginx --- #Generated acquisition file - wizard.sh (service: sshd) / files : /var/log/auth.log filenames: - /var/log/auth.log labels: type: syslog --- #Generated acquisition file - wizard.sh (service: linux) / files : /var/log/syslog /var/log/messages filenames: - /var/log/syslog - /var/log/messages labels: type: syslog ---

Config show

Global: - Configuration Folder : /etc/crowdsec - Data Folder : /var/lib/crowdsec/data - Hub Folder : /etc/crowdsec/hub - Simulation File : /etc/crowdsec/simulation.yaml - Log Folder : /var/log/ - Log level : info - Log Media : file Crowdsec: - Acquisition File : /etc/crowdsec/acquis.yaml - Parsers routines : 1 cscli: - Output : human - Hub Branch : - Hub Folder : /etc/crowdsec/hub Local API Server: - Listen URL : 10.90.80.11:8080 - Profile File : /etc/crowdsec/profiles.yaml - Trusted IPs: - 127.0.0.1 - ::1 - Database: - Type : sqlite - Path : /var/lib/crowdsec/data/crowdsec.db - Flush age : 7d - Flush size : 5000

Prometheus metrics

No response

Related custom configs versions (if applicable) : notification plugins, custom scenarios, parsers etc.

Custom SSH Parser - [sshd-zlogs-extra.yaml.txt](https://github.com/crowdsecurity/crowdsec/files/9158831/sshd-zlogs-extra.yaml.txt) Custom Scenarios - [ssh-strict-bf.yaml.txt](https://github.com/crowdsecurity/crowdsec/files/9158840/ssh-strict-bf.yaml.txt) [ssh-extra.yaml.txt](https://github.com/crowdsecurity/crowdsec/files/9158841/ssh-extra.yaml.txt) [http-strict-probing.yaml.txt](https://github.com/crowdsecurity/crowdsec/files/9158842/http-strict-probing.yaml.txt)

buixor commented 2 years ago

thanks for the issue, this looks indeed abnormal.

I don't see anything weird in your logs or config, would you be able to do a pprof capture of cpu usage please : go tool pprof http://localhost:6060/debug/pprof/profile

The resulting .pb.gz file would be very helpfull to see where crowdsec is spending all your CPU :sweat_smile:

lleddewk commented 2 years ago

File attached. Hopefully this is what you asked for. crowdsec.pb.gz

eguaj commented 2 years ago

I'm also observing 100% CPU usage by the crowdsec daemon since the upgrade to v1.4.0.

It seems the problem occurs when the crowdsec-firewall-iptables bouncers starts and connects to the crowdsec server.

If I block bouncers API access (e.g. with a iptables REJECT rule) and restart the crowdsec daemon, then it runs with "normal" CPU usage, but as soon as I reopen bouncers API access, then it starts running at 100% CPU usage.

buixor commented 2 years ago

yes, thanks. we made some changes to improve decision deduplication, but it seems that it has some CPU cost. Can you confirm that what you are seeing are spikes when the bouncers are pulling data ?

eguaj commented 2 years ago

In the crowdsec daemon log I have this message when a bouncer connects to the API port:

time="21-07-2022 14:06:16" level=error msg="heartbeat error : API error: updating machine last_heartbeat: database is locked: unable to update"
time="21-07-2022 14:06:23" level=error msg="while fetching bouncer info: select bouncer: database is locked: unable to query" ip=x.x.x.x

lleddewk commented 2 years ago

Confirmed. When I stop the bouncers, cpu usage falls below 1%.

Edit: I'm not seeing that error in my logs.

eguaj commented 2 years ago

Stopping the bouncer seems to reduce the CPU usage after some time.

I stopped all my bouncers and just started one bouncer.

Here is the log when I stop that single bouncer:

time="21-07-2022 14:13:19" level=info msg="Starting processing data"
time="21-07-2022 14:16:57" level=warning msg="DeleteAlertGraph : database is locked"
time="21-07-2022 14:16:57" level=warning msg="DeleteAlertWithFilter : event with alert ID '13887': unable to delete"
time="21-07-2022 14:16:57" level=warning msg="FlushAlerts (max age) : event with alert ID '13887': unable to delete"
time="21-07-2022 14:17:23" level=warning msg="client x.x.x.x error : client disconnected"
time="21-07-2022 14:17:23" level=warning msg="stacktrace written to /tmp/crowdsec-crash.302720166.txt, please join to your issue"
time="21-07-2022 14:17:24" level=info msg="flushed 1/725 alerts because they were created 7d ago or more"

Locking problem with the sqlite database?

Here is the crowdsec-crash.302720166.txt file reported in the log.

eguaj commented 2 years ago

I deleted the content from the alerts, decisions and events tables, restarted the crowdsec daemon, waited for it to repopulate the decisions table from the log, and finally reopened API access to the bouncers.

The CPU usage remains high (>= 80%) for the past 2 hours and does not drops to the level observed before v1.4.0:

# sar -u
12:25:01        CPU     %user     %nice   %system   %iowait    %steal     %idle
[...]
14:25:01        all     31,63      0,00      4,01      0,31      0,27     63,77
14:35:01        all     19,92      0,00      2,58      0,32      0,32     76,86
14:45:01        all     18,27      0,00      3,25      3,21      0,37     74,91
14:55:01        all     46,64      0,00      5,34      3,33      0,32     44,37
15:05:01        all     82,79      0,00      8,23      1,56      0,28      7,14
15:15:01        all     78,81      0,00      7,67      1,40      0,26     11,86
15:25:02        all     79,14      0,00      7,68      1,40      0,25     11,52
15:35:01        all     78,03      0,00      7,63      1,39      0,26     12,69
15:45:01        all     68,20      0,00      6,96      1,12      0,29     23,42
15:55:01        all     80,48      0,00      7,96      1,26      0,27     10,03
16:05:01        all     82,42      0,00      7,73      1,25      0,28      8,33
16:15:01        all     81,08      0,00      7,64      1,19      0,26      9,83
16:25:01        all     82,80      0,00      8,07      1,44      0,29      7,40
16:35:01        all     83,71      0,00      8,29      1,41      0,25      6,34
16:45:01        all     82,04      0,00      8,07      1,33      0,24      8,32
16:55:01        all     82,56      0,00      8,09      1,60      0,30      7,45
Average:        all     29,97      0,02      4,05      0,92      0,19     64,85

Also, I noticed that running sqlite3 /var/lib/crowdsec/data/crowdsec.db to perform SELECT count(*) queries, I regularly get a Error: database is locked error message:

sqlite> SELECT count(*) FROM alerts; SELECT count(*) FROM decisions; SELECT count(*) FROM events;
18
15487
120
sqlite> SELECT count(*) FROM alerts; SELECT count(*) FROM decisions; SELECT count(*) FROM events;
Error: database is locked
sqlite> SELECT count(*) FROM alerts; SELECT count(*) FROM decisions; SELECT count(*) FROM events;
18
15487
120
sqlite> SELECT count(*) FROM alerts; SELECT count(*) FROM decisions; SELECT count(*) FROM events;
Error: database is locked
sqlite> SELECT count(*) FROM alerts; SELECT count(*) FROM decisions; SELECT count(*) FROM events;
Error: database is locked
sqlite> SELECT count(*) FROM alerts; SELECT count(*) FROM decisions; SELECT count(*) FROM events;
18
15487
120

buixor commented 2 years ago

We're working on a fix, the approach was way too costly. We'll keep you posted tomorrow, hopefully, 1.4.1 soon.

buixor commented 2 years ago

Hey ! Following the issue spotted yesterday with 1.4.0 and CPU usage of local API and stream mode bouncers, we've released a fix in 1.4.1-rc1 ! Feedback is more than welcome so that we can make this release official. You can grab 1.4.1-rc1 package on our testing repository : https://packagecloud.io/crowdsec/crowdsec-testing (note that you can as well grab the package directly on the repo without installing the repo itself!)

eguaj commented 2 years ago

Looks good to me!

I manually rebuilt the crowdsec binary from git v1.4.1-rc1 tag, replaced it on the server, restarted daemon, and the CPU usage is now back to "normal" with all the bouncers connected.

lleddewk commented 2 years ago

Looks good here as well. I installed from the testing repo. 4 bouncers are reconnected to the LAPI and I am seeing crowdsec peak at around 5% which is what I had before v1.4.0. Many thanks.

crowdsecurity / crowdsec