Closed lleddewk closed 2 years ago
Thanks for the issue, this does indeed look abnormal.
I don't see anything weird in your logs or config. Would you be able to do a pprof capture of CPU usage, please: go tool pprof http://localhost:6060/debug/pprof/profile
The resulting .pb.gz file would be very helpful to see where crowdsec is spending all your CPU :sweat_smile:
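If go tool pprof is not installed on the affected host, the same profile can probably be saved with a plain HTTP request and attached to the issue. This is only a sketch: the base URL is the one quoted above, the `seconds` query parameter is the standard option of Go's net/http/pprof profile endpoint, and the function names and the output file name are made up for illustration.

```python
import urllib.request


def profile_url(base="http://localhost:6060", seconds=30):
    """Build the CPU profile URL; `seconds` is how long the server samples."""
    return f"{base}/debug/pprof/profile?seconds={seconds}"


def save_profile(dest="crowdsec.pb.gz", base="http://localhost:6060", seconds=30):
    """Download the gzipped profile to `dest` (hypothetical file name)."""
    url = profile_url(base, seconds)
    # The request blocks while the profile is being sampled, so the
    # timeout has to be longer than the sampling window.
    with urllib.request.urlopen(url, timeout=seconds + 10) as resp:
        with open(dest, "wb") as out:
            out.write(resp.read())
    return dest
```

Calling save_profile() on the LAPI node while the CPU is busy should leave a .pb.gz file that go tool pprof can open on another machine.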
File attached. Hopefully this is what you asked for. crowdsec.pb.gz
I'm also observing 100% CPU usage by the crowdsec daemon since the upgrade to v1.4.0.
It seems the problem occurs when the crowdsec-firewall-iptables bouncer starts and connects to the crowdsec server.
If I block the bouncers' API access (e.g. with an iptables REJECT rule) and restart the crowdsec daemon, it runs with "normal" CPU usage, but as soon as I reopen the bouncers' API access, it starts running at 100% CPU usage.
Yes, thanks. We made some changes to improve decision deduplication, but it seems that they have some CPU cost. Can you confirm that what you are seeing are spikes when the bouncers are pulling data?
In the crowdsec daemon log I have this message when a bouncer connects to the API port:
time="21-07-2022 14:06:16" level=error msg="heartbeat error : API error: updating machine last_heartbeat: database is locked: unable to update"
time="21-07-2022 14:06:23" level=error msg="while fetching bouncer info: select bouncer: database is locked: unable to query" ip=x.x.x.x
Confirmed. When I stop the bouncers, cpu usage falls below 1%.
Edit: I'm not seeing that error in my logs.
Stopping the bouncer seems to reduce the CPU usage after some time.
I stopped all my bouncers and just started one bouncer.
Here is the log when I stop that single bouncer:
time="21-07-2022 14:13:19" level=info msg="Starting processing data"
time="21-07-2022 14:16:57" level=warning msg="DeleteAlertGraph : database is locked"
time="21-07-2022 14:16:57" level=warning msg="DeleteAlertWithFilter : event with alert ID '13887': unable to delete"
time="21-07-2022 14:16:57" level=warning msg="FlushAlerts (max age) : event with alert ID '13887': unable to delete"
time="21-07-2022 14:17:23" level=warning msg="client x.x.x.x error : client disconnected"
time="21-07-2022 14:17:23" level=warning msg="stacktrace written to /tmp/crowdsec-crash.302720166.txt, please join to your issue"
time="21-07-2022 14:17:24" level=info msg="flushed 1/725 alerts because they were created 7d ago or more"
Locking problem with the sqlite database?
Here is the crowdsec-crash.302720166.txt file reported in the log.
I deleted the content from the alerts, decisions and events tables, restarted the crowdsec daemon, waited for it to repopulate the decisions table from the log, and finally reopened API access to the bouncers.
The CPU usage has remained high (>= 80%) for the past 2 hours and does not drop to the level observed before v1.4.0:
# sar -u
12:25:01 CPU %user %nice %system %iowait %steal %idle
[...]
14:25:01 all 31,63 0,00 4,01 0,31 0,27 63,77
14:35:01 all 19,92 0,00 2,58 0,32 0,32 76,86
14:45:01 all 18,27 0,00 3,25 3,21 0,37 74,91
14:55:01 all 46,64 0,00 5,34 3,33 0,32 44,37
15:05:01 all 82,79 0,00 8,23 1,56 0,28 7,14
15:15:01 all 78,81 0,00 7,67 1,40 0,26 11,86
15:25:02 all 79,14 0,00 7,68 1,40 0,25 11,52
15:35:01 all 78,03 0,00 7,63 1,39 0,26 12,69
15:45:01 all 68,20 0,00 6,96 1,12 0,29 23,42
15:55:01 all 80,48 0,00 7,96 1,26 0,27 10,03
16:05:01 all 82,42 0,00 7,73 1,25 0,28 8,33
16:15:01 all 81,08 0,00 7,64 1,19 0,26 9,83
16:25:01 all 82,80 0,00 8,07 1,44 0,29 7,40
16:35:01 all 83,71 0,00 8,29 1,41 0,25 6,34
16:45:01 all 82,04 0,00 8,07 1,33 0,24 8,32
16:55:01 all 82,56 0,00 8,09 1,60 0,30 7,45
Average: all 29,97 0,02 4,05 0,92 0,19 64,85
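To compare intervals from a table like this, the rows can be turned into numbers. A rough sketch, assuming the column order shown in the header above and a locale that prints a comma as the decimal separator; the function name is made up and the two sample rows are copied from the output above.

```python
SAMPLE = """\
14:45:01 all 18,27 0,00 3,25 3,21 0,37 74,91
15:05:01 all 82,79 0,00 8,23 1,56 0,28 7,14"""


def parse_sar_line(line):
    """Parse one `sar -u` data row that uses a comma as decimal separator."""
    fields = line.split()
    names = ("user", "nice", "system", "iowait", "steal", "idle")
    # fields[0] is the timestamp, fields[1] is the CPU label ("all").
    values = [float(v.replace(",", ".")) for v in fields[2:8]]
    row = dict(zip(names, values))
    row["time"] = fields[0]
    return row


rows = [parse_sar_line(line) for line in SAMPLE.splitlines()]
# Everything that is not idle counts as busy here.
busy = [(r["time"], round(100 - r["idle"], 2)) for r in rows]
```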
Also, I noticed that when running sqlite3 /var/lib/crowdsec/data/crowdsec.db to perform SELECT count(*) queries, I regularly get an Error: database is locked error message:
sqlite> SELECT count(*) FROM alerts; SELECT count(*) FROM decisions; SELECT count(*) FROM events;
18
15487
120
sqlite> SELECT count(*) FROM alerts; SELECT count(*) FROM decisions; SELECT count(*) FROM events;
Error: database is locked
sqlite> SELECT count(*) FROM alerts; SELECT count(*) FROM decisions; SELECT count(*) FROM events;
18
15487
120
sqlite> SELECT count(*) FROM alerts; SELECT count(*) FROM decisions; SELECT count(*) FROM events;
Error: database is locked
sqlite> SELECT count(*) FROM alerts; SELECT count(*) FROM decisions; SELECT count(*) FROM events;
Error: database is locked
sqlite> SELECT count(*) FROM alerts; SELECT count(*) FROM decisions; SELECT count(*) FROM events;
18
15487
120
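For anyone who wants to see how this error arises in general, here is a small self-contained sketch. It is not crowdsec's code: it uses a throwaway database in a temp directory, borrows only the table name alerts from the session above, and assumes SQLite's default rollback-journal mode, where a connection holding an exclusive lock makes other connections fail with this error once their busy timeout runs out.

```python
import os
import sqlite3
import tempfile


def try_count(path, timeout):
    """Return the row count of `alerts`, or the error text if the read fails."""
    reader = sqlite3.connect(path, timeout=timeout)
    try:
        return reader.execute("SELECT count(*) FROM alerts").fetchone()[0]
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        reader.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    # isolation_level=None: transactions are controlled by hand below.
    writer = sqlite3.connect(path, isolation_level=None)
    writer.execute("CREATE TABLE alerts (id INTEGER PRIMARY KEY)")
    writer.execute("INSERT INTO alerts DEFAULT VALUES")

    writer.execute("BEGIN EXCLUSIVE")      # hold the exclusive lock
    during = try_count(path, timeout=0)    # reader gives up immediately
    writer.execute("COMMIT")               # release the lock

    after = try_count(path, timeout=0)     # same query now succeeds
    writer.close()
    return during, after
```

With a non-zero timeout the reader would wait for the lock instead of failing right away, which is why the error only shows up when the database stays busy for longer than the client is willing to wait.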
We're working on a fix, the approach was way too costly. We'll keep you posted tomorrow, hopefully, 1.4.1 soon.
Hey ! Following the issue spotted yesterday with 1.4.0 and CPU usage of local API and stream mode bouncers, we've released a fix in 1.4.1-rc1 ! Feedback is more than welcome so that we can make this release official. You can grab 1.4.1-rc1 package on our testing repository : https://packagecloud.io/crowdsec/crowdsec-testing (note that you can as well grab the package directly on the repo without installing the repo itself!)
Looks good to me!
I manually rebuilt the crowdsec binary from the git v1.4.1-rc1 tag, replaced it on the server, restarted the daemon, and the CPU usage is now back to "normal" with all the bouncers connected.
Looks good here as well. I installed from the testing repo. 4 bouncers are reconnected to the LAPI and I am seeing crowdsec peak at around 5% which is what I had before v1.4.0. Many thanks.
What happened?
I have a 4-node multi-server setup. All nodes are VPSes linked by a WireGuard VPN connection. After updating to crowdsec version 1.4.0 using the Debian repo, I am seeing consistently high CPU usage on the LAPI node. The 3 satellite nodes all remain at 1% to 5% CPU usage, but the LAPI node ranges from 50% to 100%.
I have attached copies of config.yaml, log file starting approx 24 hours before I upgraded, cscli metrics and an extract from top.
I have prometheus collecting stats for grafana but don't know how to extract the data. If you can give me a pointer, I can provide these as well. Let me know if there is anything else
config.yaml.txt crowdsec.log metrics.txt top.txt
What did you expect to happen?
LAPI node cpu usage to remain at approx 5%.
How can we reproduce it (as minimally and precisely as possible)?
Install version 1.4.0 in a multi-server setup.
Anything else we need to know?
No response
Crowdsec version
OS version
Enabled collections and parsers
Acquisition config
Config show
Prometheus metrics
No response
Related custom configs versions (if applicable) : notification plugins, custom scenarios, parsers etc.