NethServer / dev

NethServer issue tracker
https://github.com/NethServer/dev/issues
62 stars 20 forks source link

Crowdsec fails to start at boot #6934

Closed nrauso closed 3 months ago

nrauso commented 4 months ago

In some circumstances, when restarting NS8 the systemd unit of the crowdsec module fails to start. The problem is related to the state of the systemd unit of redis service: if crowdsec finds redis stopped, it cannot complete its configuration provisioning (smarthost) and fails.

Steps to reproduce

Expected behavior

crowdsec systemd unit should run flawlessly at boot.

Actual behavior

crowdsec systemd unit trys to start but it fails with this output:

May 19 19:32:57 ns crowdsec1[1240]: Traceback (most recent call last):
May 19 19:32:57 ns crowdsec1[1240]:  File "/usr/local/agent/pyenv/lib64/python3.11/site-packages/redis/connection.py", line 264, in connect
May 19 19:32:57 ns crowdsec1[1240]:    sock = self.retry.call_with_retry(
May 19 19:32:57 ns crowdsec1[1240]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
May 19 19:32:57 ns crowdsec1[1240]:  File "/usr/local/agent/pyenv/lib64/python3.11/site-packages/redis/retry.py", line 46, in call_with_retry
May 19 19:32:57 ns crowdsec1[1240]:    return do()
May 19 19:32:57 ns crowdsec1[1240]:           ^^^^
May 19 19:32:57 ns crowdsec1[1240]:  File "/usr/local/agent/pyenv/lib64/python3.11/site-packages/redis/connection.py", line 265, in <lambda>
May 19 19:32:57 ns crowdsec1[1240]:    lambda: self._connect(), lambda error: self.disconnect(error)
May 19 19:32:57 ns crowdsec1[1240]:            ^^^^^^^^^^^^^^^
May 19 19:32:57 ns crowdsec1[1240]:  File "/usr/local/agent/pyenv/lib64/python3.11/site-packages/redis/connection.py", line 627, in _connect
May 19 19:32:57 ns crowdsec1[1240]:    raise err
May 19 19:32:57 ns crowdsec1[1240]:  File "/usr/local/agent/pyenv/lib64/python3.11/site-packages/redis/connection.py", line 615, in _connect
May 19 19:32:57 ns crowdsec1[1240]:    sock.connect(socket_address)
May 19 19:32:57 ns crowdsec1[1240]: ConnectionRefusedError: [Errno 111] Connection refused
May 19 19:32:57 ns crowdsec1[1240]: During handling of the above exception, another exception occurred:
May 19 19:32:57 ns crowdsec1[1240]: Traceback (most recent call last):
May 19 19:32:57 ns crowdsec1[1240]:  File "/var/lib/nethserver/crowdsec1/bin/expand-configuration", line 82, in <module>
May 19 19:32:57 ns crowdsec1[1240]:    smtp = agent.get_smarthost_settings(agent.redis_connect())
May 19 19:32:57 ns crowdsec1[1240]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
May 19 19:32:57 ns crowdsec1[1240]:  File "/usr/local/agent/pypkg/agent/__init__.py", line 509, in get_smarthost_settings
May 19 19:32:57 ns crowdsec1[1240]:    conf = rdb.hgetall('cluster/smarthost')
May 19 19:32:57 ns crowdsec1[1240]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
May 19 19:32:57 ns crowdsec1[1240]:  File "/usr/local/agent/pyenv/lib64/python3.11/site-packages/redis/commands/core.py", line 4954, in hgetall
May 19 19:32:57 ns crowdsec1[1240]:    return self.execute_command("HGETALL", name)
May 19 19:32:57 ns crowdsec1[1240]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
May 19 19:32:57 ns crowdsec1[1240]:  File "/usr/local/agent/pyenv/lib64/python3.11/site-packages/redis/client.py", line 533, in execute_command
May 19 19:32:57 ns crowdsec1[1240]:    conn = self.connection or pool.get_connection(command_name, **options)
May 19 19:32:57 ns crowdsec1[1240]:                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
May 19 19:32:57 ns crowdsec1[1240]:  File "/usr/local/agent/pyenv/lib64/python3.11/site-packages/redis/connection.py", line 1086, in get_connection
May 19 19:32:57 ns crowdsec1[1240]:    connection.connect()
May 19 19:32:57 ns crowdsec1[1240]:  File "/usr/local/agent/pyenv/lib64/python3.11/site-packages/redis/connection.py", line 270, in connect
May 19 19:32:57 ns crowdsec1[1240]:    raise ConnectionError(self._error_message(e))
May 19 19:32:57 ns crowdsec1[1240]: redis.exceptions.ConnectionError: Error 111 connecting to cluster-leader:6379. Connection refused.
...
May 19 19:32:58 ns systemd[1]: crowdsec1.service: Control process exited, code=exited, status=1/FAILURE

After 5 attempts done quickly, the start of the unit definitively fails:

May 19 19:33:01 ns systemd[1]: crowdsec1.service: Control process exited, code=exited, status=1/FAILURE
May 19 19:33:01 ns systemd[1]: crowdsec1.service: Failed with result 'exit-code'.
May 19 19:33:01 ns systemd[1]: Failed to start crowdsec server.
May 19 19:33:01 ns systemd[1]: crowdsec1.service: Scheduled restart job, restart counter is at 5.
May 19 19:33:01 ns systemd[1]: Stopped crowdsec server.
May 19 19:33:01 ns systemd[1]: crowdsec1.service: Start request repeated too quickly.
May 19 19:33:01 ns systemd[1]: crowdsec1.service: Failed with result 'exit-code'.
May 19 19:33:01 ns systemd[1]: Failed to start crowdsec server.

A few seconds later, the startup of redis is traced in the log:

May 19 19:33:13 ns redis[2284]: 1:C 19 May 2024 19:33:13.375 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
May 19 19:33:13 ns redis[2284]: 1:C 19 May 2024 19:33:13.375 * Redis version=7.2.3, bits=64, commit=00000000, modified=0, pid=1, just started
May 19 19:33:13 ns redis[2284]: 1:C 19 May 2024 19:33:13.375 * Configuration loaded
May 19 19:33:13 ns redis[2284]: 1:M 19 May 2024 19:33:13.376 * monotonic clock: POSIX clock_gettime
May 19 19:33:13 ns redis[2284]: 1:M 19 May 2024 19:33:13.377 * Running mode=standalone, port=6379.
May 19 19:33:13 ns redis[2284]: 1:M 19 May 2024 19:33:13.379 * Server initialized
May 19 19:33:13 ns redis[2284]: 1:M 19 May 2024 19:33:13.380 * Loading RDB produced by version 7.2.3
May 19 19:33:13 ns redis[2284]: 1:M 19 May 2024 19:33:13.380 * RDB age 160 seconds
May 19 19:33:13 ns redis[2284]: 1:M 19 May 2024 19:33:13.380 * RDB memory usage when created 3.14 Mb
May 19 19:33:13 ns redis[2284]: 1:M 19 May 2024 19:33:13.382 * Done loading RDB, keys loaded: 544, keys expired: 0.
May 19 19:33:13 ns redis[2284]: 1:M 19 May 2024 19:33:13.382 * DB loaded from disk: 0.003 seconds
May 19 19:33:13 ns redis[2284]: 1:M 19 May 2024 19:33:13.382 * Ready to accept connections tcp

Adding to crowdsec unit a dependence to redis.service workarounded the issue:

~]# systemctl cat crowdsec1.service
# /etc/systemd/system/crowdsec1.service
#
# Copyright (C) 2022 Nethesis S.r.l.
# SPDX-License-Identifier: GPL-3.0-or-later
#

#
# This systemd unit starts a crowdsec instance using Podman.
# Most parts of this file come from podman-generate-systemd.
#

[Unit]
Description=crowdsec server
After=network.target redis.service

Components

stephdl commented 4 months ago

QA

Install ghcr.io/nethserver/crowdsec:1.0.9-dev.1
check the bug is not reproducible
nrauso commented 4 months ago

test case: VERIFIED

DavidePrincipi commented 3 months ago

Released https://github.com/NethServer/ns8-crowdsec/releases/tag/1.0.9