bcleonard / proxmox-qdevice


corosync-qnetd entered FATAL state / sshd terminated by SIGABRT (core dumped) #2

Open derandiunddasbo opened 6 months ago

derandiunddasbo commented 6 months ago

I'm trying to run the container on an (x86) Synology NAS, but neither corosync-qnetd comes up nor does the SSH server seem to run properly:

When the container boots, supervisord tries to start corosync-qnetd a couple of times and eventually gives up. This is the container's log after starting it:

2024-04-03 16:31:34,650 INFO Set uid to user 0 succeeded
2024-04-03 16:31:34,658 INFO supervisord started with pid 1
2024-04-03 16:31:35,662 INFO spawned: 'set_root_password' with pid 8
2024-04-03 16:31:35,667 INFO spawned: 'corosync-qnetd' with pid 9
2024-04-03 16:31:35,671 INFO spawned: 'sshd' with pid 10
2024-04-03 16:31:35,702 INFO success: set_root_password entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2024-04-03 16:31:36,638 INFO exited: set_root_password (exit status 0; expected)
2024-04-03 16:31:36,638 WARN exited: corosync-qnetd (exit status 1; not expected)
2024-04-03 16:31:37,643 INFO spawned: 'corosync-qnetd' with pid 15
2024-04-03 16:31:37,644 INFO success: sshd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-03 16:31:37,657 WARN exited: corosync-qnetd (exit status 1; not expected)
2024-04-03 16:31:39,662 INFO spawned: 'corosync-qnetd' with pid 16
2024-04-03 16:31:39,676 WARN exited: corosync-qnetd (exit status 1; not expected)
2024-04-03 16:31:42,683 INFO spawned: 'corosync-qnetd' with pid 17
2024-04-03 16:31:42,697 WARN exited: corosync-qnetd (exit status 1; not expected)
2024-04-03 16:31:42,698 INFO gave up: corosync-qnetd entered FATAL state, too many start retries too quickly

Also, sshd seems to crash on every connection attempt and gets restarted by supervisord. This is a connection attempt with an ssh client in verbose mode:

# ssh -v root@192.168.4.12
OpenSSH_9.2p1 Debian-2+deb12u2, OpenSSL 3.0.11 19 Sep 2023
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug1: Connecting to 192.168.4.12 [192.168.4.12] port 22.
debug1: Connection established.
debug1: identity file /root/.ssh/id_rsa type -1
debug1: identity file /root/.ssh/id_rsa-cert type -1
debug1: identity file /root/.ssh/id_ecdsa type -1
debug1: identity file /root/.ssh/id_ecdsa-cert type -1
debug1: identity file /root/.ssh/id_ecdsa_sk type -1
debug1: identity file /root/.ssh/id_ecdsa_sk-cert type -1
debug1: identity file /root/.ssh/id_ed25519 type -1
debug1: identity file /root/.ssh/id_ed25519-cert type -1
debug1: identity file /root/.ssh/id_ed25519_sk type -1
debug1: identity file /root/.ssh/id_ed25519_sk-cert type -1
debug1: identity file /root/.ssh/id_xmss type -1
debug1: identity file /root/.ssh/id_xmss-cert type -1
debug1: identity file /root/.ssh/id_dsa type -1
debug1: identity file /root/.ssh/id_dsa-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_9.2p1 Debian-2+deb12u2
debug1: Remote protocol version 2.0, remote software version OpenSSH_9.2p1 Debian-2
debug1: compat_banner: match: OpenSSH_9.2p1 Debian-2 pat OpenSSH* compat 0x04000000
debug1: Authenticating to 192.168.4.12:22 as 'root'
debug1: load_hostkeys: fopen /root/.ssh/known_hosts2: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or directory
debug1: SSH2_MSG_KEXINIT sent
Connection reset by 192.168.4.12 port 22

And this is the output from the container's supervisord:

2024-04-03 16:33:59,843 INFO reaped unknown pid 27 (terminated by SIGABRT (core dumped))

When trying to start corosync-qnetd inside the container, I get an error message:

# docker exec -it proxmox-qdevice bash
root@proxmox-qdevice:/# /usr/bin/corosync-qnetd -f
Apr 03 16:35:14 error   Can't open NSS DB directory (2): No such file or directory

This is the docker-compose.yaml I'm using:

services:
  qnetd:
    container_name: proxmox-qdevice
    image: bcleonard/proxmox-qdevice
    build: .
    ports:
      - "22:22"
      - "5403:5403"
    environment:
      - "NEW_ROOT_PASSWORD=xxx"
    volumes:
      - /run/sshd
      - /volume1/docker/proxmox-qdevice/etc-corosync:/etc/corosync
    restart: unless-stopped
    hostname: proxmox-qdevice
    networks:
      vlan:
        ipv4_address: 192.168.4.12

networks:
  vlan:
    driver: macvlan
    driver_opts:
      parent: eth0
      macvlan_mode: bridge
    ipam:
      config:
        - subnet: "192.168.4.0/24"
          ip_range: "192.168.4.12/32"
          gateway: "192.168.4.254"

Maybe I'm just missing something obvious, e.g. missing file/folder permissions, but the README doesn't mention anything in that direction.

bcleonard commented 6 months ago

I don't have access to a Synology NAS, so I've never tested it on such a device. What's the output of "docker exec -it proxmox-qdevice ls -al /etc/corosync/qnetd/nssdb"?

derandiunddasbo commented 6 months ago

After some more testing this seems to be an issue specific to the Synology device. Maybe it's not dealing well with the bookworm-slim binaries.

I successfully ran the container on another Docker host and was able to configure the qdevice there. I then stopped the container on the other host, copied the generated /etc/corosync files back to the Synology and re-ran the container from there. Now it's running successfully on the Synology as well and is voting for the nodes of the cluster.

Nevertheless, sshd still crashes on every connection attempt to this container, but the Proxmox cluster currently doesn't seem to care. I assume it no longer needs to connect via ssh once the qdevice has been configured.

I'll try to build the image from another base image and see if that makes a difference. There are a couple of other containers that have been running fine for years on the NAS, mostly derived from Alpine or Ubuntu images.

bcleonard commented 6 months ago

It could be specific to a Synology device. It feels like the permissions for your volume aren't correct. You generated the files on a separate box and then copied them back over (in other words, you created them by hand, outside of the container) and it worked. Without the directory listing it's hard to tell.

It's possible you've run into two separate issues: one for the corosync files and a second one for ssh.

In regards to the latter issue, it's possible that Synology is not providing some basic container functionality that you get when running Docker on either a virtual instance or a physical box.

I'm still curious as to the output of the ls command.

derandiunddasbo commented 6 months ago

As the container doesn't start up properly, it doesn't generate the /etc/corosync/qnetd/nssdb folder, so I can't tell what permissions would be set.

This is the (empty) /etc/corosync folder from the container, but there isn't much to see:

root@proxmox-qdevice-test:/# ls -la /etc/corosync/
total 0
drwxr-xr-x 1 root root    0 Apr  7 05:33 .
drwxr-xr-x 1 root root 1396 Apr  7 05:40 ..
root@proxmox-qdevice-test:/#

I did some more research and I think the old Linux kernel of the Synology is the problem. When I manually ran the sshd server in debug mode, it exited with Fatal glibc error: cannot get entropy for arc4random:

root@proxmox-qdevice-test:/# /etc/init.d/ssh stop
Stopping OpenBSD Secure Shell server: sshd.
root@proxmox-qdevice-test:/# /usr/sbin/sshd -d
debug1: sshd version OpenSSH_9.2, OpenSSL 3.0.9 30 May 2023
debug1: private host key #0: ssh-rsa SHA256:evppNYhSngvfYnxWy6L6rPFbgizq3R8CE9xJH9s3KQ4
debug1: private host key #1: ecdsa-sha2-nistp256 SHA256:q8NXGE3D7J00iLuHxiMsMPven4PEakFJBG/S9p7lRiY
debug1: private host key #2: ssh-ed25519 SHA256:pLws2Dmwo2gq4S8n2+IT0qjhKOIWIegFmOkVIQgdNkI
debug1: rexec_argv[0]='/usr/sbin/sshd'
debug1: rexec_argv[1]='-d'
debug1: Set /proc/self/oom_score_adj from 0 to -1000
debug1: Bind to port 22 on 0.0.0.0.
Server listening on 0.0.0.0 port 22.
debug1: Bind to port 22 on ::.
Server listening on :: port 22.
debug1: Server will not fork when running in debugging mode.
debug1: rexec start in 6 out 6 newsock 6 pipe -1 sock 9
debug1: sshd version OpenSSH_9.2, OpenSSL 3.0.9 30 May 2023
debug1: private host key #0: ssh-rsa SHA256:evppNYhSngvfYnxWy6L6rPFbgizq3R8CE9xJH9s3KQ4
debug1: private host key #1: ecdsa-sha2-nistp256 SHA256:q8NXGE3D7J00iLuHxiMsMPven4PEakFJBG/S9p7lRiY
debug1: private host key #2: ssh-ed25519 SHA256:pLws2Dmwo2gq4S8n2+IT0qjhKOIWIegFmOkVIQgdNkI
debug1: inetd sockets after dupping: 5, 5
Connection from 192.168.4.250 port 52616 on 192.168.4.14 port 22 rdomain ""
debug1: Local version string SSH-2.0-OpenSSH_9.2p1 Debian-2
debug1: Remote protocol version 2.0, remote software version OpenSSH_9.2p1 Debian-2+deb12u2
debug1: compat_banner: match: OpenSSH_9.2p1 Debian-2+deb12u2 pat OpenSSH* compat 0x04000000
debug1: permanently_set_uid: 100/65534 [preauth]
debug1: ssh_sandbox_child: prctl(PR_SET_SECCOMP): Invalid argument [preauth]
debug1: list_hostkey_types: rsa-sha2-512,rsa-sha2-256,ecdsa-sha2-nistp256,ssh-ed25519 [preauth]
Fatal glibc error: cannot get entropy for arc4random
debug1: monitor_read_log: child log fd closed
debug1: do_cleanup
debug1: Killing privsep child 144
debug1: audit_event: unhandled event 12

This is a problem with recent OpenSSH versions using arc4random, which the ancient Synology kernel doesn't support. When the image is built from debian:bullseye-slim, sshd runs fine thanks to the older OpenSSH version.

I think there's not much that can be done about this, apart from building the image from a Debian < 12 base image on Synology devices, as Synology is notorious for almost never updating the Linux kernels of their devices' firmware. :-)

bcleonard commented 6 months ago

Based on your research, it does appear the Synology kernel is causing the sshd problems.

Typically, when you can't write/create files within Docker on external volumes, it's a permissions issue. RHEL (and its derivatives) had all kinds of problems with SELinux. What is running the container? For me:

[bradley@appserver05 ~]$ docker exec -it proxmox-qdevice id
uid=0(root) gid=0(root) groups=0(root)

and the qnetd directory ends up like this:

[bradley@appserver05 ~]$ docker exec -it proxmox-qdevice ls -al /etc/corosync
total 16
drwxr-xr-x 3 root root 4096 Aug  8  2023 .
drwxr-xr-x 1 root root 4096 Mar 21 15:51 ..
drwxr-xr-x 3 root root 4096 Aug  8  2023 qnetd

The full structure looks like this:

[bradley@appserver05 ~]$ docker exec -it proxmox-qdevice ls -al /etc/corosync/qnetd/nssdb
total 112
drwxr-x--- 2 root coroqnetd  4096 Aug  8  2023 .
drwxr-xr-x 3 root root       4096 Aug  8  2023 ..
-rw-r----- 1 root coroqnetd 28672 Aug  8  2023 cert9.db
-rw-r----- 1 root root        683 Aug  8  2023 cluster-stygianresearch.crt
-rw-r----- 1 root coroqnetd 53248 Aug  8  2023 key4.db
-rw-r----- 1 root root         41 Aug  8  2023 noise.txt
-rw-r----- 1 root root        432 Aug  8  2023 pkcs11.txt
-rw-r----- 1 root root          0 Aug  8  2023 pwdfile.txt
-rw-r--r-- 1 root root       4272 Aug  8  2023 qnetd-cacert.crt
-rw-r----- 1 root root          4 Aug  8  2023 serial.txt

As you can see, the qnetd process creates directories & files with the group ID coroqnetd.
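
If the bind-mounted directory was created by hand on the host, one way to line the ownership up with that (a sketch, run from the Docker host; container name taken from the compose file above) would be:

docker exec -it proxmox-qdevice chown -R root:coroqnetd /etc/corosync/qnetd/nssdb
docker exec -it proxmox-qdevice chmod 750 /etc/corosync/qnetd/nssdb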

I'm not sure what else can be done other than to downgrade to Debian 11.

millerim commented 5 months ago

Did this get resolved? I'd like to run this container on my Synology. I'm happy to troubleshoot if that's helpful.

bcleonard commented 5 months ago

No. I don't have access to a Synology box, so I have no way of testing.

I'm not comfortable downgrading the base image to bullseye to cover this specific use case.

I know that you can build/tag different versions of Dockerfiles, but I have no idea how to do that. A container with tag x would be bookworm-based, tag y would be bullseye-based, and tag z could even be ARM-based (there was a request for that).

If somebody can point me to half-way decent instructions on how to set that up, I'll take a swing at it.
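
From what I understand, the usual pattern would be something like this (untested sketch; it assumes the Dockerfile is changed to take the base image as a build argument):

# Dockerfile (first lines)
ARG BASE_IMAGE=debian:bookworm-slim
FROM ${BASE_IMAGE}
[...]

# build and tag the variants
docker build -t bcleonard/proxmox-qdevice:bookworm .
docker build --build-arg BASE_IMAGE=debian:bullseye-slim -t bcleonard/proxmox-qdevice:bullseye .

# ARM images are usually handled with docker buildx, e.g.
docker buildx build --platform linux/amd64,linux/arm64 -t bcleonard/proxmox-qdevice:bookworm --push .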

derandiunddasbo commented 5 months ago

Did this get resolved? I'd like to run this container on my Synology. I'm happy to troubleshoot if that's helpful.

As written, the only viable "solution" for this issue is building the image yourself with an older Debian base image. There is very little chance Synology will ever update their Linux kernel to a more recent version just because some customers want to run recent Docker images on their "old" hardware.

I've gone with that, as the container builds and runs fine with just a single change to the Dockerfile.

To build the image yourself, you just have to download at least "Dockerfile", "set_root_password.sh" and "supervisord.conf" from this repo to a folder on your Synology NAS (or clone the complete repo), cd into that folder, edit the Dockerfile and change the first line to:

FROM debian:bullseye-slim
[...]

After building the image with docker build -t thisismyprivatedockerrepo/proxmox-qdevice ., you can replace the image in the docker-compose.yml...

services:
  qnetd:
    container_name: proxmox-qdevice
#    image: bcleonard/proxmox-qdevice
    image: thisismyprivatedockerrepo/proxmox-qdevice
[...]

...and run the container.
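
Condensed, the whole procedure boils down to something like this (the build folder path is just an example, adjust it to wherever you put the files):

# assuming the repo files were copied to this folder on the Synology
cd /volume1/docker/proxmox-qdevice-build
# point the first line of the Dockerfile at the older base image (or edit it by hand)
sed -i '1s|.*|FROM debian:bullseye-slim|' Dockerfile
docker build -t thisismyprivatedockerrepo/proxmox-qdevice .
docker-compose up -d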

millerim commented 4 months ago

Wonderful roadmap, which enabled me to get this working on my old DSM 6.2 Synology! I had to make a couple of changes: I switched to a standard bridge network and mapped ssh to a different port on the Synology, which required an entry in the .ssh/config files on the Proxmox nodes so they could ssh into the qdevice container. Thanks for clarifying this for me!

pandel commented 4 months ago

@millerim Would you mind posting your compose.yaml and the entries (anonymized, of course) of your .ssh/config file? I am trying to get this up and running too, but somehow, even though the container starts, sshd stops again, and maybe there is just a simple configuration bug.

millerim commented 4 months ago

Here you go:

docker-compose.yml:

services:
  qnetd:
    container_name: proxmox-qdevice
    image: {username}/proxmox-qdevice
    build: .
    ports:
      [...]

networks:
  qnetlan:
    driver: bridge
    driver_opts:
      parent: eth0
    ipam:
      driver: default
      config:
        [...]

.ssh/config:

Host {IP Address of Synology}
    HostName {IP Address of Synology}
    User root
    Port 222

pandel commented 4 months ago

Hey @millerim ! Many thanks! Your configuration is nearly the same as mine, so I'll have to take a deeper look into what is going wrong here.

pandel commented 4 months ago

FWIW: After building the image as described by @derandiunddasbo and using the slightly modified compose.yaml posted by @millerim

services:
  qnetd:
    container_name: proxmox-qdevice
    image: thisismyprivatedockerrepo/proxmox-qdevice
    ports:
      - "2222:22"
      - "5403:5403"
    environment:
      - "NEW_ROOT_PASSWORD=<your password>"
    volumes:
      - /run/sshd
      - /volume1/docker/qdevice/corosync-data:/etc/corosync
    restart: unless-stopped
    networks: 
      qdevice-net:
        ipv4_address: 172.18.1.20

networks:
  qdevice-net:
    name: qdevice-net
    driver: bridge
    driver_opts:
      parent: eth0
    ipam:
      driver: default
      config:
        - subnet: "172.18.0.0/16"

the container started; corosync-qnetd still exited with error 1, but luckily sshd came up, so the container did not stop completely and I had the chance to step inside.

I then ran

docker exec -ti proxmox-qdevice /bin/bash

to examine the container. It turned out that running

corosync-qnetd -d -d -f

always resulted in an error similar to

corosync-qnetd: Can't open NSS DB directory.

I examined the apt install process and found out that during package configuration some files and directories are created inside /etc/corosync. These folders and files were now missing, as we bind-mount the folder from the compose.yaml and that folder is initially empty.

So, I executed

dpkg-reconfigure corosync-qnetd

This created a new folder /etc/corosync/qnetd/nssdb and placed some initial files inside. I then exited the container.

After docker-compose stop and docker-compose start, the corosync-qnetd daemon started successfully.

So it seems that during the initial package setup while building the image, the folder /etc/corosync/qnetd/nssdb is filled with some basic files. These files are missing when you bind-mount the folder from outside via the - /volume1/docker/qdevice/corosync-data:/etc/corosync line in the compose.yaml.

I don't know if it is possible to bind-mount the external /etc/corosync folder during the docker build phase to make sure these files don't get lost.
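
One possible workaround (an untested sketch) would be to seed the host folder from the freshly built image before the first start, so the files created at build time survive the bind mount:

# copy the image's /etc/corosync content into the host folder used by the bind mount
# (qdevice-seed is just a throwaway container name)
docker create --name qdevice-seed thisismyprivatedockerrepo/proxmox-qdevice
docker cp qdevice-seed:/etc/corosync/. /volume1/docker/qdevice/corosync-data/
docker rm qdevice-seed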

millerim commented 4 months ago

I believe it's normal for qnetd to fail before the qdevice is added to the quorum. Once it is added, all the files are created and it starts successfully.
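
For reference, the step that adds the qdevice and creates those files is run from one of the Proxmox cluster nodes, along the lines of:

pvecm qdevice setup <qdevice-ip>

(with <qdevice-ip> being the address of the container; the exact options may differ depending on your setup.)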

pandel commented 4 months ago

OK, I did not know that. I thought corosync-qnetd was needed for the whole pairing process, but you might be right: the files are copied over via ssh, so it might be OK to start with a failing process and restart the container afterwards.

But it was interesting to dig a bit deeper into all of that and get a better understanding. So it wasn't wasted time in the end 😁