Azure / iotedge

The IoT Edge OSS project
MIT License
1.45k stars 457 forks source link

Iotedge fails to make docker-proxy.sock after reboot (file exists) in Ubuntu Core #7255

Open CharleeSF opened 3 months ago

CharleeSF commented 3 months ago

Expected Behavior

If I have setup and installed azure-iot-edge snap I should be able to reboot the device and azure-iot-edge should start without any issues.

Current Behavior

azure-iot-edge fails to start with the following logs:

2024-03-27T17:29:35Z systemd[1]: Started Service for snap application azure-iot-edge.docker-proxy.
2024-03-27T17:29:35Z azure-iot-edge.docker-proxy[1201]: 2024/03/27 17:29:35 socat[1201] E "/var/snap/azure-iot-edge/common/docker-proxy.sock" exists
2024-03-27T17:29:35Z systemd[1]: snap.azure-iot-edge.docker-proxy.service: Main process exited, code=exited, status=1/FAILURE
2024-03-27T17:29:35Z systemd[1]: snap.azure-iot-edge.docker-proxy.service: Failed with result 'exit-code'.
2024-03-27T17:29:35Z systemd[1]: snap.azure-iot-edge.docker-proxy.service: Scheduled restart job, restart counter is at 5.
2024-03-27T17:29:35Z systemd[1]: Stopped Service for snap application azure-iot-edge.docker-proxy.
2024-03-27T17:29:35Z systemd[1]: snap.azure-iot-edge.docker-proxy.service: Start request repeated too quickly.
2024-03-27T17:29:35Z systemd[1]: snap.azure-iot-edge.docker-proxy.service: Failed with result 'exit-code'.
2024-03-27T17:29:35Z systemd[1]: Failed to start Service for snap application azure-iot-edge.docker-proxy.
2024-03-27T17:29:38Z azure-iot-edge.aziot-edged[1078]: 2024-03-27T17:29:38Z [INFO] - Device is QemuDevice1 on clp-dev-iq-shared-8ccd7.azure-devices.net
2024-03-27T17:29:38Z azure-iot-edge.aziot-edged[1078]: 2024-03-27T17:29:38Z [INFO] - Initializing module runtime...
2024-03-27T17:29:38Z azure-iot-edge.aziot-edged[1078]: 2024-03-27T17:29:38Z [INFO] - Using runtime network id azure-iot-edge
2024-03-27T17:29:38Z azure-iot-edge.aziot-edged[1078]: 2024-03-27T17:29:38Z [WARN] - container runtime error
2024-03-27T17:29:38Z azure-iot-edge.aziot-edged[1078]: Caused by:
2024-03-27T17:29:38Z azure-iot-edge.aziot-edged[1078]:     0: error trying to connect: Connection refused (os error 111)
2024-03-27T17:29:38Z azure-iot-edge.aziot-edged[1078]:     1: Connection refused (os error 111)
2024-03-27T17:29:38Z azure-iot-edge.aziot-edged[1078]: 2024-03-27T17:29:38Z [ERR!] - Failed to initialize module runtime: runtime operation error: initialize module runtime
2024-03-27T17:29:38Z systemd[1]: snap.azure-iot-edge.aziot-edged.service: Main process exited, code=exited, status=1/FAILURE
2024-03-27T17:29:38Z systemd[1]: snap.azure-iot-edge.aziot-edged.service: Failed with result 'exit-code'.
2024-03-27T17:29:38Z systemd[1]: snap.azure-iot-edge.aziot-edged.service: Scheduled restart job, restart counter is at 1.
2024-03-27T17:29:38Z systemd[1]: Stopped Service for snap application azure-iot-edge.aziot-edged.

Steps to Reproduce

Provide a detailed set of steps to reproduce the bug.

  1. Install azure-iot-edge and azure-iot-identity (both on latest/stable, revision 32 and 27 respectively)
  2. Setup azure-iot-edge with primary connection string (FYI: for me it only works with setting raw-config, otherwise I get permission errors)
    sudo echo "[provisioning]
    source = \"manual\"
    connection_string = \"$MY_CONNECTION_STRING\"" > config.toml
    sudo snap set azure-iot-edge raw-config="$(cat config.toml)"

    I verify if it is working by checking the logs

    snap logs azure-iot-edge -f
  3. Reboot the device: now azure-iot-edge fails to start.

Context (Environment)

Output of iotedge check

Seems to be OK other than docker-proxy.sock not being accesible, which is the bug I'm reporting ``` charlee@ubuntucore:~$ sudo iotedge check Configuration checks (aziot-identity-service) --------------------------------------------- √ keyd configuration is well-formed - OK √ certd configuration is well-formed - OK √ tpmd configuration is well-formed - OK √ identityd configuration is well-formed - OK √ daemon configurations up-to-date with config.toml - OK √ identityd config toml file specifies a valid hostname - OK √ aziot-identity-service package is up-to-date - OK √ host time is close to reference time - OK √ preloaded certificates are valid - OK √ keyd is running - OK √ certd is running - OK √ identityd is running - OK √ read all preloaded certificates from the Certificates Service - OK √ read all preloaded key pairs from the Keys Service - OK √ check all EST server URLs utilize HTTPS - OK √ ensure all preloaded certificates match preloaded private keys with the same ID - OK Connectivity checks (aziot-identity-service) -------------------------------------------- √ host can connect to and perform TLS handshake with iothub AMQP port - OK √ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - OK √ host can connect to and perform TLS handshake with iothub MQTT port - OK Configuration checks -------------------- √ aziot-edged configuration is well-formed - OK √ configuration up-to-date with config.toml - OK × container engine is installed and functional - Error Could not communicate with container engine at unix:///var/snap/azure-iot-edge/common/docker-proxy.sock. Please check your moby-engine installation and ensure the service is running. √ aziot-edge package is up-to-date - OK ‼ DNS server - Warning Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub. Please see https://aka.ms/iotedge-prod-checklist-dns for best practices. You can ignore this warning if you are setting DNS server per module in the Edge deployment. ‼ production readiness: logs policy - Warning Container engine is not configured to rotate module logs which may cause it run out of disk space. Please see https://aka.ms/iotedge-prod-checklist-logs for best practices. You can ignore this warning if you are setting log policy per module in the Edge deployment. √ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK Connectivity checks ------------------- 23 check(s) succeeded. 2 check(s) raised warnings. Re-run with --verbose for more details. 1 check(s) raised errors. Re-run with --verbose for more details. 11 check(s) were skipped due to errors from other checks. Re-run with --verbose for more details. ```

Device Information

Runtime Versions

Workaround

sudo rm /var/snap/azure-iot-edge/common/docker-proxy.sock && sudo snap restart azure-iot-edge

Notes

Not sure if it is reproducible 100% of the time, but it occurs often enough to be a problem. Especially since my device is supposed to be able to boot without human interaction.

CharleeSF commented 3 months ago

Also see the discussion here: https://forum.snapcraft.io/t/possible-to-run-scripts-as-root-on-startup-on-ubuntu-core/39580/7

And the comment by mborzecki1:

I briefly looked at the snap. The docker-proxy service should really be made a socket activated service, then systemd starts listening on the socket and that nonsense of unlink/reuseaddr is gone. Unfortunately, you’d need to add code to hand over the file descriptor provided by systemd to socat. Other servies which may use what docker-proxy provides should use after: [docker-proxy] in their declarations. This needs to be fixed in snap packaging, so there’s not much you can do, except for opening some PRs or filing bugs if you’re simply consuming the snap.

damonbarry commented 3 months ago

Hi @CharleeSF I tried to repro locally and couldn't, and we haven't seen this in our internal testing. But I see from the forum thread that the problem is fairly well understood. @alexclewontin can you comment on the suggested approach in the forum thread?

CharleeSF commented 3 months ago

Hey @damonbarry, thanks for looking into this!

I am also not always able to reproduce it. The strange thing is, once I can reproduce it, it happens on every reboot.. But not every setup has it.

Since the setup takes quite a lot of time I haven't tried to get a 100% reproduction scenario.

I have however, also seen that the docker-proxy fails due to /var/run/docker.sock not being available yet.

May I ask why edged doesn't talk to /var/run/docker.sock directly?

CharleeSF commented 2 months ago

Hey @damonbarry,

Is there any progress on this? I just wanted to mention that I also regularly see that docker seems slower/later in booting than azure-iot-edge, resulting in behavior like this:

2024-04-17T17:51:26Z systemd[1]: Started Service for snap application azure-iot-edge.aziot-edged.
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1825]: Making /var/run/iotedge if it does not exist
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1825]: Successfully made /var/run/iotedge if it did not exist
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Starting Azure IoT Edge Daemon
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Version - 1.4.33
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Obtaining Edge device provisioning data...
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Device is SnapeQemuAuto2 on iq-shared-0-8ccd7.azure-devices.net
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Initializing module runtime...
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Using runtime network id azure-iot-edge
2024-04-17T17:51:26Z azure-iot-edge.docker-proxy[1833]: 2024/04/17 17:51:26 socat[1833] E connect(5, AF=1 "/var/run/docker.sock", 22): No such file or directory
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [WARN] - container runtime error
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: Caused by:
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]:     channel closed

Further supporting a revision of the docker-proxy behavior.. (in the snapcraft forum thread a solution is suggested for this, by waiting for the /var/run/docker.sock to become available).

Restarting the docker-proxy daemon fixes the issue, but as mentioned before, I don't want to have to do anything on my device for it to boot properly. I think a slower interval between restarting the daemon of docker-proxy would also help. It seems to do 8 retries but they are all before docker has made the socket available.

I am testing azure-iot-edge with quite heavy workloads, maybe that's why docker boots slower?

alexclewontin commented 2 months ago

I can weigh in more eventually, but quickly hopping in to provide some context on why the proxy exists and iotedge doesn't talk to docker directly:

The issue is that in all-snap environments (i.e. Ubuntu Core) docker is provided as a snap, and there is no docker group, so you essentially cannot talk to docker.sock if you are running as UID != 0. aziot-edged runs as user snap_aziotedge and so the docker proxy runs as root in the context of the iotedge snap, but provides the proxy socket with snap_aziotedge ownership to let aziot-edged "escalate" its privileges here, without opening a massive hole that would allow any user to talk to the docker socket.

alexclewontin commented 2 months ago

My naive suggestion would be that edglet/contrib/snap/socat.sh could try rm -f $SNAP_COMMON/docker-proxy.sock before listening, which would maybe clean up the issue where it fails to listen because the file already exists. Maybe making aziot-edged after: docker-proxy would also help with the timing issue? Because it's a simple daemon this may or may not be comprehensive; the right way to handle that would probably be to make docker-proxy a notify-type daemon and use systemd-notify to indicate readiness after successfully listening, but there are some issues with snap confinement/PID numbers when calling it from a shell script, so I'd have to play with that to see if I could make it work

CharleeSF commented 2 months ago

@alexclewontin

Ah, thanks for the explanation about why the proxy exists! :) That makes more sense now.

Also, for the last problem I've had, would adding something like this to socat.sh work?

docker_socket="/var/run/docker.sock"

while [ ! -e "$docker_socket" ]; do
    echo "Docker socket ($docker_socket) does not exist yet. Sleeping"
    sleep 1
done

I think that together with the after: docker-proxy might do the trick?

alexclewontin commented 2 months ago

My reservation there is that because docker-proxy is a simple daemon, systemd doesn't know the difference between the wait loop and actively listening on the socket, so even when it enters the wait loop systemd will consider the proxy ready and then try to start aziot-edged. I think I'd rather keep it so socat errors out, because then there's potential for systemd to catch the problem and wait on starting aziot-edged. However that's still a bit racy, depending on how quickly socat errors out vs how quickly systemd starts aziot-edged.

The systemd-notify approach would address that race by waiting for the script to actively affirm that it is indeed ready, after successfully listening on the socket.

CharleeSF commented 2 months ago

I see I see.

Maybe we can consider the daemon retry interval a little longer? I think it is currently very fast and stops after a few times because of it and doesn't recover..

(Currently this retry is also triggered, because aziot-edged fails, but it doesn't recover because the socket becomes available after systemd has given up on restarting it)

CharleeSF commented 2 months ago

To give you an idea of the timeframe... I made a little daemon script that helps me recover from this.

The script:

#!/bin/bash
# This script removes a problematic file preventing snap from starting after reboot.
echo "Running remove-socket.sh"
ls -l /var/snap/azure-iot-edge/common/docker-proxy.sock
echo "Removing /var/snap/azure-iot-edge/common/docker-proxy.sock"
rm -f /var/snap/azure-iot-edge/common/docker-proxy.sock

docker_socket="/var/run/docker.sock"

while [ ! -e "$docker_socket" ]; do
    echo "Docker socket ($docker_socket) does not exist yet. Sleeping"
    sleep 1
done

echo "Sleeping to see if docker-proxy revives itself"
sleep 3

echo "Checking if docker-proxy is in failed state + restart everything if yes"
sudo snap logs azure-iot-edge.docker-proxy | tail -n1 | grep '"/var/run/docker.sock", 22): No such file or directory'
if [ $? -eq 0 ]; then
    echo "Restarting azure-iot-edge because it had failed due to no docker.sock"
    sudo snap restart azure-iot-edge
fi

The output after a reboot:

Apr 17 18:50:49 ubuntu remove-socket.sh[735]: Running remove-socket.sh
Apr 17 18:50:49 ubuntu remove-socket.sh[741]: ls: cannot access '/var/snap/azure-iot-edge/common/docker-proxy.sock': No such file or directory
Apr 17 18:50:49 ubuntu remove-socket.sh[735]: Removing /var/snap/azure-iot-edge/common/docker-proxy.sock
Apr 17 18:50:49 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:50 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:51 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:52 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:53 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:54 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:55 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:56 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:57 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:58 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:59 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:00 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:01 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:02 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:03 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:04 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:05 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:06 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:07 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:08 ubuntu remove-socket.sh[735]: Sleeping to see if docker-proxy revives itself
Apr 17 18:51:11 snape remove-socket.sh[735]: Checking if docker-proxy is in failed state + restart everything if yes
Apr 17 18:51:11 snape remove-socket.sh[2059]: 2024-04-17T18:51:08Z azure-iot-edge.docker-proxy[1668]: 2024/04/17 18:51:08 socat[1668] E connect(5, AF=1 "/var/run/docker.sock", 22): No such file or directory
Apr 17 18:51:11 snape remove-socket.sh[735]: Restarting azure-iot-edge because it had failed due to no docker.sock
Apr 17 18:51:11 snape remove-socket.sh[2074]: Restarted.
alexclewontin commented 2 months ago

Yeah certainly, I think setting the retry interval on at least the proxy, if not both daemons to 1 or more seconds would be a helpful first step

CharleeSF commented 2 months ago

Should I make a PR for that, or do you guys prefer to do it?

KhazAkar commented 1 month ago

Bumping the topic using comment, because it makes azure-iot-edge snap unreliable and requires hacks to circumvent them.

damonbarry commented 4 weeks ago

@CharleeSF Can I get a little more info about the PR you're proposing?

I see a few different ideas proposed here:

Which of these ideas (or others) would your PR contain?

KhazAkar commented 4 weeks ago

@damonbarry eventually just dropping docker-proxy might be possible, see https://ubuntu.com/core/docs/docker-deploy , https://forum.snapcraft.io/t/how-to-use-docker-containers-in-a-snap-application/39177 as starting points. And, of course, dedicated connection https://snapcraft.io/docs/docker-interface

bishal41 commented 3 days ago

@KhazAkar @CharleeSF can this issue be resolved, looks like this is inactive for a while. I will proceed and close this if I don't hear anything by end of week.

KhazAkar commented 2 days ago

@KhazAkar @CharleeSF can this issue be resolved, looks like this is inactive for a while. I will proceed and close this if I don't hear anything by end of week.

I'm here just dropping some useful info, my C#/Rust knowledge is ~0 to implement anything