CharleeSF opened this issue 3 months ago (status: Open)
Also see the discussion here: https://forum.snapcraft.io/t/possible-to-run-scripts-as-root-on-startup-on-ubuntu-core/39580/7
And the comment by mborzecki1:
I briefly looked at the snap. The docker-proxy service should really be made a socket activated service, then systemd starts listening on the socket and that nonsense of unlink/reuseaddr is gone. Unfortunately, you’d need to add code to hand over the file descriptor provided by systemd to socat. Other services which may use what docker-proxy provides should use after: [docker-proxy] in their declarations. This needs to be fixed in snap packaging, so there’s not much you can do, except for opening some PRs or filing bugs if you’re simply consuming the snap.
Hi @CharleeSF I tried to repro locally and couldn't, and we haven't seen this in our internal testing. But I see from the forum thread that the problem is fairly well understood. @alexclewontin can you comment on the suggested approach in the forum thread?
Hey @damonbarry, thanks for looking into this!
I am also not always able to reproduce it. The strange thing is, once I can reproduce it, it happens on every reboot, but not every setup has it.
Since the setup takes quite a lot of time, I haven't tried to find a 100% reproduction scenario.
I have, however, also seen docker-proxy fail because /var/run/docker.sock is not available yet.
May I ask why edged doesn't talk to /var/run/docker.sock directly?
Hey @damonbarry,
Is there any progress on this? I just wanted to mention that I also regularly see that docker seems slower/later in booting than azure-iot-edge, resulting in behavior like this:
```
2024-04-17T17:51:26Z systemd[1]: Started Service for snap application azure-iot-edge.aziot-edged.
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1825]: Making /var/run/iotedge if it does not exist
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1825]: Successfully made /var/run/iotedge if it did not exist
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Starting Azure IoT Edge Daemon
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Version - 1.4.33
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Obtaining Edge device provisioning data...
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Device is SnapeQemuAuto2 on iq-shared-0-8ccd7.azure-devices.net
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Initializing module runtime...
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Using runtime network id azure-iot-edge
2024-04-17T17:51:26Z azure-iot-edge.docker-proxy[1833]: 2024/04/17 17:51:26 socat[1833] E connect(5, AF=1 "/var/run/docker.sock", 22): No such file or directory
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [WARN] - container runtime error
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: Caused by:
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: channel closed
```
Further supporting a revision of the docker-proxy behavior (in the snapcraft forum thread a solution is suggested for this: waiting for /var/run/docker.sock to become available).
Restarting the docker-proxy daemon fixes the issue, but as mentioned before, I don't want to have to do anything on my device for it to boot properly. I think a longer interval between restarts of the docker-proxy daemon would also help. It seems to do 8 retries, but they all happen before docker has made the socket available.
I am testing azure-iot-edge with quite heavy workloads, maybe that's why docker boots slower?
I can weigh in more eventually, but quickly hopping in to provide some context on why the proxy exists and iotedge doesn't talk to docker directly:
The issue is that in all-snap environments (i.e. Ubuntu Core) docker is provided as a snap, and there is no docker group, so you essentially cannot talk to docker.sock if you are running as UID != 0. aziot-edged runs as user snap_aziotedge and so the docker proxy runs as root in the context of the iotedge snap, but provides the proxy socket with snap_aziotedge ownership to let aziot-edged "escalate" its privileges here, without opening a massive hole that would allow any user to talk to the docker socket.
My naive suggestion would be that `edgelet/contrib/snap/socat.sh` could try `rm -f $SNAP_COMMON/docker-proxy.sock` before listening, which would maybe clean up the issue where it fails to listen because the file already exists. Maybe making `aziot-edged` `after: docker-proxy` would also help with the timing issue? Because it's a simple daemon, this may or may not be comprehensive; the right way to handle that would probably be to make docker-proxy a notify-type daemon and use `systemd-notify` to indicate readiness after successfully listening. But there are some issues with snap confinement/PID numbers when calling it from a shell script, so I'd have to play with that to see if I could make it work.
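The stale-socket cleanup plus a bounded wait could look something like this in the snap's socat wrapper. This is a hypothetical sketch, not the snap's actual `socat.sh`: the `wait_for_socket` helper, the timeout value, and the trailing socat invocation (shown as a comment) are all illustrative.

```shell
#!/bin/sh
# Hypothetical sketch of a hardened socat wrapper; not the snap's real code.

PROXY_SOCK="${SNAP_COMMON:-/var/snap/azure-iot-edge/common}/docker-proxy.sock"
DOCKER_SOCK="/var/run/docker.sock"

# A stale socket file left by an unclean shutdown makes socat's listen
# fail, so remove it up front (the `rm -f` suggested above).
rm -f "$PROXY_SOCK"

# Wait for a path to appear, giving up after a timeout so systemd's
# restart logic still gets a chance to kick in.
wait_for_socket() {
    path=$1
    timeout=${2:-30}
    waited=0
    while [ ! -e "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Gave up waiting for $path after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# The real wrapper would then do something like:
#   wait_for_socket "$DOCKER_SOCK" 30 || exit 1
#   exec socat UNIX-LISTEN:"$PROXY_SOCK",fork UNIX-CONNECT:"$DOCKER_SOCK"
```

The timeout matters: an unbounded wait loop would hide the failure from systemd entirely, whereas failing after a bounded wait keeps the restart-and-retry path alive.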
@alexclewontin
Ah, thanks for the explanation about why the proxy exists! :) That makes more sense now.
Also, for the last problem I've had, would adding something like this to socat.sh work?
```
docker_socket="/var/run/docker.sock"
while [ ! -e "$docker_socket" ]; do
    echo "Docker socket ($docker_socket) does not exist yet. Sleeping"
    sleep 1
done
```
I think that, together with the `after: docker-proxy`, might do the trick?
My reservation there is that because docker-proxy is a simple daemon, systemd doesn't know the difference between the wait loop and actively listening on the socket, so even when it enters the wait loop systemd will consider the proxy ready and then try to start aziot-edged. I think I'd rather keep it so socat errors out, because then there's potential for systemd to catch the problem and wait on starting aziot-edged. However that's still a bit racy, depending on how quickly socat errors out vs how quickly systemd starts aziot-edged.
The systemd-notify approach would address that race by waiting for the script to actively affirm that it is indeed ready, after successfully listening on the socket.
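A sketch of what that readiness signalling could look like, assuming `daemon: notify` in snapcraft.yaml. Everything here is illustrative (the helper name, the paths), and as noted above, `systemd-notify` from a shell script under snap confinement has PID/confinement caveats, so treat this as the shape of the idea rather than working code.

```shell
#!/bin/sh
# Sketch of the notify-type approach discussed above. Illustrative only:
# the thread notes systemd-notify has snap confinement/PID issues when
# called from a shell script.

PROXY_SOCK="${SNAP_COMMON:-/var/snap/azure-iot-edge/common}/docker-proxy.sock"

# Tell systemd we are ready only once the proxy socket actually exists,
# so dependents like aziot-edged are not started early.
signal_ready_when_listening() {
    sock=$1
    while [ ! -e "$sock" ]; do
        sleep 1
    done
    # Only attempt the notification when systemd is listening for it.
    if [ -n "${NOTIFY_SOCKET:-}" ]; then
        systemd-notify --ready
    fi
}

# The real wrapper (with `daemon: notify` in snapcraft.yaml) would do:
#   signal_ready_when_listening "$PROXY_SOCK" &
#   exec socat UNIX-LISTEN:"$PROXY_SOCK",fork UNIX-CONNECT:/var/run/docker.sock
```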
I see I see.
Maybe we can make the daemon retry interval a little longer? I think it is currently very fast, stops after a few attempts because of that, and doesn't recover.
(Currently this retry is also triggered because aziot-edged fails, but it doesn't recover, because the socket only becomes available after systemd has given up on restarting it.)
To give you an idea of the timeframe... I made a little daemon script that helps me recover from this.
The script:
```
#!/bin/bash
# This script removes a problematic file preventing the snap from starting after reboot.
echo "Running remove-socket.sh"
ls -l /var/snap/azure-iot-edge/common/docker-proxy.sock
echo "Removing /var/snap/azure-iot-edge/common/docker-proxy.sock"
rm -f /var/snap/azure-iot-edge/common/docker-proxy.sock

docker_socket="/var/run/docker.sock"
while [ ! -e "$docker_socket" ]; do
    echo "Docker socket ($docker_socket) does not exist yet. Sleeping"
    sleep 1
done

echo "Sleeping to see if docker-proxy revives itself"
sleep 3

echo "Checking if docker-proxy is in failed state + restart everything if yes"
if sudo snap logs azure-iot-edge.docker-proxy | tail -n1 | grep -q '"/var/run/docker.sock", 22): No such file or directory'; then
    echo "Restarting azure-iot-edge because it had failed due to no docker.sock"
    sudo snap restart azure-iot-edge
fi
```
The output after a reboot:
```
Apr 17 18:50:49 ubuntu remove-socket.sh[735]: Running remove-socket.sh
Apr 17 18:50:49 ubuntu remove-socket.sh[741]: ls: cannot access '/var/snap/azure-iot-edge/common/docker-proxy.sock': No such file or directory
Apr 17 18:50:49 ubuntu remove-socket.sh[735]: Removing /var/snap/azure-iot-edge/common/docker-proxy.sock
Apr 17 18:50:49 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:50 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:51 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:52 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:53 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:54 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:55 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:56 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:57 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:58 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:59 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:00 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:01 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:02 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:03 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:04 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:05 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:06 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:07 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:08 ubuntu remove-socket.sh[735]: Sleeping to see if docker-proxy revives itself
Apr 17 18:51:11 snape remove-socket.sh[735]: Checking if docker-proxy is in failed state + restart everything if yes
Apr 17 18:51:11 snape remove-socket.sh[2059]: 2024-04-17T18:51:08Z azure-iot-edge.docker-proxy[1668]: 2024/04/17 18:51:08 socat[1668] E connect(5, AF=1 "/var/run/docker.sock", 22): No such file or directory
Apr 17 18:51:11 snape remove-socket.sh[735]: Restarting azure-iot-edge because it had failed due to no docker.sock
Apr 17 18:51:11 snape remove-socket.sh[2074]: Restarted.
```
Yeah, certainly. I think setting the retry interval on at least the proxy, if not both daemons, to 1 or more seconds would be a helpful first step.
Should I make a PR for that, or do you guys prefer to do it?
Bumping this topic, because this issue makes the azure-iot-edge snap unreliable and requires hacks to work around.
@CharleeSF Can I get a little more info about the PR you're proposing?
I see a few different ideas proposed here:
- adding `after: docker-proxy` to `aziot-edged`
- changing `daemon: simple` to `daemon: notify` and using `systemd-notify` to indicate readiness

Which of these ideas (or others) would your PR contain?
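Concretely, the snapcraft.yaml changes being discussed might look like the fragment below. This is a sketch, not the snap's actual packaging: the app names follow the thread, but the command paths are placeholders.

```yaml
apps:
  docker-proxy:
    command: bin/socat.sh        # placeholder path
    daemon: notify               # was: daemon: simple
  aziot-edged:
    command: bin/aziot-edged     # placeholder path
    daemon: simple
    after: [docker-proxy]        # do not start until docker-proxy reports ready
```

With `daemon: notify`, the `after:` ordering only becomes meaningful once the proxy actually signals readiness; with `daemon: simple`, systemd considers the proxy "started" the moment its process exists.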
@damonbarry eventually just dropping docker-proxy might be possible; see https://ubuntu.com/core/docs/docker-deploy and https://forum.snapcraft.io/t/how-to-use-docker-containers-in-a-snap-application/39177 as starting points. And, of course, the dedicated docker interface connection: https://snapcraft.io/docs/docker-interface
@KhazAkar @CharleeSF can this issue be resolved? It looks like it has been inactive for a while. I will proceed and close this if I don't hear anything by end of week.
I'm just here dropping some useful info; my C#/Rust knowledge is ~0, so I can't implement anything myself.
Expected Behavior
If I have set up and installed the azure-iot-edge snap, I should be able to reboot the device and azure-iot-edge should start without any issues.
Current Behavior
azure-iot-edge fails to start with the following logs:
Steps to Reproduce
I verify if it is working by checking the logs
Context (Environment)
snap changes
Output of `iotedge check`
Seems to be OK other than docker-proxy.sock not being accessible, which is the bug I'm reporting.
```
charlee@ubuntucore:~$ sudo iotedge check
Configuration checks (aziot-identity-service)
---------------------------------------------
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
√ aziot-identity-service package is up-to-date - OK
√ host time is close to reference time - OK
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK

Connectivity checks (aziot-identity-service)
--------------------------------------------
√ host can connect to and perform TLS handshake with iothub AMQP port - OK
√ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with iothub MQTT port - OK

Configuration checks
--------------------
√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
× container engine is installed and functional - Error
    Could not communicate with container engine at unix:///var/snap/azure-iot-edge/common/docker-proxy.sock.
    Please check your moby-engine installation and ensure the service is running.
√ aziot-edge package is up-to-date - OK
‼ DNS server - Warning
    Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
    Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
    You can ignore this warning if you are setting DNS server per module in the Edge deployment.
‼ production readiness: logs policy - Warning
    Container engine is not configured to rotate module logs which may cause it run out of disk space.
    Please see https://aka.ms/iotedge-prod-checklist-logs for best practices.
    You can ignore this warning if you are setting log policy per module in the Edge deployment.
√ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK

Connectivity checks
-------------------
23 check(s) succeeded.
2 check(s) raised warnings. Re-run with --verbose for more details.
1 check(s) raised errors. Re-run with --verbose for more details.
11 check(s) were skipped due to errors from other checks. Re-run with --verbose for more details.
```
Device Information
Runtime Versions
Workaround
```
sudo rm /var/snap/azure-iot-edge/common/docker-proxy.sock && sudo snap restart azure-iot-edge
```
Notes
Not sure if it is reproducible 100% of the time, but it occurs often enough to be a problem. Especially since my device is supposed to be able to boot without human interaction.