Opentrons / buildroot

The Opentrons fork of buildroot for building the OT2 system. Our default branch is opentrons-develop.
http://buildroot.org

Can't run cloudflared on OT-2 #189

Open arogozhnikov opened 1 year ago

arogozhnikov commented 1 year ago

cloudflared is a tunneling utility that connects to and from a company's internal network. It provides a zero-trust connection from the internet.

It supports pretty much any OS of any distribution.

However, buildroot locks systemd for editing, and when cloudflared tries to install its service, I run into this problem:

2022-11-30T16:46:41Z INF Using Systemd
2022-11-30T16:46:41Z ERR error generating service template error="error writing /etc/systemd/system/cloudflared.service: open /etc/systemd/system/cloudflared.service: read-only file system"
error writing /etc/systemd/system/cloudflared.service: open /etc/systemd/system/cloudflared.service: read-only file system

Are there any tools to override this? I saw the discussion in #106 about allowing customers to use services.

sfoster1 commented 1 year ago

The root filesystem (which includes /usr, /lib, /bin, and /etc which is where default systemd units are stored) in the ot-2 is mounted read-only to prevent modification. This is because that root filesystem is completely overwritten during update - so if you put a systemd service in /etc, whenever you update your robot it will be gone.

/home, /var, and /data are on a separate partition that does not get overwritten during update. That means that the user unit search path /home/.config/systemd/user/ is a good place to put custom systemd units - try there?
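
To see which mount points are actually writable before choosing a location, a check along these lines may help. This is a minimal sketch: `is_read_only` is a hypothetical helper, not something shipped on the OT-2, and it assumes the usual `/proc/mounts` layout (device, mount point, filesystem type, comma-separated options).

```shell
#!/bin/sh
# is_read_only MOUNTPOINT [MOUNTS_FILE]
# Succeeds (exit 0) if MOUNTPOINT is listed with the "ro" option.
# MOUNTS_FILE defaults to /proc/mounts; the parameter exists so the
# function can be tried against a saved copy.
is_read_only() {
    mounts_file="${2:-/proc/mounts}"
    awk -v mp="$1" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") found = 1
        }
        END { exit found ? 0 : 1 }
    ' "$mounts_file"
}
```

For example, `is_read_only / && echo "root is read-only"` reports whether the root filesystem is mounted read-only. A path that is not itself a mount point (such as a subdirectory) is not matched, so check the mount point it lives on.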

arogozhnikov commented 1 year ago

@sfoster1

Hi Seth,

That means that the user unit search path /home/.config/systemd/user/ is a good place to put custom systemd units - try there?

That's what I get:

mkdir: can't create directory '/home/.config/': Read-only file system

so I assume that's not going to work.

Now, I've tried setting up the way you described in #106

  • A systemd service unit
  • A directory /var/home/.config/systemd/user/opentrons.target.wants that includes a symlink to your service

and couldn't make systemctl see the service:

~ # cat /root/tunnel.service
[Unit]
Description=Static tunnel to ssh into machine
After=basic.target

[Service]
Type=exec
ExecStart=/root/cloudflared tunnel --config /root/tunnel_conf.yaml --protocol=quic  run

[Install]
WantedBy=opentrons.target
~ # ls /var/home/.config/systemd/user/opentrons.target.wants -lah
total 2
drwxr-xr-x    2 root     root        1.0K Dec  1 23:19 .
drwxr-xr-x    3 root     root        1.0K Dec  1 23:18 ..
lrwxrwxrwx    1 root     root          20 Dec  1 23:19 tunnel.service -> /root/tunnel.service
systemctl daemon-reload
# this command returns nothing, and I see nothing relevant when listing either
systemctl list-units --type=service --all | grep tunnel

arogozhnikov commented 1 year ago

Remark: I'm open to using any option (i.e. not necessarily systemd) that can launch processes on boot as daemons.

sfoster1 commented 1 year ago

remark: open to using any option (i.e. not necessary systemd) that can launch process on boot as daemons

Oh! In that case, we support boot scripts run with run-parts. Drop an executable shell script named NN-some-ascii-text where NN is a number (this is mostly a convention - the only rules are that it has to be ascii letters, numbers, or - and _) and it'll get run at boot:

# cat /var/data/boot.d/00-demo 
echo "my service ran"
touch /var/data/my-service-ran
# ls -l /var/data/boot.d/00-demo 
-rwxr-xr-x    1 root     root            53 Dec  2 14:13 /var/data/boot.d/00-demo
# reboot
# ls -l /var/data/
total 310
(other results removed for clarity)
-rw-r--r--    1 root     root             0 Dec  2 14:14 my-service-ran

# journalctl -u opentrons-run-boot-scripts --no-pager
-- Logs begin at Fri 2018-06-22 11:11:49 UTC, end at Fri 2022-12-02 14:16:05 UTC. --
-- Reboot --
Dec 02 14:14:21 opentrons run-parts[162]: my service ran
Dec 02 14:14:21 opentrons systemd[1]: Starting Opentrons: Run user-supplied boot scripts...
Dec 02 14:14:21 opentrons systemd[1]: Started Opentrons: Run user-supplied boot scripts.
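
The naming rule described above (ASCII letters, digits, `-` and `_` only) can be checked before dropping a file into boot.d. This is a rough sketch; `valid_boot_script_name` is a hypothetical helper, and the exact rules enforced by the run-parts build on the robot are an assumption based on the description in this comment.

```shell
#!/bin/sh
# valid_boot_script_name NAME
# Succeeds if NAME is non-empty and contains only ASCII letters,
# digits, '-' and '_'. A name such as "demo.sh" fails because of the dot.
valid_boot_script_name() {
    case "$1" in
        '' | *[!A-Za-z0-9_-]*) return 1 ;;
        *) return 0 ;;
    esac
}
```

For example, `valid_boot_script_name 00-demo` succeeds, while `valid_boot_script_name 00-demo.sh` fails.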

sfoster1 commented 1 year ago

One thing to keep in mind is that run-parts scripts are unfortunately a lot less configurable than systemd services. You might know how to do this stuff better than me, but those scripts all want to execute like a systemd oneshot service - the script runs once and then exits. That means you really need cloudflared to daemonize (fork and abandon its parent) when called on the commandline - there might be a -d,--daemonize command line flag, or maybe just the absence of a --foreground flag or something, I'm not familiar with cloudflared and can't find a good reference for its command line params.

arogozhnikov commented 1 year ago

@sfoster1 nice, does run-parts just assume these files are shell scripts?

arogozhnikov commented 1 year ago

asking because there is no shebang in your example

sfoster1 commented 1 year ago

Ah, yes it does. It runs them through the shell.

arogozhnikov commented 1 year ago

@sfoster1 likely I'm doing something wrong, but the service isn't started during reboot:

Location:

~ # ls /var/data/boot.d/00-cftunnel -lah
-rw-r--r--    1 root     root         375 Dec  2 17:27 /var/data/boot.d/00-cftunnel

In log, nothing shows it was found or called:

Dec 15 18:31:03 opentrons ot-commit-machine-id[164]: machine-id "05a2d52f19ca460a9f87f944c6532461" already committed. Exiting without doing anything.
Dec 15 18:31:02 opentrons systemd[1]: Starting Jupyter notebook server...
Dec 15 18:31:02 opentrons systemd[1]: Starting Opentrons: Run user-supplied boot scripts...
Dec 15 18:31:02 opentrons systemd[1]: Starting Network Connectivity...
Dec 15 18:31:02 opentrons systemd[1]: Starting Opentrons: Ensure system wired connections...
Dec 15 18:31:02 opentrons systemd[1]: Starting Rerun udev for block devices...
Dec 15 18:31:02 opentrons systemd[1]: Started D-Bus System Message Bus.

Contents of file:

# cat /var/data/boot.d/00-cftunnel

echo "starting cloudflared tunnel"
echo -n $(date -u) >> /data/tunnel.log
echo "starting cloudflared tunnel" >> /root/tunnel.log
tmux kill-session -t ot-tunnel-session || (echo 'no tmux session to stop' >> /root/tunnel.log)
<actual cloudflared command goes here>

Update: Command that you suggested:

-- Reboot --
Dec 15 18:31:02 opentrons systemd[1]: Starting Opentrons: Run user-supplied boot scripts...
Dec 15 18:31:02 opentrons systemd[1]: Started Opentrons: Run user-supplied boot scripts.

sfoster1 commented 1 year ago

@arogozhnikov Mark it executable: chmod u+x /var/data/boot.d/00-cftunnel
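
A quick way to spot this kind of problem is to list any regular files in the directory that lack the executable bit. This is only a sketch; `list_non_executable` is a hypothetical helper name.

```shell
#!/bin/sh
# list_non_executable DIR
# Prints the path of every regular file in DIR that the current user
# cannot execute, one per line. Prints nothing if all are executable.
list_non_executable() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue          # skip directories and unmatched globs
        [ -x "$f" ] || printf '%s\n' "$f"
    done
}
```

Running `list_non_executable /var/data/boot.d` after copying scripts over should print nothing once every script has been marked executable.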

arogozhnikov commented 1 year ago

@sfoster1 I think I've tried everything and cloudflared just can't run at this point in the boot process. I am not 100% sure, but here is what I have:

  1. /var/data/boot.d/00-cftunnel runs at startup
  2. I place several commands inside, and they run
  3. I additionally place a simple echo wrapped in tmux to verify that tmux server can be started from boot.d
  4. if I just source /var/data/boot.d/00-cftunnel, tunnel is started normally.

I do not see any logs or errors from cloudflared. Adding sleep 60 before running cloudflared did not help either

Any other ideas?

sfoster1 commented 1 year ago

Well huh. A lot of my ideas are broken by cloudflared working fine if you source /var/data/boot.d/00-cftunnel. I assume you're doing something like their setup docs with a config file somewhere on the OT-2 filesystem that you're passing the path to in /var/data/boot.d/00-cftunnel, right?

Where is that config file on the OT-2 filesystem? I wonder if there's some problem like that part of the filesystem not being mounted at the time you run 00-cftunnel. And putting sleep 60 in there wouldn't necessarily fix it, because run-parts and the systemd unit it's in would just see that as the script taking a long time and delay starting whatever depends on it.

Where on the OT-2 filesystem did you put the cftunnel binary+supporting solibs and config file?

arogozhnikov commented 1 year ago

I place everything (binary, config, logs) right under /root

/root/cloudflared tunnel --config /root/tunnel_conf.yaml --protocol=quic --logfile /root/tunnel.log run > /root/tunnel_last_start.log 2>&1

sfoster1 commented 1 year ago

And then there's nothing in /root/tunnel.log or /root/tunnel_last_start.log when you ssh in after boot, right?

I'm really not sure what in the world is going wrong but one thing we could try is your idea to wait some time before starting the service, but do it in a fork'd child of the run-parts script. What you'd want to do is the following:

  1. Create a new script somewhere, let's say /var/data/boot.d/cftunnel-worker, with chmod +x and a bash shebang, and put basically everything that's currently in 00-cftunnel in there, including an initial 60-second sleep
  2. Make 00-cftunnel only do the following: nohup /var/data/boot.d/cftunnel-worker 0<&- &>/dev/null & This should do something similar to daemonize(1), which is not available on the OT-2. Ignore this next part if you already know what it means, but it basically creates a child process and then severs the child process's connection to the parent process so the child can run forever in the background.

So if the problem we're facing is (1) system resources aren't ready enough at the time run-parts runs so cftunnel can't start and (2) doing a sleep 60 in the run-parts script just means that systemd delays bringing up those parts of the system until the script is done, this should solve it by avoiding (2).
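
The launcher half of this can be sketched as below, with two deliberate differences from the one-liner, both assumptions on my part. First, `&>` is a bash extension; a plain POSIX sh would read `cmd &>/dev/null` as "run cmd in the background, then truncate /dev/null", so the sketch uses the portable `>/dev/null 2>&1`. Second, run-parts presumably executes every suitably named executable in boot.d, so the worker is kept outside that directory at a hypothetical path, `/var/data/cftunnel-worker`. `launch_detached` is also a hypothetical name.

```shell
#!/bin/sh
# launch_detached CMD [ARGS...]
# Starts CMD in the background, immune to hangups, with stdin, stdout
# and stderr detached, and returns immediately without waiting for it.
launch_detached() {
    nohup "$@" </dev/null >/dev/null 2>&1 &
}

# Hypothetical worker location, kept outside boot.d so that run-parts
# does not also run it directly. The worker itself would start with
# its own delay (e.g. sleep 60) before launching cloudflared.
WORKER="${WORKER:-/var/data/cftunnel-worker}"

if [ -x "$WORKER" ]; then
    launch_detached "$WORKER"
fi
```

Because the launcher returns right away, the systemd unit that wraps run-parts finishes on schedule, and the delay happens only inside the detached worker.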

arogozhnikov commented 1 year ago

The problem is not that /root isn't mounted; I assume it is something network-related. There is probably also some issue with tmux and cloudflared being used together.

Your solution (nohup + delay) seems to work. I need more tests to be sure, but at least it restarted successfully twice.

Delay is critical, otherwise I get this in logs:

{"level":"warn","error":"Group ID 0 is not between ping group 1 to 0","time":"2023-06-14T21:12:47Z","message":"The user running cloudflared process has a GID (group ID) that is not within ping_group_range. You might need to add that user to a group within that range, or instead update the range to encompass a group the user is already in by modifying /proc/sys/net/ipv4/ping_group_range. Otherwise cloudflared will not be able to ping this network"}
{"level":"warn","error":"cannot create ICMPv4 proxy: Group ID 0 is not between ping group 1 to 0 nor ICMPv6 proxy: socket: permission denied","time":"2023-06-14T21:12:47Z","message":"ICMP proxy feature is disabled"}
{"level":"error","event":0,"error":"lookup _v2-origintunneld._tcp.argotunnel.com on [2001:4860:4860::8888]:53: dial udp [2001:4860:4860::8888]:53: connect: cannot assign requested address","time":"2023-06-14T21:12:47Z","message":"edge discovery: error looking up Cloudflare edge IPs: the DNS query failed"}
{"level":"error","event":0,"time":"2023-06-14T21:12:47Z","message":"Please try the following things to diagnose this issue:"}
{"level":"error","event":0,"time":"2023-06-14T21:12:47Z","message":"  1. ensure that argotunnel.com is returning \"origintunneld\" service records."}
{"level":"error","event":0,"time":"2023-06-14T21:12:47Z","message":"     Run your system's equivalent of: dig srv _origintunneld._tcp.argotunnel.com"}
{"level":"error","event":0,"time":"2023-06-14T21:12:47Z","message":"  2. ensure that your DNS resolver is not returning compressed SRV records."}
{"level":"error","event":0,"time":"2023-06-14T21:12:47Z","message":"     See GitHub issue https://github.com/golang/go/issues/27546"}
{"level":"error","event":0,"time":"2023-06-14T21:12:47Z","message":"     For example, you could use Cloudflare's 1.1.1.1 as your resolver:"}
{"level":"error","event":0,"time":"2023-06-14T21:12:47Z","message":"     https://developers.cloudflare.com/1.1.1.1/setting-up-1.1.1.1/"}
{"level":"info","time":"2023-06-14T21:12:47Z","message":"ICMP proxy will use 0.0.0.0 as source for IPv4"}
{"level":"info","time":"2023-06-14T21:12:47Z","message":"ICMP proxy will use :: as source for IPv6"}

sfoster1 commented 1 year ago

Ah, I guess it's not designed to handle "I'm not currently network-connected" or something. Well, I'm glad the nohup plus delay works! Let me know if something fails in those further tests - I'll leave this open for another couple days.
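
If the fixed delay turns out to be fragile, one alternative inside the worker script is to poll for a condition rather than sleep for a set time. This is only a sketch: `wait_until` is a hypothetical helper, and the commented usage assumes an `nslookup` binary is available on the robot, which I have not verified.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS CMD [ARGS...]
# Runs CMD about once per second until it succeeds. Returns 0 as soon
# as CMD succeeds, or 1 if it is still failing after TIMEOUT_SECONDS.
wait_until() {
    timeout="$1"
    shift
    elapsed=0
    until "$@"; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Possible use in the worker (assumes nslookup exists on the OT-2):
#   wait_until 120 nslookup argotunnel.com >/dev/null 2>&1 &&
#       /root/cloudflared tunnel --config /root/tunnel_conf.yaml --protocol=quic run
```

Polling for DNS resolution targets the failure seen in the log above (the edge discovery lookup failing), but whether that is the only thing cloudflared needs at startup is an open question.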