Jamesits / docker-ripe-atlas

This is the RIPE Atlas software probe packaged as a Docker image.
https://hub.docker.com/r/jamesits/ripe-atlas
GNU General Public License v3.0

Support version 5090 #32

Closed: Dreista closed this 3 weeks ago

Dreista commented 3 weeks ago

There are some big changes made to RIPE-NCC/ripe-atlas-software-probe. I guess we probably need to add some migration notes? Fixes #31.

brycied00d commented 3 weeks ago

Thanks for the quick fix, @Dreista!

I notice that the volume mounts/paths changed with 5090 -- Does anything need to be done for migration, in order to preserve the existing probe registration?

Edit: Oh and follow-up question, if this is a "breaking change", I think @Jamesits should probably tag it something different besides "latest" or a lot of people running watchtower (on the guidance of README.md) are suddenly going to break in the next day here.

Edit 2: This is what I get for reading first thing in the morning -- you DID mention the migration topic in the PR, I just overlooked it. I think @Jamesits might have missed it too before merging.

brycied00d commented 3 weeks ago

I can confirm that the probe does break for users running watchtower and following the "latest" tag:

ripe-atlas  | RESULT 9006 done 1724349825 d225d5671cc8 no reginit.vol start registration
ripe-atlas  | /run/ripe-atlas/status/reginit.vol does not exist try new reg
ripe-atlas  | Ping works
ripe-atlas  | start reg
ripe-atlas  | ATLAS registration starting
ripe-atlas  | REASON_FOR_REGISTRATION NEW NO previous state files
ripe-atlas  | REGHOSTS reg03.atlas.ripe.net 193.0.19.246 2001:67c:2e8:11::c100:13f6 reg04.atlas.ripe.net 193.0.19.247 2001:67c:2e8:11::c100:13f7
ripe-atlas  | ssh -p 443 atlas@193.0.19.247 INIT
ripe-atlas  | 255  REGINIT exit with error

Confirmed, a brand new probe_key was generated -- ripe-atlas itself made no migration attempt from the old paths to the new.

/etc/ripe-atlas:
total 32
drwxrwx---  2 ripe-atlas ripe-atlas    7 Aug 22 18:03 .
drwxr-xr-x 34 root       root         79 Aug 22 18:03 ..
-rw-r--r--  1 ripe-atlas ripe-atlas   37 Aug 22 18:03 config.txt
-rw-r--r--  1 ripe-atlas ripe-atlas    5 Aug 22 18:03 mode
-rw-------  1 ripe-atlas ripe-atlas 2590 Aug 22 18:03 probe_key
-rw-r--r--  1 ripe-atlas ripe-atlas  560 Aug 22 18:03 probe_key.pub
-rw-r--r--  1 ripe-atlas ripe-atlas  188 Aug 22 18:03 reg_servers.sh

PS: Looks like README.md and docker-compose.yaml also need updating to reference the new paths.

brycied00d commented 3 weeks ago

I think more testing is needed before promoting ripe-atlas probe 5090 to the "latest" tag

I don't know if this is related to the migration (effectively I did mv /var/atlas-probe/etc /etc/ripe-atlas and mv /var/atlas-probe/status /run/ripe-atlas/status), or if it's an issue with 5090 running with the Debian 11 base, but my probes are having a hard time staying "connected" (according to RIPE) for more than about a minute at a time.

For example, see https://atlas.ripe.net/probes/1002242/#tab-network (screenshot: screenshot_2024-08-22_11-45-19).

It seems the SSH connection keeps terminating...

ripe-atlas  | no ssh client matching . cleanup state files. for next restart
ripe-atlas  | RESULT 9006 done 1724351290 d225d5671cc8 no reginit.vol start registration
ripe-atlas  | /run/ripe-atlas/status/reginit.vol does not exist try new reg
ripe-atlas  | Ping works
ripe-atlas  | start reg
ripe-atlas  | ATLAS registration starting
ripe-atlas  | registration info is still valid till 1724353887, now 1724351293
ripe-atlas  | check cached controller info from previous registration
ripe-atlas  | Use cached controller info -R 53391 atlas@ctr-hel08.atlas.ripe.net
ripe-atlas  | initiating  KEEP connection to -R 53391 -p  443 ctr-hel08.atlas.ripe.net
ripe-atlas  | condmv: not moving, destination '/var/spool/ripe-atlas/data/out/v6addr.txt' exists
ripe-atlas  | condmv: not moving, destination '/var/spool/ripe-atlas/data/out/simpleping' exists
ripe-atlas  | no ssh client matching . cleanup state files. for next restart
ripe-atlas  | RESULT 9006 done 1724351476 d225d5671cc8 no reginit.vol start registration
ripe-atlas  | /run/ripe-atlas/status/reginit.vol does not exist try new reg
ripe-atlas  | Ping works
ripe-atlas  | start reg
ripe-atlas  | ATLAS registration starting
ripe-atlas  | registration info is still valid till 1724353887, now 1724351480
ripe-atlas  | check cached controller info from previous registration
ripe-atlas  | Use cached controller info -R 53391 atlas@ctr-hel08.atlas.ripe.net
ripe-atlas  | initiating  KEEP connection to -R 53391 -p  443 ctr-hel08.atlas.ripe.net
ripe-atlas  | condmv: not moving, destination '/var/spool/ripe-atlas/data/out/v6addr.txt' exists
ripe-atlas  | condmv: not moving, destination '/var/spool/ripe-atlas/data/out/simpleping' exists
ripe-atlas  | no ssh client matching . cleanup state files. for next restart
ripe-atlas  | RESULT 9006 done 1724351660 d225d5671cc8 no reginit.vol start registration
ripe-atlas  | /run/ripe-atlas/status/reginit.vol does not exist try new reg
ripe-atlas  | Ping works
ripe-atlas  | start reg
ripe-atlas  | ATLAS registration starting
ripe-atlas  | registration info is still valid till 1724353887, now 1724351664
ripe-atlas  | check cached controller info from previous registration
ripe-atlas  | Use cached controller info -R 53391 atlas@ctr-hel08.atlas.ripe.net
ripe-atlas  | initiating  KEEP connection to -R 53391 -p  443 ctr-hel08.atlas.ripe.net
ripe-atlas  | condmv: not moving, destination '/var/spool/ripe-atlas/data/out/v6addr.txt' exists
ripe-atlas  | condmv: not moving, destination '/var/spool/ripe-atlas/data/out/simpleping' exists
ripe-atlas  | no ssh client matching . cleanup state files. for next restart
ripe-atlas  | RESULT 9006 done 1724351845 d225d5671cc8 no reginit.vol start registration
ripe-atlas  | /run/ripe-atlas/status/reginit.vol does not exist try new reg
ripe-atlas  | Ping works
ripe-atlas  | start reg
ripe-atlas  | ATLAS registration starting
ripe-atlas  | registration info is still valid till 1724353887, now 1724351848
ripe-atlas  | check cached controller info from previous registration
ripe-atlas  | Use cached controller info -R 53391 atlas@ctr-hel08.atlas.ripe.net
ripe-atlas  | initiating  KEEP connection to -R 53391 -p  443 ctr-hel08.atlas.ripe.net
ripe-atlas  | condmv: not moving, destination '/var/spool/ripe-atlas/data/out/v6addr.txt' exists
ripe-atlas  | condmv: not moving, destination '/var/spool/ripe-atlas/data/out/simpleping' exists
ripe-atlas  | no ssh client matching . cleanup state files. for next restart
ripe-atlas  | RESULT 9006 done 1724352029 d225d5671cc8 no reginit.vol start registration
ripe-atlas  | /run/ripe-atlas/status/reginit.vol does not exist try new reg
ripe-atlas  | Ping works
ripe-atlas  | start reg
ripe-atlas  | ATLAS registration starting
ripe-atlas  | registration info is still valid till 1724353887, now 1724352033
ripe-atlas  | check cached controller info from previous registration
ripe-atlas  | Use cached controller info -R 53391 atlas@ctr-hel08.atlas.ripe.net
ripe-atlas  | initiating  KEEP connection to -R 53391 -p  443 ctr-hel08.atlas.ripe.net
ripe-atlas  | condmv: not moving, destination '/var/spool/ripe-atlas/data/out/v6addr.txt' exists
ripe-atlas  | condmv: not moving, destination '/var/spool/ripe-atlas/data/out/simpleping' exists
ripe-atlas  | no ssh client matching . cleanup state files. for next restart
ripe-atlas  | RESULT 9006 done 1724352213 d225d5671cc8 no reginit.vol start registration
ripe-atlas  | /run/ripe-atlas/status/reginit.vol does not exist try new reg
ripe-atlas  | Ping works
ripe-atlas  | start reg
ripe-atlas  | ATLAS registration starting
ripe-atlas  | registration info is still valid till 1724353887, now 1724352217
ripe-atlas  | check cached controller info from previous registration
ripe-atlas  | Use cached controller info -R 53391 atlas@ctr-hel08.atlas.ripe.net
ripe-atlas  | initiating  KEEP connection to -R 53391 -p  443 ctr-hel08.atlas.ripe.net
ripe-atlas  | condmv: not moving, destination '/var/spool/ripe-atlas/data/out/v6addr.txt' exists
ripe-atlas  | condmv: not moving, destination '/var/spool/ripe-atlas/data/out/simpleping' exists
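A rough way to quantify the flapping in the log above: assuming the number after `done` in each `RESULT 9006` line is a Unix timestamp (it lines up with the `now` values printed a few seconds later), the gap between consecutive registration attempts can be computed with a small sketch. The function name `restart_intervals` is made up for illustration and is not part of the probe software.

```python
import re
from typing import List

# Matches lines such as:
#   ripe-atlas  | RESULT 9006 done 1724351290 d225d5671cc8 no reginit.vol start registration
# Assumption: the captured number is a Unix timestamp.
RESULT_RE = re.compile(r"RESULT 9006 done (\d+)\b")


def restart_intervals(log_text: str) -> List[int]:
    """Return the seconds between consecutive 'RESULT 9006 done <ts>' lines."""
    stamps = [int(m.group(1)) for m in RESULT_RE.finditer(log_text)]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# First three RESULT lines copied from the log above.
sample = """\
ripe-atlas  | RESULT 9006 done 1724351290 d225d5671cc8 no reginit.vol start registration
ripe-atlas  | RESULT 9006 done 1724351476 d225d5671cc8 no reginit.vol start registration
ripe-atlas  | RESULT 9006 done 1724351660 d225d5671cc8 no reginit.vol start registration
"""

print(restart_intervals(sample))  # prints [186, 184]
```

Run over the full log, the gaps all fall between 184 and 186 seconds, so the connection is being torn down and re-registered roughly every three minutes.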

And ssh_err.txt shows an error indicating that no known_hosts entry was matched.

$ cat /run/ripe-atlas/status/ssh_err.txt
RSA host key for IP address '2a01:4f9:4a:454e::2' not in list of known hosts.
client_loop: send disconnect: Broken pipe

The host being referenced (ctr-hel08.atlas.ripe.net) definitely appears in the known_hosts file that RIPE's using, though not the bare IPv6 address.

$ grep ctr-hel08.atlas.ripe.net /run/ripe-atlas/status/known_hosts
ctr-hel08.atlas.ripe.net ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCZ2YRfPRQwhVrDJRM+uYGUokSvQITY5ueJPrQDWMWfJz7v9XU+2RrcWKyLkp/ppsfny3UTmhdQQZz5PswXx/YTvw0Ad86woTkhuULp8pD8qwVMCvWNp/Aj/6/fvbwiwIjQd0+ThgGVA43gelJSld/+xOH8h4VHXGQt/BP8IoB6IwHB7of1PVqwTu/tqbBM0N25bqOfR/Zviz2O6T7SXfJOrtgyKAwTVfEiNcgEaW+abRYVAbsNRfwV2VlMWRmssT/XF51crK491t0X72SVsHcpL6TQ/KeXyIkmzu3pXq+/wGr/OyB5FKuL3KOFSksoY5lqZO/4h9VqTe8s9fGOMkRR
ipv4.ctr-hel08.atlas.ripe.net ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCZ2YRfPRQwhVrDJRM+uYGUokSvQITY5ueJPrQDWMWfJz7v9XU+2RrcWKyLkp/ppsfny3UTmhdQQZz5PswXx/YTvw0Ad86woTkhuULp8pD8qwVMCvWNp/Aj/6/fvbwiwIjQd0+ThgGVA43gelJSld/+xOH8h4VHXGQt/BP8IoB6IwHB7of1PVqwTu/tqbBM0N25bqOfR/Zviz2O6T7SXfJOrtgyKAwTVfEiNcgEaW+abRYVAbsNRfwV2VlMWRmssT/XF51crK491t0X72SVsHcpL6TQ/KeXyIkmzu3pXq+/wGr/OyB5FKuL3KOFSksoY5lqZO/4h9VqTe8s9fGOMkRR
ipv6.ctr-hel08.atlas.ripe.net ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCZ2YRfPRQwhVrDJRM+uYGUokSvQITY5ueJPrQDWMWfJz7v9XU+2RrcWKyLkp/ppsfny3UTmhdQQZz5PswXx/YTvw0Ad86woTkhuULp8pD8qwVMCvWNp/Aj/6/fvbwiwIjQd0+ThgGVA43gelJSld/+xOH8h4VHXGQt/BP8IoB6IwHB7of1PVqwTu/tqbBM0N25bqOfR/Zviz2O6T7SXfJOrtgyKAwTVfEiNcgEaW+abRYVAbsNRfwV2VlMWRmssT/XF51crK491t0X72SVsHcpL6TQ/KeXyIkmzu3pXq+/wGr/OyB5FKuL3KOFSksoY5lqZO/4h9VqTe8s9fGOMkRR
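The same check can be sketched in code. This is a minimal illustration, assuming plain (unhashed) `known_hosts` entries where the first field is a comma-separated list of names; it does exact string comparison only and ignores wildcards, hashed names, and `[host]:port` entries that real OpenSSH also handles. The helper names `known_hosts_names` and `is_listed` are hypothetical, and the key material in the sample is truncated for the example.

```python
from typing import List


def known_hosts_names(known_hosts_text: str) -> List[str]:
    """Collect the plain host names listed in known_hosts text."""
    names: List[str] = []
    for line in known_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split()
        if fields[0].startswith("@"):
            fields = fields[1:]  # drop a marker such as @cert-authority or @revoked
        if not fields or fields[0].startswith("|"):
            continue  # hashed entries cannot be compared as plain text
        names.extend(fields[0].split(","))
    return names


def is_listed(name: str, known_hosts_text: str) -> bool:
    """True if `name` appears verbatim as a host name in the file."""
    return name in known_hosts_names(known_hosts_text)


# Host names taken from the grep output above; keys truncated for the example.
SAMPLE = """\
ctr-hel08.atlas.ripe.net ssh-rsa AAAAB3NzaC1yc2E...
ipv4.ctr-hel08.atlas.ripe.net ssh-rsa AAAAB3NzaC1yc2E...
ipv6.ctr-hel08.atlas.ripe.net ssh-rsa AAAAB3NzaC1yc2E...
"""

print(is_listed("ctr-hel08.atlas.ripe.net", SAMPLE))  # prints True
print(is_listed("2a01:4f9:4a:454e::2", SAMPLE))       # prints False
```

This matches what `ssh_err.txt` reports: the controller is listed by name, but there is no entry for the bare IPv6 address.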

So my thinking is that the version of OpenSSH used in this container (OpenSSH_8.4p1 Debian-5+deb11u3, OpenSSL 1.1.1w 11 Sep 2023) doesn't "like" the format of the known_hosts file that RIPE is distributing; I know there have been some minor changes to that format over the years, such as optional hashing of host names.

What if this image was bumped to Debian 12? Or should this be raised upstream?

In any case, I think promoting 5090 to "latest" was premature.

Dreista commented 3 weeks ago

Weird, I haven't experienced any disconnections myself, and your probe (1002242) has now stayed connected for 1h 24m. I'm not sure what's going on. Did you change anything?

And yes, I will definitely update the forgotten config examples.

brycied00d commented 3 weeks ago

I downgraded my probe (1002242) to the last build with probe 5080 (sha256:5edf3e9a47d0f28ad743b9dd879813ca75eace423e87f9fd28d4139689031e2c) and it's been completely stable. I do recognize that RIPE's page says it's still running 5090 and I can offer no explanation as to that.

I left another probe of mine (1002591) running the "latest" tag and it's continuing to flap: https://atlas.ripe.net/probes/1002591/#tab-network

bjo81 commented 3 weeks ago

My probe 1003108 got auto-updated via podman 8 hours ago, and I got an email saying it's disconnected.

Dreista commented 3 weeks ago

The images on Docker Hub have been reverted to 5080 as a temporary fix while we investigate the disconnection issue in 5090.

Also, RIPE decided to change the directory structure in version 5090, so unless we do something to keep backward compatibility, probes are expected to break when the container gets auto-updated: the probe software won't be able to find the private and public keys in the new directory, and users have to mount these keys at the new location manually.
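For anyone who wants to carry an existing registration over by hand, here is a minimal sketch of the move described earlier in this thread (`/var/atlas-probe/etc` to `/etc/ripe-atlas`, `/var/atlas-probe/status` to `/run/ripe-atlas/status`). It is not an official migration tool: the path mapping is only what was reported above, the function name `migrate_probe_state` is made up, and it copies rather than moves so the old data stays in place. `shutil.copytree` keeps permission bits but does not change file ownership, so a `chown` to the probe user may still be needed afterwards.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict

# Old -> new locations (relative to a root), as reported in this thread.
# Assumption: verify these against your own volume mounts before relying on them.
PATH_MAP = {
    "var/atlas-probe/etc": "etc/ripe-atlas",
    "var/atlas-probe/status": "run/ripe-atlas/status",
}


def migrate_probe_state(root: Path, path_map: Dict[str, str] = PATH_MAP) -> Dict[str, str]:
    """Copy old-layout directories under `root` to their new-layout locations.

    Returns a mapping of each old path to "copied", "missing", or "skipped".
    """
    results: Dict[str, str] = {}
    for old_rel, new_rel in path_map.items():
        old, new = root / old_rel, root / new_rel
        if not old.is_dir():
            results[old_rel] = "missing"
        elif new.exists() and (not new.is_dir() or any(new.iterdir())):
            # Never overwrite keys or state that already exist at the new path.
            results[old_rel] = "skipped"
        else:
            shutil.copytree(old, new, dirs_exist_ok=True)
            results[old_rel] = "copied"
    return results


# Demonstration inside a throwaway directory, so nothing on the host is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "var/atlas-probe/etc").mkdir(parents=True)
    (demo_root / "var/atlas-probe/etc/probe_key").write_text("dummy key")
    print(migrate_probe_state(demo_root))
```

Skipping non-empty destinations is deliberate: as noted above, 5090 generates a brand new probe_key when it finds none, so a container that has already started once may hold a new key that should be dealt with by hand rather than silently overwritten.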

brycied00d commented 3 weeks ago

Thanks @Dreista -- Confirmed that jamesits/ripe-atlas@sha256:44e05a1c3a14d7f8f0d283d217aa5fd73f0ff3fdee2b9c4f28abe655b125bef7 (the "latest" pushed 13 hours ago as of writing) with 5080 is connected and stable. Please feel free to ping me if you need testing of 5090-based containers.