Closed gibmat closed 5 months ago
Thanks, taking a look.
update 1: Cannot reproduce on Ubuntu 22.04 host with 6.2 kernel running the test in an LXD sid container nor in a bookworm container.
Aha, I can reproduce if I block my container from accessing the outside world. my container isn't assigned an IPv4 address (which is fortunately an issue I ran into by accident, but allows to reproduce this problem). If I disable my host firewall my LXD container is assigned an IPv4 address and the tests succeed. Not sure if this gives you something useful to work with ... don't know yet if this is a problem on our side.
Ah, that's interesting, and reflects my setup. The containers I spin up for packaging work are on an IPv6-only network segment, although they do have both IPv4 and IPv6 addresses setup on the loopback interface. My bookworm machine has both addresses, and the QEMU VM only has IPv4 due to the simple NAT I configured.
I haven't looked into how the tests are trying to bind to addresses -- are they assuming there's an IPv4 address they can use on some "real" (non-loopback) interface?
@gibmat Thanks for the report. Just for some context -- which release of libraft are you upgrading from, and do the same tests pass in the same containerized environment when run against that earlier version?
It's entirely possible that we have some IPv4-related implicit assumptions in our networking code; I'm looking into what the specific problem might be.
So at least some of the failures here are caused by failing to resolve 127.0.0.1:9001
, which is hardcoded for the tests here:
https://github.com/canonical/raft/blob/b68076fdc7d34a901cc2eb5eb480832c709b2532/test/lib/uv.h#L51
We could conceivably make it possible to run the tests using a different address/port.
(At least, I was able to reproduce the same pattern of test failures by doing ip addr del
on the assigned IPv4 address inside a LXD container and running the test suite there; and I traced some of the test failures in that case to failing to a failed call to getaddrinfo against 127.0.0.1:9001
inside uvIpResolveBindAddresses.)
More generally, the code that implements the libuv TCP transport for raft is not IPv6-capable right now.
The reason that 127.0.0.1
can't be resolved here despite the existence of the loopback interface is that we pass the AI_ADDRCONFIG
flag to getaddrinfo.
When trying to build the raft 0.18.0 for Fedora 37 or 38 via namespaced mock buildroot I'm facing the same behaviour as @gibmat. The tests remain stuck at:
address=localhost:9000, bind-address=localhost:9000 [ ERROR ]
Error: test/integration/test_uv_tcp_listen.c:211: assertion failed: rv == 0 (18 == 0)
Error: child killed by signal 6 (Aborted)
address=localhost:9000, bind-address=:9000 [ ERROR ]
Error: test/integration/test_uv_tcp_listen.c:211: assertion failed: rv == 0 (18 == 0)
Error: child killed by signal 6 (Aborted)
address=localhost:9000, bind-address=0.0.0.0:9000 [ ERROR ]
Error: test/integration/test_uv_tcp_listen.c:211: assertion failed: rv == 0 (18 == 0)
Error: child killed by signal 6 (Aborted)
tcp_connect/closeDuringSecondConnect [ OK ] [ 0.00023986 / 0.00029068 CPU ]
tcp_connect/closeDuringConnectAbort
The build environment only has a loopback device:
<mock-chroot> sh-5.2# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
The exact same build works fine for raft 0.17.1.
Thanks for the report @ganto. I have an in-progress PR to rework how we call getaddrinfo in uv_ip.c that will hopefully enable running the test suite smoothly when there is no non-loopback IPv4 address available.
The exact same build works fine for raft 0.17.1.
Can you confirm this please, because I can reproduce this behavior from 0.16.0
(incl.) onwards.
Summary of where I think the problems are and how to approach fixing them:
There are two calls to getaddrinfo in the main codebase (note that uv_getaddrinfo just calls getaddrinfo on the thread pool):
https://github.com/canonical/raft/blob/b68076fdc7d34a901cc2eb5eb480832c709b2532/src/uv_ip.c#L57-L85 https://github.com/canonical/raft/blob/b68076fdc7d34a901cc2eb5eb480832c709b2532/src/uv_tcp_connect.c#L293-L294
The first one is used to implement UvTcpListen, the second to implement UvTcpConnect.
AI_ADDRCONFIG
flag. The upshot of this is that on a system with no configured non-loopback IPv4 addresses, no IPv4 addresses will be returned from getaddrinfo. Because we pass AF_INET to getaddrinfo, no addresses will be returned at all in this case. This explains many of the test failures.AI_ADDRCONFIG
from the getaddrinfo calls, because I don't fully understand the implications for downstream. Based on the man page it sounds like without this flag "bogus" addrinfos will be returned in some cases, and then raft would fail (fatally) trying to bind to those addresses.[::1]:9001
(etc.) if that's appropriate. We already have parsing code in dqlite that can be adapted for this.getaddrinfo("::1", "9001", AF_UNSPEC)
will return both IPv4 and IPv6 addresses that conflict (EADDRINUSE), and on the server side we try to bind to every returned addrinfo. (Which is a design decision that we could revisit.)[addr]:port
. This is backwards-compatible in the sense that both hostnames and IPv4 address strings passed into raft will behave just like they did before. We can continue using AI_V4MAPPED which I think should help with binding/connecting to IPv6 addresses on systems that only have IPv4 set up.getaddrinfo(NULL)
. We could continue passing AF_INET in this case, which would be backwards-compatible, but then some of our tests will continue to fail. Alternatively, we could use AF_UNSPEC, and then we'd have to work out some alternative to binding to every single returned addrinfo. (Special-casing this makes some sense to me anyway.) I have not tested whether using AF_INET6 + AI_V4MAPPED here is a workable solution.As for patching the tests to work on an IPv6-only host (once the other problems in the main codebase have been fixed), it should be mostly a matter of replacing 127.0.0.1
by [::1]
in various places; I don't have a comprehensive patch for that yet.
Can you confirm this please, because I can reproduce this behavior from
0.16.0
(incl.) onwards.
Tested again on a Fedora 39 build root. 0.17.1
can still be built and tested fine. However because of canonical/raft#263 I'm still using this patch for that build and that I haven't adjusted it for 0.18.0 yet.
If I remove this patch the tests indeed also fail with 0.17.1
.
I tried to update dqlite's packaging in Debian to v1.16.4, but am encountering this same issue when trying to build with the now-bundled raft source. tcp_connect/closeDuringConnectAbort
still hangs, and there are several additional test failures (due to output size, I'm not including the various failed tests):
180 of 233 (77%) tests successful, 11 (5%) test skipped.
FAIL raft-uv-integration-test (exit status: 1)
Trying to build with the current cowsql/raft fork results in an error:
libtool: link: gcc -std=c11 -g3 -fcf-protection --param=ssp-buffer-size=4 -pipe -fno-strict-aliasing -fdiagnostics-color -fexceptions -fstack-clash-protection -fstack-protector-strong -fasynchronous-unwind-tables -fdiagnostics-show-option -Wall -Wextra -Wimplicit-fallthrough=5 -Wcast-align -Wstrict-prototypes -Wlogical-op -Wmissing-include-dirs -Wold-style-definition -Winit-self -Wfloat-equal -Wsuggest-attribute=noreturn -Wformat=2 -Wshadow -Wendif-labels -Wdate-time -Wnested-externs -Wconversion -Werror -O2 -Wno-conversion -g -O2 -ffile-prefix-map=/build/dqlite-1.16.4=. -fstack-protector-strong -fstack-clash-protection -Wformat -Werror=format-security -fcf-protection -Wl,-z -Wl,relro -Wl,-z -Wl,now -o integration-test test/integration/integration_test-test_client.o test/integration/integration_test-test_cluster.o test/integration/integration_test-test_fsm.o test/integration/integration_test-test_membership.o test/integration/integration_test-test_node.o test/integration/integration_test-test_role_management.o test/integration/integration_test-test_server.o test/integration/integration_test-test_vfs.o test/integration/integration_test-main.o ./.libs/libtest.a -luv -ldl -lrt -lpthread -lsqlite3 -lraft ./.libs/libdqlite.so -pthread -Wl,-rpath -Wl,/build/dqlite-1.16.4/.libs
/usr/bin/ld: ./.libs/libdqlite.so: undefined reference to `raft_register_state_cb'
collect2: error: ld returned 1 exit status
Possibly relevant versions: Debian sid LXD container with Debian kernel 6.1.0-18-amd64; libsqlite 3.45.1; libuv 1.48.0; liblz4 1.9.4.
I tried to update dqlite's packaging in Debian to v1.16.4, but am encountering this same issue when trying to build with the now-bundled raft source.
tcp_connect/closeDuringConnectAbort
still hangs, and there are several additional test failures (due to output size, I'm not including the various failed tests):180 of 233 (77%) tests successful, 11 (5%) test skipped. FAIL raft-uv-integration-test (exit status: 1)
This could be fixed by applying a change like in https://github.com/cowsql/raft/pull/79, see:
https://github.com/cowsql/raft/pull/79/commits/1a00ebf0482000d8b62ea1932bb88122a951aaba
in particular.
Trying to build with the current cowsql/raft fork results in an error:
libtool: link: gcc -std=c11 -g3 -fcf-protection --param=ssp-buffer-size=4 -pipe -fno-strict-aliasing -fdiagnostics-color -fexceptions -fstack-clash-protection -fstack-protector-strong -fasynchronous-unwind-tables -fdiagnostics-show-option -Wall -Wextra -Wimplicit-fallthrough=5 -Wcast-align -Wstrict-prototypes -Wlogical-op -Wmissing-include-dirs -Wold-style-definition -Winit-self -Wfloat-equal -Wsuggest-attribute=noreturn -Wformat=2 -Wshadow -Wendif-labels -Wdate-time -Wnested-externs -Wconversion -Werror -O2 -Wno-conversion -g -O2 -ffile-prefix-map=/build/dqlite-1.16.4=. -fstack-protector-strong -fstack-clash-protection -Wformat -Werror=format-security -fcf-protection -Wl,-z -Wl,relro -Wl,-z -Wl,now -o integration-test test/integration/integration_test-test_client.o test/integration/integration_test-test_cluster.o test/integration/integration_test-test_fsm.o test/integration/integration_test-test_membership.o test/integration/integration_test-test_node.o test/integration/integration_test-test_role_management.o test/integration/integration_test-test_server.o test/integration/integration_test-test_vfs.o test/integration/integration_test-main.o ./.libs/libtest.a -luv -ldl -lrt -lpthread -lsqlite3 -lraft ./.libs/libdqlite.so -pthread -Wl,-rpath -Wl,/build/dqlite-1.16.4/.libs /usr/bin/ld: ./.libs/libdqlite.so: undefined reference to `raft_register_state_cb' collect2: error: ld returned 1 exit status
This is a known issue, and wouldn't be hard to fix, but since dqlite is now shipping with an embedded raft source code it's probably not worth it, unless somebody really needs that.
Thanks!
We're trying to update the version of raft in Debian and are encountering issues with the tests failing in containerized environments, but not on physical hardware or full VMs. If I build raft in a LXD sid container running on a bookworm host, I get several test failures running
test/integration/uv
(duplicates omitted for brevity):And then the test seems to hang forever:
If I build raft on my physical bookworm system or within a QEMU VM running sid, the tests all pass.
Laszlo also reports encountering the following error when attempting to build within a chroot:
I see this behavior with both raft 0.17.1 and current master. Since it seems to involve libuv, that library is currently at version 1.44.2 in both bookworm and sid. There may be some overlap with canonical/dqlite#581.