canonical / dqlite

Embeddable, replicated and fault-tolerant SQL engine.
https://dqlite.io

`test/integration/uv` failing when run in a container #574

Closed gibmat closed 5 months ago

gibmat commented 10 months ago

We're trying to update the version of raft in Debian and are encountering test failures in containerized environments, but not on physical hardware or full VMs. If I build raft in an LXD sid container running on a bookworm host, I get several test failures when running test/integration/uv (duplicates omitted for brevity):

recv/first                                                  [ ERROR ]
Error: test/integration/test_uv_recv.c:256: assertion failed: rv == 0 (18 == 0)
Error: child killed by signal 6 (Aborted)

Error: test/integration/test_uv_send.c:141: uv_run: condition not met in 20 iterations
Error: child killed by signal 6 (Aborted)

tcp_connect/first                                           [ ERROR ]
Error: test/integration/test_uv_tcp_connect.c:47: assertion failed: status == result->status (16 == 0)
Error: child killed by signal 6 (Aborted)

tcp_connect/refused                                         [ ERROR ]
Error: test/integration/test_uv_tcp_connect.c:222: assertion failed: string f->transport.errmsg == "uv_tcp_connect(): connection refused" ("uv_getaddrinfo(): EAI_NONAME" == "uv_tcp_connect(): connection refused")
Error: child killed by signal 6 (Aborted)

And then the test seems to hang forever:

tcp_connect/closeDuringDnsLookupAbort                       [ OK    ] [ 0.00051087 / 0.00069748 CPU ]
tcp_connect/closeDuringConnectAbort                         ^C

If I build raft on my physical bookworm system or within a QEMU VM running sid, the tests all pass.

Laszlo also reports encountering the following error when attempting to build within a chroot:

Error: test/unit/test_uv_fs.c:320: assertion failed: rv_ == RAFT_IOERR (0 == 18)

I see this behavior with both raft 0.17.1 and current master. Since it seems to involve libuv, that library is currently at version 1.44.2 in both bookworm and sid. There may be some overlap with canonical/dqlite#581.

MathieuBordere commented 10 months ago

Thanks, taking a look.

update 1: Cannot reproduce on an Ubuntu 22.04 host with a 6.2 kernel when running the test in an LXD sid container or in a bookworm container.

MathieuBordere commented 10 months ago

Aha, I can reproduce if I block my container from accessing the outside world: my container then isn't assigned an IPv4 address (an issue I fortunately ran into by accident, and which makes it possible to reproduce this problem). If I disable my host firewall, my LXD container is assigned an IPv4 address and the tests succeed. Not sure if this gives you something useful to work with... I don't know yet if this is a problem on our side.

gibmat commented 10 months ago

Ah, that's interesting, and reflects my setup. The containers I spin up for packaging work are on an IPv6-only network segment, although they do have both IPv4 and IPv6 addresses set up on the loopback interface. My bookworm machine has both addresses, and the QEMU VM only has IPv4 due to the simple NAT I configured.

I haven't looked into how the tests are trying to bind to addresses -- are they assuming there's an IPv4 address they can use on some "real" (non-loopback) interface?

cole-miller commented 10 months ago

@gibmat Thanks for the report. Just for some context -- which release of libraft are you upgrading from, and do the same tests pass in the same containerized environment when run against that earlier version?

It's entirely possible that we have some IPv4-related implicit assumptions in our networking code; I'm looking into what the specific problem might be.

cole-miller commented 10 months ago

So at least some of the failures here are caused by failing to resolve 127.0.0.1:9001, which is hardcoded for the tests here:

https://github.com/canonical/raft/blob/b68076fdc7d34a901cc2eb5eb480832c709b2532/test/lib/uv.h#L51

We could conceivably make it possible to run the tests using a different address/port.
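One way the idea above could look, as a rough sketch only: the test helper falls back to the hardcoded address unless an override is supplied. The environment variable name `RAFT_TEST_ADDRESS` and both function names are hypothetical and are not part of raft's test suite.

```c
#include <stddef.h>
#include <stdlib.h>

/* Default taken from the thread: the address the tests hardcode today. */
#define DEFAULT_TEST_ADDRESS "127.0.0.1:9001"

/* Return the override if it is a non-empty string, else the default. */
static const char *choose_test_address(const char *override)
{
    if (override == NULL || override[0] == '\0') {
        return DEFAULT_TEST_ADDRESS;
    }
    return override;
}

/* Look up the override in the environment.
 * RAFT_TEST_ADDRESS is a hypothetical variable name used for illustration. */
static const char *test_address(void)
{
    return choose_test_address(getenv("RAFT_TEST_ADDRESS"));
}
```

With something like this in place, a packager on an IPv6-only segment could export the variable before running the suite instead of patching the source.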

(At least, I was able to reproduce the same pattern of test failures by doing ip addr del on the assigned IPv4 address inside an LXD container and running the test suite there; and I traced some of the test failures in that case to a failed call to getaddrinfo against 127.0.0.1:9001 inside uvIpResolveBindAddresses.)

More generally, the code that implements the libuv TCP transport for raft is not IPv6-capable right now.

cole-miller commented 10 months ago

The reason that 127.0.0.1 can't be resolved here despite the existence of the loopback interface is that we pass the AI_ADDRCONFIG flag to getaddrinfo.
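For anyone who wants to check this in their own environment, here is a minimal standalone sketch (not raft's code) that calls getaddrinfo with a caller-chosen set of flags. The helper name `resolve_with_flags` and the choice of `AF_INET` in the hints are assumptions made for illustration. Per the Linux getaddrinfo(3) man page, with AI_ADDRCONFIG IPv4 addresses are returned only if the system has at least one IPv4 address configured, and the loopback address does not count for that purpose.

```c
#define _POSIX_C_SOURCE 200112L /* for getaddrinfo() under strict -std=c11 */

#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Resolve host:port as a TCP address using the given ai_flags.
 * Returns the getaddrinfo() result code: 0 on success, otherwise an
 * EAI_* value that can be passed to gai_strerror(). */
static int resolve_with_flags(const char *host, const char *port, int flags)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int rv;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET; /* IPv4 only: an assumption for this sketch */
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = flags;

    rv = getaddrinfo(host, port, &hints, &res);
    if (rv == 0) {
        freeaddrinfo(res);
    }
    return rv;
}
```

Calling `resolve_with_flags("127.0.0.1", "9001", 0)` and `resolve_with_flags("127.0.0.1", "9001", AI_ADDRCONFIG)` side by side inside the affected container should show whether the flag alone accounts for the EAI_NONAME seen in the test output.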

ganto commented 10 months ago

When trying to build raft 0.18.0 for Fedora 37 or 38 via a namespaced mock buildroot, I'm facing the same behaviour as @gibmat. The tests remain stuck at:

  address=localhost:9000, bind-address=localhost:9000       [ ERROR ]
Error: test/integration/test_uv_tcp_listen.c:211: assertion failed: rv == 0 (18 == 0)
Error: child killed by signal 6 (Aborted)
  address=localhost:9000, bind-address=:9000                [ ERROR ]
Error: test/integration/test_uv_tcp_listen.c:211: assertion failed: rv == 0 (18 == 0)
Error: child killed by signal 6 (Aborted)
  address=localhost:9000, bind-address=0.0.0.0:9000         [ ERROR ]
Error: test/integration/test_uv_tcp_listen.c:211: assertion failed: rv == 0 (18 == 0)
Error: child killed by signal 6 (Aborted)
tcp_connect/closeDuringSecondConnect                        [ OK    ] [ 0.00023986 / 0.00029068 CPU ]
tcp_connect/closeDuringConnectAbort

The build environment only has a loopback device:

<mock-chroot> sh-5.2# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

The exact same build works fine for raft 0.17.1.

cole-miller commented 10 months ago

Thanks for the report @ganto. I have an in-progress PR to rework how we call getaddrinfo in uv_ip.c that will hopefully enable running the test suite smoothly when there is no non-loopback IPv4 address available.

MathieuBordere commented 10 months ago

> The exact same build works fine for raft 0.17.1.

Can you confirm this please, because I can reproduce this behavior from 0.16.0 (incl.) onwards.

cole-miller commented 10 months ago

Summary of where I think the problems are and how to approach fixing them:

cole-miller commented 10 months ago

As for patching the tests to work on an IPv6-only host (once the other problems in the main codebase have been fixed), it should be mostly a matter of replacing 127.0.0.1 by [::1] in various places; I don't have a comprehensive patch for that yet.
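One detail that such a patch has to handle is that getaddrinfo expects the bare literal `::1`, not the bracketed form `[::1]` used in `host:port` strings. The following is a rough sketch of splitting both address styles; `split_host_port` is a hypothetical helper written for illustration and is not raft's actual address parser.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Split "host:port" or "[ipv6-literal]:port" into separate strings.
 * Returns true on success, false if the address is malformed or an
 * output buffer is too small. An empty host (as in ":9000") is allowed. */
static bool split_host_port(const char *address,
                            char *host, size_t host_len,
                            char *port, size_t port_len)
{
    const char *host_start = address;
    const char *host_end;
    const char *port_start;
    size_t n;

    if (address[0] == '[') {
        /* Bracketed IPv6 literal: the host sits between '[' and ']'. */
        host_start = address + 1;
        host_end = strchr(host_start, ']');
        if (host_end == NULL || host_end[1] != ':') {
            return false;
        }
        port_start = host_end + 2;
    } else {
        /* Hostname or IPv4 literal: the host ends at the last ':'. */
        host_end = strrchr(address, ':');
        if (host_end == NULL) {
            return false;
        }
        port_start = host_end + 1;
    }

    n = (size_t)(host_end - host_start);
    if (n >= host_len || *port_start == '\0' ||
        strlen(port_start) >= port_len) {
        return false;
    }
    memcpy(host, host_start, n);
    host[n] = '\0';
    strcpy(port, port_start);
    return true;
}
```

The resulting host and port strings can then be handed to getaddrinfo unchanged for both the IPv4 and the IPv6 case.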

ganto commented 10 months ago

> Can you confirm this please, because I can reproduce this behavior from 0.16.0 (incl.) onwards.

Tested again on a Fedora 39 build root. 0.17.1 can still be built and tested fine. However, because of canonical/raft#263 I'm still using this patch for that build, and I haven't adjusted it for 0.18.0 yet.

If I remove this patch the tests indeed also fail with 0.17.1.

gibmat commented 5 months ago

I tried to update dqlite's packaging in Debian to v1.16.4, but am encountering this same issue when trying to build with the now-bundled raft source. tcp_connect/closeDuringConnectAbort still hangs, and there are several additional test failures (due to output size, I'm not including the various failed tests):

180 of 233 (77%) tests successful, 11 (5%) test skipped.
FAIL raft-uv-integration-test (exit status: 1)

Trying to build with the current cowsql/raft fork results in an error:

libtool: link: gcc -std=c11 -g3 -fcf-protection --param=ssp-buffer-size=4 -pipe -fno-strict-aliasing -fdiagnostics-color -fexceptions -fstack-clash-protection -fstack-protector-strong -fasynchronous-unwind-tables -fdiagnostics-show-option -Wall -Wextra -Wimplicit-fallthrough=5 -Wcast-align -Wstrict-prototypes -Wlogical-op -Wmissing-include-dirs -Wold-style-definition -Winit-self -Wfloat-equal -Wsuggest-attribute=noreturn -Wformat=2 -Wshadow -Wendif-labels -Wdate-time -Wnested-externs -Wconversion -Werror -O2 -Wno-conversion -g -O2 -ffile-prefix-map=/build/dqlite-1.16.4=. -fstack-protector-strong -fstack-clash-protection -Wformat -Werror=format-security -fcf-protection -Wl,-z -Wl,relro -Wl,-z -Wl,now -o integration-test test/integration/integration_test-test_client.o test/integration/integration_test-test_cluster.o test/integration/integration_test-test_fsm.o test/integration/integration_test-test_membership.o test/integration/integration_test-test_node.o test/integration/integration_test-test_role_management.o test/integration/integration_test-test_server.o test/integration/integration_test-test_vfs.o test/integration/integration_test-main.o  ./.libs/libtest.a -luv -ldl -lrt -lpthread -lsqlite3 -lraft ./.libs/libdqlite.so -pthread -Wl,-rpath -Wl,/build/dqlite-1.16.4/.libs
/usr/bin/ld: ./.libs/libdqlite.so: undefined reference to `raft_register_state_cb'
collect2: error: ld returned 1 exit status

Possibly relevant versions: Debian sid LXD container with Debian kernel 6.1.0-18-amd64; libsqlite 3.45.1; libuv 1.48.0; liblz4 1.9.4.

freeekanayaka commented 5 months ago

> I tried to update dqlite's packaging in Debian to v1.16.4, but am encountering this same issue when trying to build with the now-bundled raft source. tcp_connect/closeDuringConnectAbort still hangs, and there are several additional test failures (due to output size, I'm not including the various failed tests):
>
> 180 of 233 (77%) tests successful, 11 (5%) test skipped.
> FAIL raft-uv-integration-test (exit status: 1)

This could be fixed by applying a change like in https://github.com/cowsql/raft/pull/79, see:

https://github.com/cowsql/raft/pull/79/commits/1a00ebf0482000d8b62ea1932bb88122a951aaba

in particular.

> Trying to build with the current cowsql/raft fork results in an error:
>
> libtool: link: gcc -std=c11 -g3 -fcf-protection --param=ssp-buffer-size=4 -pipe -fno-strict-aliasing -fdiagnostics-color -fexceptions -fstack-clash-protection -fstack-protector-strong -fasynchronous-unwind-tables -fdiagnostics-show-option -Wall -Wextra -Wimplicit-fallthrough=5 -Wcast-align -Wstrict-prototypes -Wlogical-op -Wmissing-include-dirs -Wold-style-definition -Winit-self -Wfloat-equal -Wsuggest-attribute=noreturn -Wformat=2 -Wshadow -Wendif-labels -Wdate-time -Wnested-externs -Wconversion -Werror -O2 -Wno-conversion -g -O2 -ffile-prefix-map=/build/dqlite-1.16.4=. -fstack-protector-strong -fstack-clash-protection -Wformat -Werror=format-security -fcf-protection -Wl,-z -Wl,relro -Wl,-z -Wl,now -o integration-test test/integration/integration_test-test_client.o test/integration/integration_test-test_cluster.o test/integration/integration_test-test_fsm.o test/integration/integration_test-test_membership.o test/integration/integration_test-test_node.o test/integration/integration_test-test_role_management.o test/integration/integration_test-test_server.o test/integration/integration_test-test_vfs.o test/integration/integration_test-main.o  ./.libs/libtest.a -luv -ldl -lrt -lpthread -lsqlite3 -lraft ./.libs/libdqlite.so -pthread -Wl,-rpath -Wl,/build/dqlite-1.16.4/.libs
> /usr/bin/ld: ./.libs/libdqlite.so: undefined reference to `raft_register_state_cb'
> collect2: error: ld returned 1 exit status

This is a known issue and wouldn't be hard to fix, but since dqlite now ships with an embedded copy of the raft source code, it's probably not worth it unless somebody really needs that.

gibmat commented 4 months ago

Thanks!