axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License
2.7k stars 393 forks source link

Running test buf-ring.t mmap ring register failed -22 #1142

Closed vt-alt closed 2 months ago

vt-alt commented 2 months ago

When running tests for 2.6 some tests fail which (of these only buf-ring.t existed att) was not failing for 2.5. Kernel is 6.1.82 in both cases:

[00:00:22] Running test buf-ring.t                                             mmap ring register failed -22
[00:00:22] test_running(1) mmap failed
[00:00:22] Test buf-ring.t failed with ret 1
[00:00:22] Running test buf-ring-nommap.t                                      Skipped
[00:00:22] Running test buf-ring-put.t                                         Skipped
..
[00:00:27] Running test defer-tw-timeout.t                                     open: Invalid argument
[00:00:27] Test defer-tw-timeout.t failed with ret 1
...
[00:00:32] Running test fixed-buf-merge.t                                      open: Invalid argument
[00:00:32] Test fixed-buf-merge.t failed with ret 1
...
[00:01:08] Tests failed (3): <buf-ring.t> <defer-tw-timeout.t> <fixed-buf-merge.t>

Temporary build log https://git.altlinux.org/tasks/347246/build/200/x86_64/log

Run under strace:

[00:00:10] + strace -Z ./test/buf-ring.t
...
[00:00:13] io_uring_register(3, IORING_REGISTER_PBUF_RING, {ring_addr=NULL, ring_entries=1, bgid=2, flags=0x1}, 1) = -1 EINVAL (Invalid argument)
[00:00:13] mmap ring register failed -22
[00:00:13] test_running(1) mmap failed
[00:00:13] +++ exited with 1 +++

When 2.6 tests are run on Linux 6.6.x buf-ring.t does not fail.

Perhaps, some backward compatibility is lost?

axboe commented 2 months ago

I thought I went over all the "skips sanely on older kernels", but looks like I missed some. In any case, an -EINVAL is really just that, the kernel doesn't support it. I'll give them another round of checking, using 6.1-stable.

vt-alt commented 2 months ago

So I will TEST_EXCLUDE them for now, but I am concerned about buf-ring.t which used to work.

axboe commented 2 months ago

No need to be concerned, tests get changed all the time. Since 2.5, buf-ring had support added for testing mmap'ed rings, which is what the older kernels don't support. This is why it fails.

vt-alt commented 2 months ago

IC. This sounds like a developers approach for testing (to test newest feature on latest kernels). But testing is also useful for QA, in case there is build environment problems, such as compiler bugs, For example LTO build could easily introduce subtle bugs. So, I am reluctant to exclude tests because this will reduce coverage. So, I'd wish tests are backward compatible and not just excludable.

axboe commented 2 months ago

Obviously test cases are written to either test and verify new features, or added when a bug is filed to ensure we don't regress there. And unfortunately the test cases don't always remember to skip testing on older kernels, which is how you end up with an issue like this. Your implication that the test cases only exist to test the "newest features on the latest kernels" is patently incorrect.

I'm quite happy to take patches correcting this. Unfortunately most reports like this only come in at the very worst possible time, right after a release. If you had tested 2 weeks ago, then it could've been squeaky clean for release. This isn't your fault, it's just the nature of the game that most people only test actual releases, not pre-releases or the -git tree.

axboe commented 2 months ago

[00:00:27] Running test defer-tw-timeout.t open: Invalid argument [00:00:32] Running test fixed-buf-merge.t open: Invalid argument

These two are an issue on your end, whatever fs you're running the tests on doesn't support O_DIRECT.

axboe commented 2 months ago

Pushed fixes.

vt-alt commented 2 months ago

IC, thanks for clarification.

Downstream distributions usually monitor for new git tags (or tarballs on GitHub releases/tags page through watchfiles which are correlate with tags), so I'd suggest you make pre-release tag like 2.6rc1 and keep it for a week or two, and this could be better noticed by testers.

[00:00:27] Running test defer-tw-timeout.t open: Invalid argument [00:00:32] Running test fixed-buf-merge.t open: Invalid argument

These two are an issue on your end, whatever fs you're running the tests on doesn't support O_DIRECT.

Yeah, this is tmpfs.

axboe commented 2 months ago

I can try and make pre-release tags. But I've done that with other projects in the past, and it hasn't made any difference whatsoever. I've been around the block a few times, not my first rodeo. But it's worth a shot if it just catches one thing.

vt-alt commented 2 months ago

Thanks for the fixes!

Well, everybody is different but I test pre-release tags. For example for liboqs: https://github.com/open-quantum-safe/liboqs/issues/1483 0.8.0-rc1 https://github.com/open-quantum-safe/liboqs/issues/1723 0.10.0-rc1

Smooth test suite really helps with that — that does not require a manual picking of working tests and auto-skips non-runnable w/o root, arch specific, or kernel version dependent tests. In that case people aren't lazy to add make check to the package recipe or test VM pipeline.

For example, for liburing previously I did a long exclude list to make it run w/o root:

TEST_EXCLUDE="
        500f9fbadef8.t
        accept.t
        cq-overflow.t
        eeed8b54e0df.t
        file-verify.t
        fpos.t
        hardlink.t
        io-cancel.t
        iopoll.t
        io_uring_register.t
        link-timeout.t
        personality.t
        read-before-exit.t
        read-write.t
        recv-msgall-stream.t
        ringbuf-read.t
        rsrc_tags.t
        sq-poll-dup.t
        sq-poll-share.t
        sync-cancel.t
        wq-aff.t
" make runtests

See in comparison, Fedora does not run tests, perhaps because they failed on the first run: https://src.fedoraproject.org/rpms/liburing/blob/rawhide/f/liburing.spec

Debian run tests but ignore test results: https://buildd.debian.org/status/fetch.php?pkg=liburing&arch=amd64&ver=2.5-1%2Bb1&stamp=1714228443&raw=0 perhaps by the same reason.

dh_auto_test -- runtests || true
    make -j6 test "TESTSUITEFLAGS=-j6 --verbose" VERBOSE=1 runtests
...
Tests failed (3): <file-verify.t> <io_uring_register.t> <sq-poll-share.t>

For me now only these tests are failing (they was in TEST_EXCLUDE so I did not notice it until now):

23:28:55 Running test accept.t                                               Broken overflow handling
23:28:55 test_multishot_accept(1, false, true) failed
23:28:55 Test accept.t failed with ret 1
...
23:29:22 Running test io_uring_register.t                                    expected Unknown error -14, but call succeeded
23:29:22 FAIL
23:29:22 Unable to map a huge page.  Try increasing /proc/sys/vm/nr_hugepages by at least 1.
23:29:22 Skipping the hugepage test
23:29:22 Test io_uring_register.t failed with ret 1
...
23:30:09 Tests failed (2): <accept.t> <io_uring_register.t>
axboe commented 2 months ago

Your bottom tests are failing exactly because they should fail, you're kernel doesn't have the necessary fixes. I just ran the tests on 6.1.89 after the two fixes, and it passes cleanly for me.

I do run them as root generally while testing, so it's possible that some will fail for a regular user. Should make that part of the usual testing too. But everything passes in the current tree as a regular user, for me, using that same 6.1-stable kernel.

axboe commented 2 months ago

Nobody should run the tests are part of building the library, that's not why they are there. It should be part of separate testing.

vt-alt commented 2 months ago

Nobody should run the tests are part of building the library, that's not why they are there. It should be part of separate testing.

They run after everything is already built and installed. make check approach is quite common. To make it completely separate process test-suite should be installed, and we can package it and run in separate test VM.

Your bottom tests are failing exactly because they should fail, you're kernel doesn't have the necessary fixes. I just ran the tests on 6.1.89 after the two fixes, and it passes cleanly for me.

Build env (which also run the after build tests) is run on 6.1.82 and I cannot change that, so perhaps it's easier to skip the tests.

For me now only these tests are failing

JFYI That is when run locally, but when run on build env two more tests are failed:

[00:00:32] Running test file-verify.t                                          open: Invalid argument
[00:00:32] O_DIRECT novec test failed
[00:00:33] Test file-verify.t failed with ret 1
...
[00:01:24] Running test sq-poll-share.t                                        open: Invalid argument
[00:01:24] Test sq-poll-share.t failed with ret 255
...
[00:01:42] Tests failed (2): <file-verify.t> <sq-poll-share.t>

I do run them as root generally while testing

That's good idea and I can try that too as we have qemu wrapper script. This could be not the latest mainline/stable but 6.6.x lts, though.

axboe commented 2 months ago

Those two tests are again failing because you're running it in an fs that doesn't support O_DIRECT.

vt-alt commented 2 months ago

That's good idea and I can try that too as we have qemu wrapper script. This could be not the latest mainline/stable but 6.6.x lts, though.

When run as "root" in qemu-system quick VM on Linux 6.6.29 with ext4 All tests passed for aarch64, x86_64 but some tests fail for i586 and ppc64le.:

i586

[00:01:59] Running test sqpoll-sleep.t                                         Test sqpoll-sleep.t failed with ret 1
...
[00:02:13] Tests failed (1): <sqpoll-sleep.t>

ppc64le

[00:00:45] Running test buf-ring-nommap.t                                      queue init failed -12
[00:00:45] Test buf-ring-nommap.t failed with ret 1
...
[00:01:47] Running test recv-multishot.t                                       connect failed
[00:01:47] t_create_socket_pair failed: 4
[00:01:47] test stream=1 wait_each=1 recvmsg=0 early_error=4  defer=0 failed
[00:01:47] Test recv-multishot.t failed with ret 1
...
[00:01:50] Running test send-zerocopy.t                                        invalid cqe->res -90 expected 65536
[00:01:50] send failed fixed buf 0, conn 0, addr 1, cork 0
[00:01:50] test_inet_send() failed (defer_taskrun 0)
[00:01:50] Test send-zerocopy.t failed with ret 1

[00:02:10] Tests failed (3): <buf-ring-nommap.t> <recv-multishot.t> <send-zerocopy.t>

❓ Should I fill these as another bug report?

ps. We can run tests in qemu but some other distros (like Fedora) may not do that (because it's common to do make check after build style testing w/o root) and they may also build over tmpfs (faster build times).

vt-alt commented 2 months ago

I can try and make pre-release tags. But I've done that with other projects in the past, and it hasn't made any difference whatsoever.

I agree that most people will ignore -rc releases altogether. So, the most effective (and visible) approach, perhaps, would be just to make patchelvel release, like 2.6.1, if it's worth it.