Open MatthiasMann opened 3 years ago
After some debugging, it looks more like the address of `service` gets corrupted somehow. For testing, I stored all instances of this `task<>` in a `std::vector` and also added prints: before the try/catch block the address of `service` was 0xfff1bf50, and after it is 0x41b58ab3 - no idea currently what is happening.
It's strange that echo_server doesn't have a similar issue, though they share similar code.
OK, they result in the same error when compiled with g++-10.
I have to say that they used to work -_-|| Will investigate more tomorrow.
I haven't actually tested echo server.
I also tried to compile using clang - but that had other issues on my end (it couldn't find its C++ header files, like string).
Reinstalling libc++-dev may resolve this issue.
If it were a normal clang installation for the host itself, then maybe - but in this case it's a cross-platform toolchain (OSELAS-Toolchain). I'll look into what went wrong while building it another day.
Regarding this issue, have you tried compiling the library with g++-11? When working on a coroutine library I noticed that g++-10 had a number of small bugs in the way it handled coroutines. These bugs didn't appear in clang 10, but they were still annoying to work around.
I used g++-11 and it still segfaults when running echo_server. Here's the backtrace:
```
Core was generated by `./echo_server 8080'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  io_uring_get_sqe (ring=0x82ad5fcaf127a800) at queue.c:383
383             unsigned int head = io_uring_smp_load_acquire(sq->khead);
(gdb) bt
#0  io_uring_get_sqe (ring=0x82ad5fcaf127a800) at queue.c:383
#1  0x0000555555560bec in io_service::io_uring_get_sqe_safe (this=0x82ad5fcaf127a800) at /home/cream/src/liburing4cpp/include/io_service.hpp:759
#2  0x0000555555560598 in io_service::send (this=0x82ad5fcaf127a800, sockfd=5, buf=0x5555555b1060, nbytes=2, flags=16384, iflags=0 '\000')
    at /home/cream/src/liburing4cpp/include/io_service.hpp:410
#3  0x000055555555d1bd in accept_connection(io_service&, int)::{lambda(int)#1}::operator() (frame_ptr=0x5555555b0fc0)
    at /home/cream/src/liburing4cpp/demo/echo_server.cpp:75
#4  0x000055555555cbef in io_service::_ZN10io_service10await_workEP12io_uring_sqeh.actor(io_service::_ZN10io_service10await_workEP12io_uring_sqeh.frame *) (frame_ptr=0x5555555b0f30) at /home/cream/src/liburing4cpp/include/io_service.hpp:738
#5  0x000055555555e6ab in std::__n4861::coroutine_handle<void>::resume (this=0x5555555b0f88) at /usr/include/c++/11/coroutine:126
#6  0x0000555555563475 in promise<int, true>::resolve<int&, void> (this=0x5555555b0f80, u=@0x7ffff7fc4158: 0)
    at /home/cream/src/liburing4cpp/include/promise.hpp:60
#7  0x000055555556211c in io_service::run<void, false> (this=0x7fffffffda10, t=...) at /home/cream/src/liburing4cpp/include/io_service.hpp:791
#8  0x000055555555de0a in main (argc=2, argv=0x7fffffffdc48) at /home/cream/src/liburing4cpp/demo/echo_server.cpp:104
```
@codeinred are you able to get them running? What compiler did you use?
It seems the only working compiler is clang-9. The stack gets corrupted whenever multiple coroutines run in parallel.
The crashes can easily be avoided by removing Line 24; in that case, only one connection can be accepted at a time.
All implementations of C++20 coroutine libraries I can find on the Internet are lazy - the coroutine body won't be executed before it is co_awaited, which means coroutines can never run in parallel without threads.
That makes io_uring mostly useless. I don't bother to do that.
I'd suggest using stackful coroutines for now, for both stability and performance.
With regard to coroutine libraries being lazy: in the typical case, an executor can also start a coroutine before it's co_awaited. The coroutine is lazy precisely so that all the coroutines can be registered with an executor before they're started.
As for allocations with promise objects: calling a function like `read` or `write` should just return an awaitable. These awaitables shouldn't allocate, but right now they're implemented as `task`s in your library, which leads to unnecessary allocation and may also result in a race condition.
We can avoid this allocation entirely by changing the way value resolution occurs. Rather than the promise type having a `resolve` function, we can write a really simple resolver like so:
```cpp
struct cqe_resolver {
    std::coroutine_handle<> handle;
    int result {};

    inline void resolve(int result) {
        this->result = result;
        handle.resume();
    }
};
```
We can then write a non-allocating, trivially destructible awaitable type that represents any `io_uring` operation. I'm still studying liburing, so this may not be entirely correct, but it's what I understand from reading your code:
```cpp
struct sqe_awaitable {
    cqe_resolver resolver {};
    io_uring_sqe* sqe = nullptr;
    io_uring& ring;
    int& cqe_count;
    uint8_t iflags = 0;

    // Create the awaitable. This doesn't actually register anything, as that
    // occurs once the coroutine suspends. Doing it *before* the coroutine
    // suspends has the potential to result in a race condition, but we can
    // just register it inside await_suspend()
    sqe_awaitable(io_uring& ring, int& cqe_count, uint8_t iflags = 0)
      : ring(ring)
      , cqe_count(cqe_count)
      , iflags(iflags) {}

    constexpr bool await_ready() const noexcept { return false; }

    void await_suspend(std::coroutine_handle<> handle) {
        sqe = io_uring_get_sqe(&ring);
        if (!sqe) {
            // SQ is full, flush cqe(s)
            io_uring_cq_advance(&ring, cqe_count);
            cqe_count = 0;
            io_uring_submit(&ring);
            sqe = io_uring_get_sqe(&ring);
            assert(sqe && "sqe should not be NULL");
        }
        // Register the handle with the resolver, so that it can be called
        // once resolution occurs. The resolver contains the data needed to
        // resume the coroutine once the result has been obtained; the
        // coroutine will be resumed when resolver.resolve() is invoked with
        // the io_uring_cqe::res value.
        resolver.handle = handle;
        // Register the resolver with the io_uring Submission Queue Entry,
        // so that the completion handler knows which resolver to resume.
        io_uring_sqe_set_flags(sqe, iflags);
        io_uring_sqe_set_data(sqe, &resolver);
    }

    constexpr int await_resume() const noexcept { return resolver.result; }

    void cancel() {
        io_uring_prep_cancel(sqe, &resolver, 0);
    }
};
```
Whenever you `read`, `write`, or do another async operation, it just creates and returns the above awaitable type on the stack, which registers itself with the ring. No allocation occurs, and everyone's happy!
Coincidentally, this also happens to avoid the bug I mentioned above in gcc-10. That bug only applies when awaitables created in a co_await statement have a non-trivial destructor, but because `sqe_awaitable` is non-owning, it's trivially destructible.
Hi everyone,
Given the recent code cleanups, and the addition of sqe_awaitable, I'm not sure this bug still exists. I've attached an echo client I'm using to do the test below. When I run the echo server, I get the following output:
```
build/echo_server 1234
Listening: 1234
sockfd 5 is accepted; number of running coroutines: 1
sockfd 5 is closed; number of running coroutines: 0
sockfd 5 is accepted; number of running coroutines: 1
sockfd 5 is closed; number of running coroutines: 0
sockfd 5 is accepted; number of running coroutines: 1
sockfd 5 is closed; number of running coroutines: 0
```
This comes from running the echo client, which receives the response it's supposed to:
```
alecto@styx:liburing4cpp$ ./echo-client.py --port 1234
Connecting to localhost port 1234
Sending Test message. This will be echoed
Received: Test message. Th
Received: is will be echoe
Received: d
Closing connection to the server
alecto@styx:liburing4cpp$ ./echo-client.py --port 1234
Connecting to localhost port 1234
Sending Test message. This will be echoed
Received: Test message. Th
Received: is will be echoe
Received: d
Closing connection to the server
alecto@styx:liburing4cpp$ ./echo-client.py --port 1234
Connecting to localhost port 1234
Sending Test message. This will be echoed
Received: Test message. Th
Received: is will be echoe
Received: d
Closing connection to the server
```
The echo server accepts multiple successive incoming connections without issue.
Shown below is the code for the echo client, which I took from the Python Network Programming Cookbook:
```python
#!/usr/bin/env python2
# coding=utf8
# Python Network Programming Cookbook -- Chapter - 1
# This program is optimized for Python 2.7. It may run on any
# other Python version with/without modifications.

import socket
import sys
import argparse

host = 'localhost'

def echo_client(port):
    """ A simple echo client """
    # Create a TCP/IP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # Connect the socket to the server
    server_address = (host, port)
    print "Connecting to %s port %s" % server_address
    sock.connect(server_address)

    try:
        # Send data
        message = "Test message. This will be echoed"
        print "Sending %s" % message
        sock.sendall(message)

        # Look for the response
        amount_received = 0
        amount_expected = len(message)
        while amount_received < amount_expected:
            data = sock.recv(16)
            amount_received += len(data)
            print "Received: %s" % data
    except socket.error, e:
        print "Socket error: %s" % str(e)
    except Exception, e:
        print "Other exception: %s" % str(e)
    finally:
        print "Closing connection to the server"
        sock.close()

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Socket Server Example')
    parser.add_argument('--port', action="store", dest="port", type=int, required=True)
    given_args = parser.parse_args()
    port = given_args.port
    echo_client(port)
```
Could I have some guidance on how to recreate this bug, if it still exists? I also tried running the echo server with the memory sanitizer enabled, and in valgrind, and both of those were happy with it.
Thanks for the effort! @codeinred I can still reproduce the bug using g++-11 after reverting the temporary workaround introduced here https://github.com/CarterLi/liburing4cpp/commit/324f21c3889c546e8d203ef5057926bb02c21810#diff-df6aad85015654783d708d610b1aec0ede8a3c8e29cf0057f3c7dd4ede9cbfcd
I see, thank you! This gives me a good starting point!
I think the problem before the workaround may be caused by the lambda coroutine.
https://github.com/CarterLi/liburing4cpp/blob/a9b422baaf4d5c587cbdc95379a1e2e145bddeb5/demo/echo_server.cpp#L24 Before the workaround, on this line, the lambda object captures two variables. But the closure object's lifetime ends once the inner while-loop scope exits, so after the coroutine is called and then suspended, the captured variables may already be gone while the coroutine still references them. The same thing happens in the file_server.
Based on the configuration and workaround from #8, I get a SIGSEGV:
This looks to me like the coroutine created in https://github.com/CarterLi/liburing4cpp/blob/async/demo/file_server.cpp#L102 gets destroyed before the code reaches https://github.com/CarterLi/liburing4cpp/blob/async/demo/file_server.cpp#L114.