Open MatthiasMann opened 3 years ago
After some debugging, it looks more like the address of `service` gets corrupted somehow. For testing, I stored all instances of this `task<>` in a `std::vector` and also added prints: before the try/catch block the address of `service` was 0xfff1bf50, and after it is 0x41b58ab3 - no idea currently what is happening.
It's strange that echo_server doesn't have a similar issue, though they share similar code.
OK, they result in the same error when compiled with g++-10.
I have to say that they used to work -_-|| Will investigate more tomorrow.
I haven't actually tested echo server.
I also tried to compile using clang - but that had other issues on my end (it couldn't find its C++ header files, like string).
Reinstalling libc++-dev may resolve this issue.
If it were a normal clang installation for the host itself, then maybe - but in this case it's a cross-platform toolchain (OSELAS-Toolchain). I'll look into what went wrong while building it another day.
Regarding this issue, have you tried compiling the library with g++-11? When working on a coroutine library I noticed that g++-10 had a number of small bugs in the way it handled coroutines. These bugs didn't appear in clang 10, but they were still annoying to work around.
I used g++-11 and it still segfaults when running echo_server. Here's the backtrace:
```
Core was generated by `./echo_server 8080'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  io_uring_get_sqe (ring=0x82ad5fcaf127a800) at queue.c:383
383             unsigned int head = io_uring_smp_load_acquire(sq->khead);
(gdb) bt
#0  io_uring_get_sqe (ring=0x82ad5fcaf127a800) at queue.c:383
#1  0x0000555555560bec in io_service::io_uring_get_sqe_safe (this=0x82ad5fcaf127a800) at /home/cream/src/liburing4cpp/include/io_service.hpp:759
#2  0x0000555555560598 in io_service::send (this=0x82ad5fcaf127a800, sockfd=5, buf=0x5555555b1060, nbytes=2, flags=16384, iflags=0 '\000')
    at /home/cream/src/liburing4cpp/include/io_service.hpp:410
#3  0x000055555555d1bd in accept_connection(io_service&, int)::{lambda(int)#1}::operator() (frame_ptr=0x5555555b0fc0)
    at /home/cream/src/liburing4cpp/demo/echo_server.cpp:75
#4  0x000055555555cbef in io_service::_ZN10io_service10await_workEP12io_uring_sqeh.actor(io_service::_ZN10io_service10await_workEP12io_uring_sqeh.frame *) (frame_ptr=0x5555555b0f30) at /home/cream/src/liburing4cpp/include/io_service.hpp:738
#5  0x000055555555e6ab in std::__n4861::coroutine_handle<void>::resume (this=0x5555555b0f88) at /usr/include/c++/11/coroutine:126
#6  0x0000555555563475 in promise<int, true>::resolve<int&, void> (this=0x5555555b0f80, u=@0x7ffff7fc4158: 0)
    at /home/cream/src/liburing4cpp/include/promise.hpp:60
#7  0x000055555556211c in io_service::run<void, false> (this=0x7fffffffda10, t=...) at /home/cream/src/liburing4cpp/include/io_service.hpp:791
#8  0x000055555555de0a in main (argc=2, argv=0x7fffffffdc48) at /home/cream/src/liburing4cpp/demo/echo_server.cpp:104
```
@codeinred are you able to get them running? What compiler did you use?
It seems the only working compiler is clang-9. The stack gets corrupted whenever multiple coroutines run in parallel.
The crashes can easily be avoided by removing Line 24; in that case, only one connection can be accepted at a time.
All implementations of C++20 coroutine libraries I can find on the Internet are lazy - the coroutine body won't be executed before it is co_awaited, which means coroutines can never run in parallel without threads.
That makes io_uring mostly useless. I don't bother to do that.
I'd suggest using stackful coroutines for now, for both stability and performance.
With regard to coroutine libraries being lazy: in the typical case, an executor can also start a coroutine before it's co_awaited. The coroutine is lazy precisely so that all the coroutines can be registered with an executor before they're started.
As for allocations with promise objects: calling a function like `read` or `write` should just return an awaitable. These awaitables shouldn't allocate, but right now they're implemented as `task`s in your library, which leads to unnecessary allocation and may also result in a race condition.
We can avoid this allocation entirely by changing the way value resolution occurs. Rather than the promise type having a `resolve` function, we can write a really simple resolver like so:
```cpp
struct cqe_resolver {
    std::coroutine_handle<> handle;
    int result {};

    inline void resolve(int result) {
        this->result = result;
        handle.resume();
    }
};
```
We can then write a non-allocating, trivially destructible awaitable type that represents any `io_uring` operation. I'm still studying liburing, so this may not be entirely correct, but it's what I understand from reading your code:
```cpp
struct sqe_awaitable {
    cqe_resolver resolver {};
    io_uring_sqe* sqe = nullptr;
    io_uring& ring;
    int& cqe_count;
    uint8_t iflags = 0;

    // Create the awaitable. This doesn't actually register anything, as that
    // occurs once the coroutine suspends. Doing it *before* the coroutine
    // suspends has the potential to result in a race condition, but we can
    // just register it inside await_suspend()
    sqe_awaitable(io_uring& ring, int& cqe_count, uint8_t iflags = 0)
      : ring(ring)
      , cqe_count(cqe_count)
      , iflags(iflags) {}

    constexpr bool await_ready() const noexcept { return false; }

    void await_suspend(std::coroutine_handle<> handle) {
        sqe = io_uring_get_sqe(&ring);
        if (!sqe) {
            // SQ is full, flush cqe(s)
            io_uring_cq_advance(&ring, cqe_count);
            cqe_count = 0;
            io_uring_submit(&ring);
            sqe = io_uring_get_sqe(&ring);
            assert(sqe && "sqe should not be NULL");
        }
        // Register the handle with the resolver, so that it can be called
        // once resolution occurs. The resolver contains the data needed to
        // resume the coroutine once the result has been obtained; the
        // coroutine will be resumed when resolver.resolve() is invoked with
        // the io_uring_cqe::res value.
        resolver.handle = handle;
        // Register the resolver with the io_uring Submission Queue Entry,
        // so that the completion handler knows which resolver to resume.
        io_uring_sqe_set_flags(sqe, iflags);
        io_uring_sqe_set_data(sqe, &resolver);
    }

    constexpr int await_resume() const noexcept { return resolver.result; }

    void cancel() {
        io_uring_prep_cancel(sqe, &resolver, 0);
    }
};
```
Whenever you `read`, `write`, or do another async operation, it just creates and returns the above awaitable type on the stack, which registers itself with the ring. No allocation occurs, and everyone's happy!
Coincidentally, this also happens to avoid the bug I mentioned above in gcc-10. That bug only applies when awaitables created in a co_await statement have a non-trivial destructor, but because `sqe_awaitable` is non-owning, it's trivially destructible.
Hi everyone,
Given the recent code cleanups, and the addition of sqe_awaitable, I'm not sure this bug still exists. I've attached an echo client I'm using to do the test below. When I run the echo server, I get the following output:
```
build/echo_server 1234
Listening: 1234
sockfd 5 is accepted; number of running coroutines: 1
sockfd 5 is closed; number of running coroutines: 0
sockfd 5 is accepted; number of running coroutines: 1
sockfd 5 is closed; number of running coroutines: 0
sockfd 5 is accepted; number of running coroutines: 1
sockfd 5 is closed; number of running coroutines: 0
```
This comes from running the echo client, which receives the response it's supposed to:
```
alecto@styx:liburing4cpp$ ./echo-client.py --port 1234
Connecting to localhost port 1234
Sending Test message. This will be echoed
Received: Test message. Th
Received: is will be echoe
Received: d
Closing connection to the server
alecto@styx:liburing4cpp$ ./echo-client.py --port 1234
Connecting to localhost port 1234
Sending Test message. This will be echoed
Received: Test message. Th
Received: is will be echoe
Received: d
Closing connection to the server
alecto@styx:liburing4cpp$ ./echo-client.py --port 1234
Connecting to localhost port 1234
Sending Test message. This will be echoed
Received: Test message. Th
Received: is will be echoe
Received: d
Closing connection to the server
```
The echo server accepts multiple successive incoming connections without issue.
Shown below is the code for the echo client, which I took from the Python Network Programming Cookbook:
```python
#!/usr/bin/env python2
# coding=utf8
# Python Network Programming Cookbook -- Chapter - 1
# This program is optimized for Python 2.7. It may run on any
# other Python version with/without modifications.

import socket
import sys
import argparse

host = 'localhost'

def echo_client(port):
    """ A simple echo client """
    # Create a TCP/IP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # Connect the socket to the server
    server_address = (host, port)
    print "Connecting to %s port %s" % server_address
    sock.connect(server_address)

    try:
        # Send data
        message = "Test message. This will be echoed"
        print "Sending %s" % message
        sock.sendall(message)

        # Look for the response
        amount_received = 0
        amount_expected = len(message)
        while amount_received < amount_expected:
            data = sock.recv(16)
            amount_received += len(data)
            print "Received: %s" % data
    except socket.error, e:
        print "Socket error: %s" % str(e)
    except Exception, e:
        print "Other exception: %s" % str(e)
    finally:
        print "Closing connection to the server"
        sock.close()

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Socket Server Example')
    parser.add_argument('--port', action="store", dest="port", type=int, required=True)
    given_args = parser.parse_args()
    port = given_args.port
    echo_client(port)
```
Could I have some guidance on how to recreate this bug, if it still exists? I also tried running the echo server with the memory sanitizer enabled, and in valgrind, and both of those were happy with it.
Thanks for the effort! @codeinred I can still reproduce the bug using g++-11 after reverting the temporary workaround introduced here https://github.com/CarterLi/liburing4cpp/commit/324f21c3889c546e8d203ef5057926bb02c21810#diff-df6aad85015654783d708d610b1aec0ede8a3c8e29cf0057f3c7dd4ede9cbfcd
I see, thank you! This gives me a good starting point!
I think the problem before the workaround may be caused by the lambda coroutine.
https://github.com/CarterLi/liburing4cpp/blob/a9b422baaf4d5c587cbdc95379a1e2e145bddeb5/demo/echo_server.cpp#L24 Before the workaround, on this line, the lambda object captures two variables. But the closure object's lifetime ends once the inner while-loop scope exits, so after the coroutine is called and then suspended, the captured variables may already be gone while the coroutine still references them. The same thing happens in the file_server.
Based on the configuration and workaround from #8, I get a SIGSEGV:
This looks to me like the coroutine created in https://github.com/CarterLi/liburing4cpp/blob/async/demo/file_server.cpp#L102 gets destroyed before the code reaches https://github.com/CarterLi/liburing4cpp/blob/async/demo/file_server.cpp#L114.