Open bugaevc opened 4 years ago
I think this function returned an error:
$ cargo build
That should be cargo test
.
That should be
cargo test
.
Right, fixed.
Can I ask you, how did you find this issue? :smile:
I looked the line mentioned at
16: tokio::runtime::basic_scheduler::BasicScheduler<P>::block_on::{{closure}}
at ./src/runtime/basic_scheduler.rs:158
this was a park
call followed by an unwrap
. This means that the park
call was the one returning an error. I then looked up the backtrace to search for the correct type. By the way I looked a bit more, and I think I found the wrong Park
implementation. It seems that all Park
implementations that may be relevant are called Driver
.
I think it is more likely that it actually was https://github.com/tokio-rs/tokio/blob/d8139fef7a555f99adaa37e139787d26e24daba3/tokio/src/io/driver/mod.rs#L175-L178, in which case it would have been a poll
call that returned an error: https://github.com/tokio-rs/tokio/blob/d8139fef7a555f99adaa37e139787d26e24daba3/tokio/src/io/driver/mod.rs#L107. This poll
function is defined at https://github.com/tokio-rs/mio/blob/976f2354d0e8fbbb64fba3bf017d7131f9c369a0/src/poll.rs#L311. This indirectly calls the kevent
syscall: https://github.com/tokio-rs/mio/blob/976f2354d0e8fbbb64fba3bf017d7131f9c369a0/src/sys/unix/selector/kqueue.rs#L109-L116.
Maybe kevent
is not yet fully implemented?
I looked the line mentioned at
Ah, no, that's not what I meant. That's easy to find indeed.
I mean, I didn't expect to see you show up and start helping us debug this issue within half an hour of me filing it. But I see you've already been using Darling & commenting here in the past, cool.
Actually I haven't used Darling :) It is just a nice project I may want to try in the future.
Maybe kevent is not yet fully implemented?
It's very likely to have issues, yes.
So, I built mio, which has way less tests, and this fails:
test events_all ... FAILED
failures:
---- events_all stdout ----
thread 'events_all' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" }', tests/events.rs:26:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
But other times this fails:
test sys::unix::selector::kqueue::does_not_register_rw ... FAILED
failures:
---- sys::unix::selector::kqueue::does_not_register_rw stdout ----
thread 'sys::unix::selector::kqueue::does_not_register_rw' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 9, kind: Other, message: "Bad file descriptor" }', src/sys/unix/selector/kqueue.rs:674:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
kevent()
call here seems to indeed return Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" }
So first of all kevent()
/kqueue()
are returning their errors wrong. libkqueue sets errno and returns -1, but our syscall emulation layer has to return -error
instead. After fixing that, that error turns into Os { code: 2, kind: NotFound, message: "No such file or directory" }
, so, an ENOENT
. Side note: ENOENT
is not a correct error to return when something that's not a file is not found, but kevent
apparently does.
Now it would seem ENOENT
is coming from here:
and I see this in strace
log:
[pid 160487] epoll_create(1) = 3
[pid 160487] ioctl(1023, _IOC(_IOC_NONE, 0x10, 0x35, 0), 0) = -1 EINVAL (Invalid argument)
[pid 160487] fcntl(3, F_SETFD, FD_CLOEXEC) = 0
[pid 160487] dup(3) = 4
[pid 160487] close(4) = 0
[pid 160487] gettid() = 1834
[pid 160487] close(3) = 0
[pid 160487] close(3) = -1 EBADF (Bad file descriptor)
it closes the kqueue — actually, twice! (which is an issue, because it might accidentally close another thread's fd).
kqueue debug log:
KQ [1848]: kqueue_impl(): created kqueue, fd=3
KQ [1848]: map_insert(): inserted 0x7f83a8802c00 in location 3
KQ [1848]: kevent64_impl(): --- kevent 1 --- (nchanges = 2, nevents = 0)
KQ [1848]: kevent64_impl(): waiting for &(kq)->kq_mtx
KQ [1848]: kevent64_impl(): locked &(kq)->kq_mtx
KQ [1848]: kevent_copyin(): nchanges=2 nevents=0
KQ [1848]: kevent_copyin_one(): src={ ident=5, filter=-1 (EVFILT_READ), flags=0x0002 (EV_DELETE), fflags=0x0100 (), data=515, udata=100000000000000 }
KQ [1848]: knote_lookup(): id=5 ent=0x0
KQ [1848]: kevent_copyin_one(): knote_lookup: ident 5 == 0x0
KQ [1848]: kevent_copyin_one(): no entry found for ident=5
KQ [1848]: kevent_copyin(): errno=No such file or directory
KQ [1848]: kevent64_impl(): unlocked &(kq)->kq_mtx
KQ [1848]: kevent64_impl(): (1) changelist: rv=-1
KQ [1848]: kevent64_impl(): --- END kevent 1 ret -1 ---
so maybe I'm looking in the wrong place.
OK, I know what this is! :bulb: (The failing Mio test, at least).
Waker::new()
calls selector.try_clone()
, which dupes the kqueue fd. Then of course libkqueue fails to look up the kqueue by the new fd.
We need to pipe dup()
and dup2()
through libkqueue like we do for close()
. cc @LubosD
Ok, that mio test now masses. Next up:
running 13 tests
test is_send_and_sync ... ok
test poll_erroneous_registration ... ok
test poll_registration ... ok
test poll_closes_fd ... FAILED
thread '<unnamed>' panicked at 'set_read_timeout: Os { code: 33, kind: Other, message: "Numerical argument out of domain" }', tests/poll.rs:124:9
test add_then_drop ... FAILED
https://doc.rust-lang.org/stable/std/net/struct.TcpStream.html#method.set_read_timeout
The problem seems to be in setsockopt
with SO_RCVTIMEO
.
Ah yes, that must be because struct timeval
's layout differs between Linux and macOS.
Status update: I've found lots more issues with libkqueue; I've tried to fix some of them, but the fixes introduce more hacks, and some issues seem impossible to fix in userspace. Discussed it with @LubosD, he agrees that it would make sense to drop linkqueue and implement kqueue in kernel space, and even has ideas about implementing more integration between Linux and XNU wait queues.
Good news is most test pass!
Steps to reproduce follow.
Clone Tokio:
Build it & run the tests: