darlinghq / darling

Darwin/macOS emulation layer for Linux
http://www.darlinghq.org
GNU General Public License v3.0
11.43k stars 442 forks source link

Tokio test failures #766

Open bugaevc opened 4 years ago

bugaevc commented 4 years ago

Good news is most test pass!

test remove ... FAILED

failures:

---- remove stdout ----
thread 'remove' panicked at 'failed to park', /private/tmp/tokio/tokio/src/runtime/basic_scheduler.rs:158:29
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.44/src/backtrace/libunwind.rs:86
   1: backtrace::backtrace::trace_unsynchronized
             at /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.44/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:78
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1063
   5: std::io::Write::write_fmt
             at /rustc/4fb7144ed159f94491249e86d5bbd033b5d60550/src/libstd/io/mod.rs:1426
   6: std::io::impls::<impl std::io::Write for alloc::boxed::Box<W>>::write_fmt
             at src/libstd/io/impls.rs:156
   7: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   8: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   9: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:204
  10: std::panicking::default_hook
             at src/libstd/panicking.rs:221
  11: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:470
  12: rust_begin_unwind
             at src/libstd/panicking.rs:378
  13: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
  14: core::option::expect_failed
             at src/libcore/option.rs:1203
  15: core::option::Option<T>::expect
             at /rustc/4fb7144ed159f94491249e86d5bbd033b5d60550/src/libcore/option.rs:347
  16: tokio::runtime::basic_scheduler::BasicScheduler<P>::block_on::{{closure}}
             at ./src/runtime/basic_scheduler.rs:158
  17: tokio::runtime::basic_scheduler::enter::{{closure}}
             at ./src/runtime/basic_scheduler.rs:213
  18: tokio::macros::scoped_tls::ScopedKey<T>::set
             at ./src/macros/scoped_tls.rs:64
  19: tokio::runtime::basic_scheduler::enter
             at ./src/runtime/basic_scheduler.rs:213
  20: tokio::runtime::basic_scheduler::BasicScheduler<P>::block_on
             at ./src/runtime/basic_scheduler.rs:123
  21: tokio::runtime::Runtime::block_on::{{closure}}
             at ./src/runtime/mod.rs:444
  22: tokio::runtime::context::enter
             at ./src/runtime/context.rs:72
  23: tokio::runtime::handle::Handle::enter
             at ./src/runtime/handle.rs:71
  24: tokio::runtime::Runtime::block_on
             at ./src/runtime/mod.rs:441
  25: fs_dir::remove
             at tokio/tests/fs_dir.rs:31
  26: fs_dir::remove::{{closure}}
             at tokio/tests/fs_dir.rs:31
  27: core::ops::function::FnOnce::call_once
             at /rustc/4fb7144ed159f94491249e86d5bbd033b5d60550/src/libcore/ops/function.rs:232
  28: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
             at /rustc/4fb7144ed159f94491249e86d5bbd033b5d60550/src/liballoc/boxed.rs:1017
  29: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:86
  30: std::panicking::try
             at /rustc/4fb7144ed159f94491249e86d5bbd033b5d60550/src/libstd/panicking.rs:281
  31: std::panic::catch_unwind
             at /rustc/4fb7144ed159f94491249e86d5bbd033b5d60550/src/libstd/panic.rs:394
  32: test::run_test_in_process
             at src/libtest/lib.rs:542
  33: test::run_test::run_test_inner::{{closure}}
             at src/libtest/lib.rs:451
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Steps to reproduce follow.

  1. Install Rust via rustup:
$ curl -L sh.rustup.rs > rustup-install.sh
$ sh rustup-install.sh -y
  1. Clone Tokio:

    $ git clone https://github.com/tokio-rs/tokio.git
  2. Build it & run the tests:

$ cd tokio
$ cargo test
bjorn3 commented 4 years ago

I think this function returned an error:

https://github.com/tokio-rs/tokio/blob/57ba37c97854d32e691ea68006c8d69d58c79b23/tokio/src/time/driver/mod.rs#L239-L268

bjorn3 commented 4 years ago

$ cargo build

That should be cargo test.

bugaevc commented 4 years ago

That should be cargo test.

Right, fixed.

bugaevc commented 4 years ago

Can I ask you, how did you find this issue? :smile:

bjorn3 commented 4 years ago

I looked the line mentioned at

 16: tokio::runtime::basic_scheduler::BasicScheduler<P>::block_on::{{closure}}
             at ./src/runtime/basic_scheduler.rs:158

this was a park call followed by an unwrap. This means that the park call was the one returning an error. I then looked up the backtrace to search for the correct type. By the way I looked a bit more, and I think I found the wrong Park implementation. It seems that all Park implementations that may be relevant are called Driver.

I think it is more likely that it actually was https://github.com/tokio-rs/tokio/blob/d8139fef7a555f99adaa37e139787d26e24daba3/tokio/src/io/driver/mod.rs#L175-L178, in which case it would have been a poll call that returned an error: https://github.com/tokio-rs/tokio/blob/d8139fef7a555f99adaa37e139787d26e24daba3/tokio/src/io/driver/mod.rs#L107. This poll function is defined at https://github.com/tokio-rs/mio/blob/976f2354d0e8fbbb64fba3bf017d7131f9c369a0/src/poll.rs#L311. This indirectly calls the kevent syscall: https://github.com/tokio-rs/mio/blob/976f2354d0e8fbbb64fba3bf017d7131f9c369a0/src/sys/unix/selector/kqueue.rs#L109-L116.

Maybe kevent is not yet fully implemented?

bugaevc commented 4 years ago

I looked the line mentioned at

Ah, no, that's not what I meant. That's easy to find indeed.

I mean, I didn't expect to see you show up and start helping us debug this issue within half an hour of me filing it. But I see you've already been using Darling & commenting here in the past, cool.

bjorn3 commented 4 years ago

Actually I haven't used Darling :) It is just a nice project I may want to try in the future.

bugaevc commented 4 years ago

Maybe kevent is not yet fully implemented?

It's very likely to have issues, yes.

So, I built mio, which has way less tests, and this fails:

test events_all ... FAILED

failures:

---- events_all stdout ----
thread 'events_all' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" }', tests/events.rs:26:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

But other times this fails:

test sys::unix::selector::kqueue::does_not_register_rw ... FAILED

failures:

---- sys::unix::selector::kqueue::does_not_register_rw stdout ----
thread 'sys::unix::selector::kqueue::does_not_register_rw' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 9, kind: Other, message: "Bad file descriptor" }', src/sys/unix/selector/kqueue.rs:674:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
bugaevc commented 4 years ago

kevent() call here seems to indeed return Os { code: 1, kind: PermissionDenied, message: "Operation not permitted" }

bugaevc commented 4 years ago

So first of all kevent()/kqueue() are returning their errors wrong. libkqueue sets errno and returns -1, but our syscall emulation layer has to return -error instead. After fixing that, that error turns into Os { code: 2, kind: NotFound, message: "No such file or directory" }, so, an ENOENT. Side note: ENOENT is not a correct error to return when something that's not a file is not found, but kevent apparently does.

Now it would seem ENOENT is coming from here:

https://github.com/darlinghq/darling-libkqueue/blob/2c2d5cfd06c10ca6d4ede0416076552a5e356722/src/common/kevent.c#L336-L341

and I see this in strace log:

[pid 160487] epoll_create(1)            = 3
[pid 160487] ioctl(1023, _IOC(_IOC_NONE, 0x10, 0x35, 0), 0) = -1 EINVAL (Invalid argument)
[pid 160487] fcntl(3, F_SETFD, FD_CLOEXEC) = 0
[pid 160487] dup(3)                     = 4
[pid 160487] close(4)                   = 0
[pid 160487] gettid()                   = 1834
[pid 160487] close(3)                   = 0
[pid 160487] close(3)                   = -1 EBADF (Bad file descriptor)

it closes the kqueue — actually, twice! (which is an issue, because it might accidentally close another thread's fd).

bugaevc commented 4 years ago

kqueue debug log:

KQ [1848]: kqueue_impl(): created kqueue, fd=3
KQ [1848]: map_insert(): inserted 0x7f83a8802c00 in location 3
KQ [1848]: kevent64_impl(): --- kevent 1 --- (nchanges = 2, nevents = 0)
KQ [1848]: kevent64_impl(): waiting for &(kq)->kq_mtx
KQ [1848]: kevent64_impl(): locked &(kq)->kq_mtx
KQ [1848]: kevent_copyin(): nchanges=2 nevents=0
KQ [1848]: kevent_copyin_one(): src={ ident=5, filter=-1 (EVFILT_READ), flags=0x0002 (EV_DELETE), fflags=0x0100 (), data=515, udata=100000000000000 }
KQ [1848]: knote_lookup(): id=5 ent=0x0
KQ [1848]: kevent_copyin_one(): knote_lookup: ident 5 == 0x0
KQ [1848]: kevent_copyin_one(): no entry found for ident=5
KQ [1848]: kevent_copyin(): errno=No such file or directory
KQ [1848]: kevent64_impl(): unlocked &(kq)->kq_mtx
KQ [1848]: kevent64_impl(): (1) changelist: rv=-1
KQ [1848]: kevent64_impl(): --- END kevent 1 ret -1 ---

so maybe I'm looking in the wrong place.

bugaevc commented 4 years ago

OK, I know what this is! :bulb: (The failing Mio test, at least).

Waker::new() calls selector.try_clone(), which dupes the kqueue fd. Then of course libkqueue fails to look up the kqueue by the new fd.

We need to pipe dup() and dup2() through libkqueue like we do for close(). cc @LubosD

bugaevc commented 4 years ago

Ok, that mio test now masses. Next up:

running 13 tests
test is_send_and_sync ... ok
test poll_erroneous_registration ... ok
test poll_registration ... ok
test poll_closes_fd ... FAILED
thread '<unnamed>' panicked at 'set_read_timeout: Os { code: 33, kind: Other, message: "Numerical argument out of domain" }', tests/poll.rs:124:9
test add_then_drop ... FAILED
bjorn3 commented 4 years ago

https://doc.rust-lang.org/stable/std/net/struct.TcpStream.html#method.set_read_timeout

https://github.com/rust-lang/rust/blob/ce1ab355c27faa84f984b15800cd6a65e674b9f2/src/libstd/sys_common/net.rs#L241

https://github.com/rust-lang/rust/blob/ce1ab355c27faa84f984b15800cd6a65e674b9f2/src/libstd/sys/unix/net.rs#L292

https://github.com/rust-lang/rust/blob/ce1ab355c27faa84f984b15800cd6a65e674b9f2/src/libstd/sys_common/net.rs#L61

The problem seems to be in setsockopt with SO_RCVTIMEO.

LubosD commented 4 years ago

Ah yes, that must be because struct timeval's layout differs between Linux and macOS.

bugaevc commented 4 years ago

Status update: I've found lots more issues with libkqueue; I've tried to fix some of them, but the fixes introduce more hacks, and some issues seem impossible to fix in userspace. Discussed it with @LubosD, he agrees that it would make sense to drop linkqueue and implement kqueue in kernel space, and even has ideas about implementing more integration between Linux and XNU wait queues.