dbus2 / zbus-old

Rust D-Bus crate.
https://gitlab.freedesktop.org/dbus/zbus

Crash on fail to spawn thread #163

Closed zeenix closed 1 year ago

zeenix commented 3 years ago

In GitLab by @fluke on Apr 7, 2021, 03:44

asusd is a long-running process and can receive many messages over time. The backtrace is:

thread 'main' panicked at 'failed to spawn thread: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }', /home/luke/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:620:29
stack backtrace:
   0:     0x557e9a2de740 - std::backtrace_rs::backtrace::libunwind::trace::h4dee703919bfd40a
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/std/src/../../backtrace/src/backtrace/libunwind.rs:90:5
   1:     0x557e9a2de740 - std::backtrace_rs::backtrace::trace_unsynchronized::h457e839f1a563e20
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x557e9a2de740 - std::sys_common::backtrace::_print_fmt::h86a55fb30f8393c8
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/std/src/sys_common/backtrace.rs:67:5
   3:     0x557e9a2de740 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h7b3d6cac46d277e1
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/std/src/sys_common/backtrace.rs:46:22
   4:     0x557e9a30158f - core::fmt::write::h127419eb46f2ecc9
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/core/src/fmt/mod.rs:1092:17
   5:     0x557e9a2dae15 - std::io::Write::write_fmt::h6010cfbb4726588b
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/std/src/io/mod.rs:1578:15
   6:     0x557e9a2e07bb - std::sys_common::backtrace::_print::h79b4f9652330cc9d
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/std/src/sys_common/backtrace.rs:49:5
   7:     0x557e9a2e07bb - std::sys_common::backtrace::print::h330bb326a76af8cf
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/std/src/sys_common/backtrace.rs:36:9
   8:     0x557e9a2e07bb - std::panicking::default_hook::{{closure}}::heb6a42a7d50a472e
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/std/src/panicking.rs:208:50
   9:     0x557e9a2e0283 - std::panicking::default_hook::h17e521ba6d68d6e1
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/std/src/panicking.rs:225:9
  10:     0x557e9a2e0edd - std::panicking::rust_panic_with_hook::h70db735e3a6e70cb
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/std/src/panicking.rs:591:17
  11:     0x557e9a2e0aa7 - std::panicking::begin_panic_handler::{{closure}}::h777c71c8e5a7e25c
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/std/src/panicking.rs:497:13
  12:     0x557e9a2debfc - std::sys_common::backtrace::__rust_end_short_backtrace::h3e9bf30168899554
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/std/src/sys_common/backtrace.rs:141:18
  13:     0x557e9a2e0a09 - rust_begin_unwind
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/std/src/panicking.rs:493:5
  14:     0x557e9a2ffcf1 - core::panicking::panic_fmt::h5322a082d19786c3
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/core/src/panicking.rs:92:14
  15:     0x557e9a2ffb13 - core::option::expect_none_failed::h354eaa93a51d71b8
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/core/src/option.rs:1329:5
  16:     0x557e9a19d4dd - core::result::Result<T,E>::expect::heb48d018268ee0e4
  17:     0x557e9a19a7a0 - std::thread::spawn::h074d6679c0772f47
  18:     0x557e9a1adafd - zbus::azync::connection::Connection::receive_specific::{{closure}}::h6ec67f88432c7119
  19:     0x557e9a1abb4e - <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::hbc88bc08143b0c69
  20:     0x557e9a17e4d9 - async_io::driver::block_on::hc328be5aaeb50c39
  21:     0x557e9a1ac867 - zbus::connection::Connection::receive_message::h784628813309cd84
  22:     0x557e9a243b72 - zbus::object_server::ObjectServer::try_handle_next::h9fe0a9876df18e30
  23:     0x557e9a002587 - asusd::start_daemon::ha0f714b4a9686382
  24:     0x557e9a001373 - asusd::main::hf26ceba184a6b6b3
  25:     0x557e9a003e06 - core::ops::function::FnOnce::call_once::h5d3027d6b19c08dd
  26:     0x557e9a003d1c - std::sys_common::backtrace::__rust_begin_short_backtrace::h089ee6c3e397d8e2
  27:     0x557e9a005b5c - std::rt::lang_start::{{closure}}::h55e0553332b6cb66
  28:     0x557e9a2e12da - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::hbcc915e668c7ca11
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/core/src/ops/function.rs:259:13
  29:     0x557e9a2e12da - std::panicking::try::do_call::h6b0f430d48122ddf
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/std/src/panicking.rs:379:40
  30:     0x557e9a2e12da - std::panicking::try::h6ba420e2e21b5afa
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/std/src/panicking.rs:343:19
  31:     0x557e9a2e12da - std::panic::catch_unwind::h8366719d1f615eee
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/std/src/panic.rs:431:14
  32:     0x557e9a2e12da - std::rt::lang_start_internal::h965c28c9ce06ee73
                               at /rustc/07e0e2ec268c140e607e1ac7f49f145612d0f597/library/std/src/rt.rs:51:25
  33:     0x557e9a005b4e - std::rt::lang_start::h05154b4d3c2b5a87
  34:     0x557e9a002ec3 - main
  35:     0x7fea21c4fb75 - <unknown>
  36:     0x557e99ffd2ee - _start
  37:                0x0 - <unknown>

Where the lines:

  16:     0x557e9a19d4dd - core::result::Result<T,E>::expect::heb48d018268ee0e4
  17:     0x557e9a19a7a0 - std::thread::spawn::h074d6679c0772f47
  18:     0x557e9a1adafd - zbus::azync::connection::Connection::receive_specific::{{closure}}::h6ec67f88432c7119
  19:     0x557e9a1abb4e - <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::hbc88bc08143b0c69
  20:     0x557e9a17e4d9 - async_io::driver::block_on::hc328be5aaeb50c39
  21:     0x557e9a1ac867 - zbus::connection::Connection::receive_message::h784628813309cd84

indicate that zbus isn't handling a failure in a way that will prevent crashing.

zeenix commented 3 years ago

In GitLab by @fluke on Apr 7, 2021, 03:59

Extra detail:

Unsure if this is a Rust fault, or if there's a way zbus can catch and handle it. I see there's a thread spawn here which ignores the handle. Would it be better to keep the handle, check it for an error condition, and then retry?

zeenix commented 3 years ago

Thanks for the report.

indicate that zbus isn't handling a failure in a way that will prevent crashing.

But the crash seems to be happening during the thread::spawn call itself, so how would checking the handle help? :thinking: At first look, this seems like a std bug.

zeenix commented 3 years ago

Ah, we should be using thread::Builder if we want to catch the error instead of panicking in this case. I wonder if we can do anything better than throwing an error though. :thinking:

thread 'main' panicked at 'failed to spawn thread: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }'
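
For illustration, a minimal sketch of the difference (the thread name is a placeholder, and this is not the actual zbus code): std::thread::spawn panics when the OS refuses to create a thread, whereas std::thread::Builder::spawn returns an io::Result the caller can propagate or retry on.

```rust
use std::io;
use std::thread;

// Sketch only: spawn via Builder so an EAGAIN from the OS becomes an
// error value instead of a panic inside the library.
fn spawn_receiver() -> io::Result<thread::JoinHandle<()>> {
    thread::Builder::new()
        .name("zbus receiver".into()) // hypothetical thread name
        .spawn(|| {
            // message-receiving work would go here
        })
}
```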

Also, I wonder how we can help with the actual problem here. We only launch a single thread per connection, and even that only on demand, stopping it when it's no longer needed. Do you happen to have many connections?

zeenix commented 3 years ago

In GitLab by @fluke on Apr 7, 2021, 13:52

Actually that's a good point. I have a second daemon which handles notifications in a loop calling next_signal.

I see next_signal calls receive_specific, which creates a thread; does this signal check then cause the daemon to spawn another thread? It looks a bit like, if receive_specific is called multiple times on the server side, there is nothing actually preventing multiple threads from being spawned (given that it's an async function). Normally in my own projects I either store a handle, or I give the thread a channel it keeps which can be used to make the thread exit. A possible solution is to keep a max count and return an error or force a thread to exit.

To be clear I think the notification check loop in one user daemon is causing the thread panic in the system daemon.

zeenix commented 3 years ago

I see next_signal calls receive_specific which creates a thread, does this signal check then cause the daemon to spawn another thread?

That's correct.

It looks a bit like, if receive_specific is called multiple times on the server side, there is nothing actually preventing multiple threads from being spawned (given that it's an async function).

We're actually only launching the thread for the first caller of receive_specific, and the counter for the callers (i.e. receivers) is atomic, so I'm not sure how multiple calls to receive_specific could end up launching more than one producer thread.

or I give the thread a channel to keep which can be used to make the thread exit.

That part, we're already doing as well.
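
As a rough sketch of the mechanism described above (the names and types are assumptions, not the actual zbus internals): an atomic receiver count gates the spawn so that only the first receiver starts the producer thread, and a channel lets the last receiver tell it to exit.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Mutex};
use std::thread;

// Sketch only: not the real zbus code.
struct Inner {
    receivers: AtomicUsize,
    quit_tx: Mutex<Option<mpsc::Sender<()>>>,
}

fn receiver_added(inner: &Inner) {
    // Only the first receiver launches the producer thread.
    if inner.receivers.fetch_add(1, Ordering::SeqCst) == 0 {
        let (tx, rx) = mpsc::channel();
        *inner.quit_tx.lock().unwrap() = Some(tx);
        thread::spawn(move || {
            // Real code would read messages from the socket and fan them
            // out to receivers until told to quit; here we just wait.
            let _ = rx.recv();
        });
    }
}

fn receiver_removed(inner: &Inner) {
    // The last receiver to go away signals the producer thread to exit.
    if inner.receivers.fetch_sub(1, Ordering::SeqCst) == 1 {
        if let Some(tx) = inner.quit_tx.lock().unwrap().take() {
            let _ = tx.send(());
        }
    }
}
```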

zeenix commented 3 years ago

Oh, and you should be able to see multiple threads in a gdb backtrace as well?

zeenix commented 3 years ago

In GitLab by @fluke on Apr 7, 2021, 14:28

I wasn't able to due to the OS error.

Sorry about not seeing the things you mention above; I'm actually on my phone in bed 🙃. I'll see if I can figure out a bit more tomorrow: I'll run the daemon with a local zbus repo peppered with some dbg outputs.

zeenix commented 3 years ago

mentioned in commit 32cf9a28c43983f93d63e39e6130f7f181598673

zeenix commented 3 years ago

@fluke !281 takes care of the panic. Merging that will close this issue, but if you do find out that we're indeed launching more than one thread here, please do reopen this issue (or create a new one). !281 also names the thread now, so it should be easier to find out.

zeenix commented 3 years ago

In GitLab by @fluke on Apr 8, 2021, 01:30

Can you possibly keep a changelog so that changes can be enumerated and explained? For users updating from 1.9 to 2.0 this might be good?

I've built the daemon using your thread-fix branch, will post results in a few hours.

zeenix commented 3 years ago

In GitLab by @fluke on Apr 8, 2021, 03:29

Confirmed working fix, @zeenix. It certainly doesn't hold threads, but it does spawn threads constantly, which eventually triggers the error. I'm not 100% certain yet, but when this error happens I think I end up with a bunch of dangling threads - running my test again.

It might be worth having the connection keep a single thread open for use by receive_specific, especially since I've now created and closed 22,000 threads in the time it's taken me to type this (5 minutes or so), all from notification checks spawning them in the server.

zeenix commented 3 years ago

In GitLab by @fluke on Apr 8, 2021, 03:48

Okay yes, looks like I do get a sudden pile of dangling threads from that.

zeenix commented 3 years ago

Can you possibly keep a changelog so that changes can be enumerated and explained?

Mate! I (unlike many other crate maintainers) put in the time to write release notes manually. I probably should also have a NEWS file in the repo, but that's yet more manual labour for me.

For users updating from 1.9 to 2.0 this might be good?

As you may have noticed, 2.0 is in beta currently and can break and change. I intend to write a porting guide (perhaps as part of the book) once 2.0 is stable.

Confirmed working fix, @zeenix. It certainly doesn't hold threads, but it does spawn threads constantly, which eventually triggers the error. I'm not 100% certain yet, but when this error happens I think I end up with a bunch of dangling threads - running my test again.

Oh ok. I wonder if threads dangle if nobody joins them? I'll check that.

It might be worth having connection keep a single thread open for use by receive_specific

Right, at least as a short-term solution for this. I just don't like the idea of zbus running a thread in the first place, and it seems much worse if it does that permanently even when it's not needed. However, I'm OK with doing just that as a short-term solution if I can't find a better one.

Long term, this should really be a task, but who runs it if we don't want to depend on a specific async runtime in zbus? Perhaps (the consumer side of) receive_specific can do that while it waits for new messages?

zeenix commented 3 years ago

mentioned in commit 5c17e289fbcfc23c0bd1d82f6a86e1c898778027

zeenix commented 3 years ago

In GitLab by @fluke on Apr 8, 2021, 13:44

Oh ok. I wonder if threads dangle if nobody joins them? I'll check that.

Seems likely. I'm a bit too tired right now, but perhaps a VecDeque might be a decent solution: check them every iteration, and if the count is low the overhead is negligible.

Right, at least as a short-term solution for this. I just don't like the idea of zbus running a thread in the first place, and it seems much worse if it does that permanently even when it's not needed. However, I'm OK with doing just that as a short-term solution if I can't find a better one.

I think owning a single thread vs spawning potentially thousands of threads, each with a heavy (IOCTL?) cost, is the question. We know for certain that in my use case there are tens of thousands of threads spawning and ending, and this is going to be on well over 500 laptops (that I'm aware of so far), with the potential to run for days if not weeks.

zeenix commented 3 years ago

I think owning a single thread vs spawning potentially thousands of threads

Given that choice, yes, I agree completely. :) The spawning of thousands of threads is not intentional at all, though, so if that bug can be resolved, we're not dealing with this choice anymore.

Since you're able to reproduce, can you please check how we end up with more than one thread?

zeenix commented 3 years ago

In GitLab by @fluke on Apr 8, 2021, 21:22

That's the thing though. The code you have is correct and works correctly, but the issue is that the existing code prevents many consumers at once from starting many threads, yet it does not prevent the scenario I have.

So I think the only solution here is to own the thread... but yeah, spooling and async etc. is a problem. I think it would be entirely reasonable to chuck a short sleep in the loop.

zeenix commented 3 years ago

In GitLab by @fluke on Apr 8, 2021, 22:01

Just spitballing here, but what if the consumers lock is moved inside an owned thread and used to select if the thread spools with a sleep or skips the sleep?

zeenix commented 3 years ago

  • my client app loop checks for a notification
  • that triggers the thread spawn
  • notification completes
  • thread ends
  • back to first step

The existing code prevents many consumers at once from starting many threads, but does not prevent the scenario I have.

I'm not sure I follow now. You say that the thread correctly ends? Is the problem that the thread is launched all the time, and not that we end up launching multiple threads (at the same time)? I understood the issue was the latter:

but when this error happens I think I ended up with a bunch of dangling threads.

Just spitballing here, but what if the consumers lock is moved inside an owned thread and used to select if the thread spools with a sleep or skips the sleep?

Not sure I follow (a code sample would help), but in general I wouldn't be excited about adding sleeps in a library.

zeenix commented 3 years ago

Having said that, I'm leaning towards having a permanent thread running for each connection, with a "FIXME: Ideally this should be an async task" comment.
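
A minimal sketch of that idea (assumed names; not what actually landed in zbus): a receiver thread tied to the connection's lifetime, started once when the connection is created and only stopped and joined when the connection is dropped.

```rust
use std::sync::mpsc;
use std::thread;

// Sketch only: a receiver thread owned by the connection for its lifetime.
struct Connection {
    quit_tx: mpsc::Sender<()>,
    receiver: Option<thread::JoinHandle<()>>,
}

impl Connection {
    fn new() -> std::io::Result<Self> {
        let (quit_tx, quit_rx) = mpsc::channel();
        let receiver = thread::Builder::new()
            .name("zbus connection receiver".into()) // hypothetical name
            .spawn(move || {
                // FIXME: Ideally this should be an async task. Real code
                // would block on the D-Bus socket and fan messages out to
                // consumers; here we just wait for the quit signal.
                let _ = quit_rx.recv();
            })?;
        Ok(Connection { quit_tx, receiver: Some(receiver) })
    }
}

impl Drop for Connection {
    fn drop(&mut self) {
        // Tell the receiver thread to stop and wait for it to finish.
        let _ = self.quit_tx.send(());
        if let Some(handle) = self.receiver.take() {
            let _ = handle.join();
        }
    }
}
```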

zeenix commented 3 years ago

In GitLab by @fluke on Apr 9, 2021, 23:07

I'm not sure I follow now. You say that the thread correctly ends? Is the problem that the thread is launched all the time, and not that we end up launching multiple threads (at the same time)? I understood the issue was the latter:

Sorry, I wasn't very clear. Yes, the behaviour is correct now, except there are two issues, one a result of the other:

Scenario:

The behaviour that happens with your branch + previous:

  1. Every time the client checks for notifications with next_signal(), the server calls receive_specific(), which then creates a thread for the work.

This part is fine; it behaves as you'd expect with and without the thread-spawn method change. But...

  2. The client is calling next_signal() in an endless loop (with a 100 ms sleep to prevent CPU maxing).

This means that in the server a thread is created and ended every 100 ms. Eventually the OS says "No, you can't have a thread", causing the root issue of this bug report.

Now there's a new issue resulting from the constant thread spawning - eventually I get a small bundle of dangling threads. I suspect this happens at the "Computer says no" stage of step 2, but I never got an error in my log so I can't be sure.

A code sample is really easy: create a client and a server with a signal, and have the client check for the signal in a loop. On my system it can take an hour or more with the 100 ms sleep, so maybe with a smaller period you'll hit the issue sooner.
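
For reference, roughly the shape of the reproducer being described, against the zbus 1.x-era blocking API (the service name, path, and interface are placeholders, and the exact method names are from memory, so treat them as assumptions rather than a verified example):

```rust
use std::{thread, time::Duration};
use zbus::{Connection, Proxy};

fn main() -> zbus::Result<()> {
    let connection = Connection::new_session()?;
    let proxy = Proxy::new(
        &connection,
        "org.example.Daemon",  // hypothetical service name
        "/org/example/Daemon", // hypothetical object path
        "org.example.Daemon",  // hypothetical interface
    )?;

    // Client side of the scenario: poll for signals every 100 ms, which
    // repeatedly exercises receive_specific's on-demand thread spawn.
    loop {
        if let Some(msg) = proxy.next_signal()? {
            println!("unhandled message: {:?}", msg);
        }
        thread::sleep(Duration::from_millis(100));
    }
}
```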

zeenix commented 3 years ago

@fluke thanks for explaining in detail. :thumbsup:

This means that in the server a thread is created and ended every 100 ms. Eventually the OS says "No, you can't have a thread", causing the root issue of this bug report.

Right, so somehow it is the high-frequency creation of threads that is the issue. Interesting. Anyway, I'll (reluctantly) make the thread run per connection. Hopefully people don't use thousands of connections. :)

A code sample is really easy

Great. Could you please write it while I address the issue? Hopefully I can turn it into a testcase (unless it needs more than 5 seconds to run).

zeenix commented 3 years ago

mentioned in commit fcaac3e16c7263b1350e39ab94eeb84f79f52644

zeenix commented 3 years ago

Anyway, I'll (reluctantly) make the thread run per connection.

Oh that turned out to be much harder than I thought. :smiley_cat: I'll try another approach..

zeenix commented 3 years ago

In GitLab by @fluke on Apr 11, 2021, 23:30

Great. Could you please write it while I address the issue? Hopefully I can turn it into a testcase (unless it needs more than 5 seconds to run).

Sorry, it needs over 30 minutes. No guarantee it'll fail in a container either.

zeenix commented 3 years ago

Sorry, it needs over 30 minutes. No guarantee it'll fail in a container either.

Shame but please do provide the code if you can, so I can at least test my fix locally.

zeenix commented 3 years ago

mentioned in commit 8b76e6b89072830a64004bb96aedcad869f991a8

zeenix commented 3 years ago

There, I had to do quite a bit of surgery and rework, but I believe !283 solves your problem, along with another issue I really wanted to solve. Please give it a spin. I think the only way you'd still be able to reproduce the issue is if you create and destroy Connection instances (clones don't count) all the time. :)

zeenix commented 3 years ago

In GitLab by @fluke on Apr 13, 2021, 02:08

I'll try to get this done in the next day or so.

zeenix commented 3 years ago

In GitLab by @fluke on Apr 13, 2021, 02:08

Nice little chunk of work! Testing now, will report back in a few hours.

zeenix commented 3 years ago

In GitLab by @fluke on Apr 13, 2021, 02:21

CPU use seems to be lower too, from 0.25% to 0.05%. This is making about a 1 W difference in idle power consumption on battery (your work has lowered it!).

So even with a notification check every 10 ms, it's made a huge difference. As I understand it, creating a thread is costly because you end up calling a syscall or two for each thread?

zeenix commented 3 years ago

In GitLab by @fluke on Apr 13, 2021, 04:30

Nearly 3 hours without a flaw so far. I'd say you're good to merge this. Looking forward to the new release!

zeenix commented 3 years ago

Awesome! Yeah, launching a thread on its own is pretty cheap but not free, as it involves context switching etc., so if you do it very often the accumulated cost can get pretty significant. Thanks for checking the power consumption as well; that's helpful information.

zeenix commented 3 years ago

mentioned in commit 984428d42098e7ffefdfe67bef36c11a1fbdde8d

zeenix commented 3 years ago

Nearly 3 hours without a flaw so far. I'd say you're good to merge this. Looking forward to the new release!

Cool. I cleaned up the last patch a little and now I'll merge.

zeenix commented 3 years ago

mentioned in commit db96ead341876b210e340a3306d61611d5a04444