Closed by zeenix
In GitLab by @fluke on Apr 7, 2021, 03:59
Extra detail:
Unsure if this is a Rust fault, or if there's a way zbus can catch and handle it. I see there's a thread spawn here which ignores the handle. Would it be better to keep the handle and check it for an error condition, then retry?
Thanks for the report.
indicate that zbus isn't handling a failure in a way that will prevent crashing.
But the crash seems to be happening during a thread::spawn call, so how would checking the handle help? :thinking: At first look, this seems like a std bug.
Ah, we should be using thread::Builder if we want to catch the error instead of panicking in this case. I wonder if we can do anything better than throwing an error though. :thinking:
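For illustration, a minimal sketch of the difference: thread::spawn panics if the OS refuses to create a thread, while thread::Builder::spawn returns an io::Result the caller can handle (the thread name below is made up, not what zbus uses):

```rust
use std::thread;

fn main() {
    // thread::spawn would panic with "failed to spawn thread" on EAGAIN;
    // thread::Builder::spawn surfaces the same failure as an io::Result.
    match thread::Builder::new()
        .name("msg-receiver".into())
        .spawn(|| {
            // ... receive messages here ...
        }) {
        Ok(handle) => handle.join().expect("receiver thread panicked"),
        Err(e) => eprintln!("could not spawn receiver thread: {e}"),
    }
}
```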
thread 'main' panicked at 'failed to spawn thread: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }'
I also wonder how we can help with the actual problem. We only launch a single thread per connection, and even that only on demand, and we stop it when it's no longer needed. Do you happen to have many connections?
In GitLab by @fluke on Apr 7, 2021, 13:52
Actually that's a good point. I have a second daemon which handles notifications in a loop calling next_signal.
I see next_signal calls receive_specific which creates a thread; does this signal check then cause the daemon to spawn another thread? It looks a bit like if receive_specific is called multiple times on the server side, there is nothing actually preventing multiple threads being spawned (given that it's an async function). Normally in my own projects I either store a handle, or I give the thread a channel to keep which can be used to make the thread exit. A possible solution is to keep a max count and return an error, or force a thread to exit.
To be clear, I think the notification check loop in one user daemon is causing the thread panic in the system daemon.
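Roughly the kind of channel-based shutdown pattern I mean (generic Rust, not zbus code, all names made up):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// The worker keeps the receiving end of a channel; signalling or dropping
// the sender tells it to exit, and the handle lets the owner join it.
fn spawn_worker() -> (mpsc::Sender<()>, thread::JoinHandle<()>) {
    let (shutdown_tx, shutdown_rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || loop {
        // ... do one unit of work here ...
        match shutdown_rx.recv_timeout(Duration::from_millis(100)) {
            Ok(()) | Err(mpsc::RecvTimeoutError::Disconnected) => break,
            Err(mpsc::RecvTimeoutError::Timeout) => {} // keep going
        }
    });
    (shutdown_tx, handle)
}

fn main() {
    let (shutdown, handle) = spawn_worker();
    // ... later, when the thread is no longer needed:
    let _ = shutdown.send(());
    handle.join().unwrap();
}
```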
I see next_signal calls receive_specific which creates a thread; does this signal check then cause the daemon to spawn another thread?
That's correct.
It looks a bit like if receive_specific is called multiple times on the server side, there is nothing actually preventing multiple threads being spawned (given that it's an async function).
We're actually only launching the thread if it's the first caller to receive_specific, and the counter for the callers (i.e. receivers) is atomic, so I'm not sure how multiple calls of receive_specific could end up launching more than one producer thread.
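Roughly this kind of atomic receiver counting, just to make the idea concrete (a simplified sketch, not the actual zbus code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Only the 0 -> 1 transition of the receiver count spawns the producer
// thread; later callers just register themselves and wait for messages.
static RECEIVERS: AtomicUsize = AtomicUsize::new(0);

fn receive_specific_sketch() {
    if RECEIVERS.fetch_add(1, Ordering::SeqCst) == 0 {
        thread::spawn(|| {
            // Produce messages for as long as someone is listening.
            while RECEIVERS.load(Ordering::SeqCst) > 0 {
                // ... read a message from the socket and dispatch it ...
                thread::yield_now();
            }
        });
    }
    // ... wait for the message this caller is interested in ...
    RECEIVERS.fetch_sub(1, Ordering::SeqCst);
}

fn main() {
    receive_specific_sketch();
}
```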
or I give the thread a channel to keep which can be used to make the thread exit.
Oh, and you should be able to see multiple threads in a gdb backtrace as well?
In GitLab by @fluke on Apr 7, 2021, 14:28
I wasn't able to due to the OS error.
Sorry about not seeing the things you mention above, I'm actually on my phone in bed 🙃. I will see if I can figure out a bit more tomorrow, I'll run the daemon with a local zbus repo peppered with some dbg outputs.
mentioned in commit 32cf9a28c43983f93d63e39e6130f7f181598673
@fluke !281 takes care of the panic. Merging that will close this issue, but if you do find out that we're indeed launching more than one thread here, please do reopen this issue (or create a new one). !281 also names the thread now, so it should be easier to find out.
In GitLab by @fluke on Apr 8, 2021, 01:30
Can you possibly keep a changelog so that changes can be enumerated and explained? For users updating from 1.9 to 2.0 this might be good?
I've built the daemon using your thread-fix branch, will post results in a few hours.
In GitLab by @fluke on Apr 8, 2021, 03:29
Confirmed working fix @zeenix. It certainly doesn't hold threads, but it does spawn threads constantly, which eventually triggers the error. I'm not 100% certain yet, but when this error happens I think I ended up with a bunch of dangling threads - running my test again.
It might be worth having the connection keep a single thread open for use by receive_specific. Especially since I've now created and closed 22,000 threads in the time it took me to type this (5 minutes or so), all from notification checks spawning them in the server.
In GitLab by @fluke on Apr 8, 2021, 03:48
Okay yes, looks like I do get a sudden pile of dangling threads from that.
Can you possibly keep a changelog so that changes can be enumerated and explained?
Mate! I (unlike many other crate maintainers) put in the time to write release notes manually. I probably should also have a NEWS file in the repo, but that's yet another piece of manual labour for me.
For users updating from 1.9 to 2.0 this might be good?
As you may have noticed, 2.0 is currently in beta and things can break and change. I intend to write a porting guide (perhaps as part of the book) once 2.0 is stable.
Confirmed working fix @zeenix. It certainly doesn't hold threads, but it does spawn threads constantly, which eventually triggers the error. I'm not 100% certain yet, but when this error happens I think I ended up with a bunch of dangling threads - running my test again.
Oh ok. I wonder if threads dangle if nobody joins them? I'll check that.
It might be worth having the connection keep a single thread open for use by receive_specific.
Right, at least as a short-term solution for this. I just don't like the idea of zbus running a thread in the first place, and it just seems much worse if it does that permanently even when it's not needed. However, I'm ok with doing just that as a short-term solution if I can't find a better one.
Long term, this should really be a task, but who runs it if we don't want to depend on a specific async runtime in zbus? Perhaps (the consumer side of) receive_specific can do that when it waits for new messages?
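Something hand-wavy like this, just to illustrate the consumer driving the reading itself while it waits (illustrative names only, not the zbus API; it assumes the futures crate for the async Mutex and executor):

```rust
use std::sync::Arc;
use futures::lock::Mutex;

// Stand-in for the socket plus wire decoding.
struct RawReader;

impl RawReader {
    async fn read_next(&mut self) -> String {
        // ... await the next message from the socket ...
        String::from("message")
    }
}

// No background task or thread: whichever caller is currently awaiting a
// message takes the lock and does the reading itself.
async fn receive_specific(reader: &Arc<Mutex<RawReader>>, wanted: &str) -> String {
    loop {
        let msg = reader.lock().await.read_next().await;
        if msg == wanted {
            return msg;
        }
        // In a fuller design, non-matching messages would be handed to the
        // other waiting consumers instead of being dropped here.
    }
}

fn main() {
    let reader = Arc::new(Mutex::new(RawReader));
    let msg = futures::executor::block_on(receive_specific(&reader, "message"));
    println!("got: {msg}");
}
```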
mentioned in commit 5c17e289fbcfc23c0bd1d82f6a86e1c898778027
In GitLab by @fluke on Apr 8, 2021, 13:44
Oh ok. I wonder if threads dangle if nobody joins them? I'll check that.
Seems likely. I'm a bit too tired right now, but perhaps a VecDeque of the handles might be a decent solution (rough sketch below): check them every iteration, and if the count is low it's negligible overhead.
Right, at least as a short-term solution for this. I just don't like the idea of zbus running a thread in the first place, and it just seems much worse if it does that permanently even when it's not needed. However, I'm ok with doing just that as a short-term solution if I can't find a better one.
I think owning a single thread vs spawning potentially thousands of threads, each with a heavy (IOCTL?) cost, is the question. We know for certain in my use case that there are tens of thousands of threads spawning and ending, and this is going to be on well over 500 laptops (that I'm aware of so far), with the potential to run for days if not weeks.
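Roughly what the VecDeque idea above could look like (a sketch only; JoinHandle::is_finished needs Rust 1.61+, an atomic flag set by the thread would do the same on older toolchains):

```rust
use std::collections::VecDeque;
use std::thread::{self, JoinHandle};

// Keep the handles of spawned threads and, on every iteration, join the
// ones that have already finished so nothing is left dangling.
fn reap_finished(handles: &mut VecDeque<JoinHandle<()>>) {
    for _ in 0..handles.len() {
        if let Some(handle) = handles.pop_front() {
            if handle.is_finished() {
                let _ = handle.join();
            } else {
                handles.push_back(handle);
            }
        }
    }
}

fn main() {
    let mut handles: VecDeque<JoinHandle<()>> = VecDeque::new();
    handles.push_back(thread::spawn(|| { /* some short-lived work */ }));
    // Called once per loop iteration; negligible while the queue stays small.
    reap_finished(&mut handles);
}
```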
I think owning a single thread vs spawning potentially thousands of threads
Given that choice, yes, I agree completely. :) The spawning of thousands of threads is not intentional at all though, so if that bug can be resolved, we're not dealing with this choice anymore.
Since you're able to reproduce, can you please check how we end up with more than one thread?
In GitLab by @fluke on Apr 8, 2021, 21:22
That's the thing though. The code you have is correct and works correctly. But the issue is this: the existing code prevents many consumers at once from starting many threads, yet it doesn't prevent the scenario I have. So I think the only solution here is to own the thread... but yeah, spooling and async etc. is a problem. I think it would be entirely reasonable to chuck a short sleep in the loop.
In GitLab by @fluke on Apr 8, 2021, 22:01
Just spitballing here, but what if the consumers lock is moved inside an owned thread and used to select if the thread spools with a sleep or skips the sleep?
- my client app loop checks for a notification,
- that triggers the thread spawn,
- notification completes
- thread ends
- back to first step
The existing code prevents many consumers at once from starting many threads, yet it doesn't prevent the scenario I have.
I'm not sure I follow now. You say that the thread correctly ends? Is the problem that the thread is launched all the time, and not that we end up launching multiple threads (at the same time)? I understood the issue was the latter:
but when this error happens I think I ended up with a bunch of dangling threads.
Just spitballing here, but what if the consumers lock is moved inside an owned thread and used to select if the thread spools with a sleep or skips the sleep?
Not sure I follow (some code sample would help) but in general I wouldn't be excited about adding sleeps in a library.
Having said that, I'm leaning towards having a permanent thread running for each connection, with a "FIXME: Ideally this should be an async task" comment.
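That is, very roughly something like this (illustrative names only, not the actual zbus internals):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// FIXME: Ideally this should be an async task instead of a thread.
// One long-lived reader thread is started when the connection is set up;
// every receive then consumes from its channel instead of spawning anew.
fn start_reader() -> mpsc::Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || loop {
        // Placeholder for blocking on the next message from the socket.
        thread::sleep(Duration::from_millis(100));
        if tx.send(String::from("message")).is_err() {
            // All receivers are gone: the connection was dropped, so stop.
            break;
        }
    });
    rx
}

fn main() {
    let messages = start_reader();
    // Every "receive" call reuses the same thread's output.
    if let Ok(msg) = messages.recv_timeout(Duration::from_secs(1)) {
        println!("got: {msg}");
    }
}
```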
In GitLab by @fluke on Apr 9, 2021, 23:07
I'm not sure I follow now. You say that the thread correctly ends? Is the problem that the thread is launched all the time, and not that we end up launching multiple threads (at the same time)? I understood the issue was the latter:
Sorry, I wasn't very clear. Yes, the behaviour is correct now, except there are two issues, one a result of the other:
Scenario (the behaviour that happens with your branch + the previous one):
1. I call next_signal(); the server calls receive_specific(), which then creates a thread for the work. This part is fine, it behaves as you'd expect with and without the thread spawn method change. But...
2. I call next_signal() in an endless loop (with a 100ms sleep to prevent CPU maxing). This means that in the server a thread is created and ended every 100ms. Eventually the OS says "No you can't have a thread" causing the root issue of this bug report.
Now there's a new issue resulting from the constant thread spawning - eventually I get a small bundle of dangling threads. I suspect this happens at the "Computer says no" stage of step 2, but I never got an error in my log so I can't be sure.
A code sample is real easy: create a client and a server with a signal, and have the client check for the signal in a loop. On my system it can take an hour or more with the 100ms sleep, so maybe with a smaller period you'll hit the issue sooner.
@fluke thanks for explaining in detail. :thumbsup:
This means that in the server a thread is created and ended every 100ms. Eventually the OS says "No you can't have a thread" causing the root issue of this bug report.
Right, so somehow it is the creation of threads at high frequency that is the issue. Interesting. Anyway, I'll (reluctantly) make the thread run per connection. Hopefully people don't use thousands of connections. :)
A code sample is real easy
Great. Could you please write it while I address the issue? Hopefully I can turn it into a testcase (unless it needs more than 5 seconds to run).
mentioned in commit fcaac3e16c7263b1350e39ab94eeb84f79f52644
Anyway, I'll (reluctantly) make the thread run per connection.
Oh, that turned out to be much harder than I thought. :smiley_cat: I'll try another approach...
In GitLab by @fluke on Apr 11, 2021, 23:30
Great. Could you please write it while I address the issue? Hopefully I can turn it into a testcase (unless it needs more than 5 seconds to run).
Sorry, it needs over 30 minutes. No guarantee it'll fail in a container either.
Sorry, it needs over 30 minutes. No guarantee it'll fail in a container either.
Shame but please do provide the code if you can, so I can at least test my fix locally.
mentioned in commit 8b76e6b89072830a64004bb96aedcad869f991a8
There, I had to do quite a bit of surgery and rework, but I believe !283 solves your problem, along with another issue I really wanted to solve. Please give it a spin. I think the only way you'd still be able to reproduce the issue is if you create (clones don't count) and destroy Connection instances all the time. :)
In GitLab by @fluke on Apr 13, 2021, 02:08
I'll try to get this done in the next day or so.
In GitLab by @fluke on Apr 13, 2021, 02:08
Nice little chunk of work! Testing now, will report back in a few hours.
In GitLab by @fluke on Apr 13, 2021, 02:21
CPU use seems to be lower too, from 0.25% to 0.05%. This is making about a 1W difference in idle power consumption on battery (your work has lowered it!).
So even with a notification check every 10ms, it's made a huge difference. As I understand it, creating a thread is costly because you end up making a syscall or two for each thread?
In GitLab by @fluke on Apr 13, 2021, 04:30
Nearly 3 hours without flaw so far. I'd say you're good to merge this. Looking forward to new release!
Awesome! Yeah, thread launching on its own is pretty cheap but not free, as it involves context switching etc., so if you do it very often the accumulated cost can get pretty significant. Thanks for checking the power consumption as well, that's helpful information.
mentioned in commit 984428d42098e7ffefdfe67bef36c11a1fbdde8d
Nearly 3 hours without flaw so far. I'd say you're good to merge this. Looking forward to new release!
Cool. I cleaned up the last patch a little and now I'll merge.
mentioned in commit db96ead341876b210e340a3306d61611d5a04444
In GitLab by @fluke on Apr 7, 2021, 03:44
asusd is a long-running process and can receive many messages over time. The BT is:
Where the lines:
indicate that zbus isn't handling a failure in a way that will prevent crashing.