Closed: zeenix closed this issue 1 year ago.
As I wrote on the Matrix channel: you drop the signal stream at the end of your loop, so you're bound to miss signals, as dropping the stream unsubscribes you from the signal.
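For illustration, a minimal sketch of the pattern being pointed out, assuming zbus 3.x with the tokio feature and a made-up FakeProxy generated by the dbus_proxy macro (the interface, names and output here are hypothetical, not the reproducer's actual code); the key point is that the stream is created once and held across loop iterations, so the subscription stays alive:

use futures_util::StreamExt;
use zbus::{dbus_proxy, Connection};

// Hypothetical proxy; the real reproducer uses its own generated proxy.
#[dbus_proxy(
    interface = "org.example.Fake",
    default_service = "org.example.Fake",
    default_path = "/org/example/Fake"
)]
trait Fake {
    #[dbus_proxy(signal)]
    fn some_signal(&self) -> zbus::Result<()>;
}

#[tokio::main]
async fn main() -> zbus::Result<()> {
    let connection = Connection::session().await?;
    let proxy = FakeProxy::new(&connection).await?;

    // Subscribe once and keep the stream for the whole loop. Creating the
    // stream inside the loop body and letting it drop at the end of each
    // iteration unsubscribes and re-subscribes every time.
    let mut stream = proxy.receive_some_signal().await?;
    while let Some(_signal) = stream.next().await {
        println!("Signal received.");
    }
    Ok(())
}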
In GitLab by @ludo-c on Nov 19, 2022, 14:15
The point here is not missing a signal between the end of the loop and the next subscription, but rather that the signal ends up unsubscribed.
Of course, in my app I do not create and delete the proxy all the time, but that's the only way I found to reproduce the issue.
Because the process creates the proxy once at launch and does not recreate it, I have no way to check whether the signal is still subscribed.
In GitLab by @plabedan on Nov 23, 2022, 11:58
I managed to reproduce the issue with the sample code, but:
Thanks so much! The problem is that I wasn't able to reproduce even though I ran the reproducer apps for almost an hour. I hammered the CPU as much as I could (playing a video and cargo build at the same time) but still couldn't reproduce.
I think it's easier to reproduce with a low number of CPUs (I have 16 vCPUs on my machine). I'll create a low-end VM to try again.
From what I know so far, it looks most likely that a stream-consuming task is being starved. I'm not sure why async-std decided to fork their own task API/impl, but I wouldn't be surprised if this happens because async-std's tasks take away all the CPU time from async-task and async-io.
A workaround in the meantime could be to not use async-std but either smol's crates or tokio (don't forget to enable the tokio feature).
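For reference, a rough sketch of what that dependency switch could look like in Cargo.toml, assuming zbus 3.x (the exact feature names should be double-checked against the zbus version in use):

# Sketch: use tokio instead of zbus's default async-io based executor.
zbus = { version = "3", default-features = false, features = ["tokio"] }
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }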
In GitLab by @ludo-c on Nov 23, 2022, 15:16
Thank you for your reply and for your advice!
I forgot to mention it, but I already tested with Tokio on the target, with no improvement.
Changing the async runtime from async-std to tokio in the sample app does not help, I'm still able to reproduce the issue.
I did not add any tick as it seems it is not necessary anymore. Only the Cargo.toml file and the use lines have been changed.
Changes are here: https://github.com/ludo-c/fake-listener/commit/750ed41fa2251d364317ccdd23234f4aa2f7b88b
@ludo-c Oh that's interesting. Thanks for this info.
Could you please also try with the multithreaded runtime?
#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
and also with worker_threads = 10 if it's still reproducible with only 2?
Since async-std uses a multithreaded runtime by default, I doubt this will help, but just to make sure.
In GitLab by @ludo-c on Nov 23, 2022, 17:21
As you said, it doesn't improve things, even with #[tokio::main(flavor = "multi_thread", worker_threads = 10)].
Thanks for trying and confirming.
@ludo-c with tokio, you should be able to use tokio-console to see what all the tasks are up to. Can you do so when the issue is reproduced?
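For anyone following along, a rough sketch of the usual tokio-console setup (this is an assumption about how it would be wired into the reproducer, not something from the original code): add the console-subscriber crate, initialize it at startup, and build with the tokio_unstable cfg so tokio emits task instrumentation; the tokio-console CLI can then attach from another terminal.

// Cargo.toml (sketch): console-subscriber = "0.1"
// Build with: RUSTFLAGS="--cfg tokio_unstable" cargo build
#[tokio::main]
async fn main() {
    // Starts the tracing layer that the tokio-console CLI connects to
    // (it listens on 127.0.0.1:6669 by default).
    console_subscriber::init();

    // ... rest of the listener application ...
}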
From some of your logs I see that the socket-reading task in zbus doesn't output this trace as the last one but rather this one, which means my initial assumption, that some message or signal stream is not being read often enough and the broadcast channel is getting full as a result, is not correct. If that assumption were correct, this call wouldn't return.
In GitLab by @ludo-c on Nov 24, 2022, 15:04
I have cut the logs because these logs are in a loop. There is no "last one".
The broadcast does not block and the log keeps running until I stop the app.
But the SignalStream in the app is no longer called.
Ah ok. If the broadcast never blocks, that still means it's not an issue of a blocked/starving stream then. This is getting more and more mysterious.
@ludo-c @plabedan if either or both of you could test !605 to see if it solves the problem, that'd be great.
In GitLab by @ludo-c on Nov 25, 2022, 10:24
Thanks a lot for your quick response!
We have tested it on the test program shown earlier and it indeed fixed the problem, even with stress-ng -c 128 -m 64 -i 32.
We are currently investigating it on the "real software" where we initially raised the issue and we'll get back to you ASAP.
Yay! I'll wait anxiously.
In GitLab by @ludo-c on Nov 25, 2022, 16:49
We used to have the issue every 15 tests; with these changes we have 100% OK on 230 tests so far, and we are leaving an automated test running for 4 days.
Thank you very much!
mentioned in commit e055ddd407c48efb5435c71a0a9b426f969a4d34
Thanks so much. It's merged already. At this point, I'm sure the issue is resolved. If you still get a hang, it's unlikely to be the same issue, but if it is, feel free to re-open this issue.
I found some other issues in the same code, so I'll be fixing those soon. I'd humbly request you to test those changes as well so that the upcoming release works for you.
In GitLab by @ludo-c on Nov 26, 2022, 09:29
Sure, I'll be glad to run that test on future releases.
Don't hesitate to ask me 😉
In GitLab by @ludo-c on Nov 19, 2022, 24:55
Hello,
As discussed on Matrix, it sometimes happens that the SignalStream created from the receive_ method of the generated async Proxy does not produce any message. That's a little bit rough, but this sample code reproduces it after some minutes:
Steps to reproduce are in the listener README: https://github.com/ludo-c/fake-listener/blob/main/README.md
After a few minutes, the log
ERROR fake_listener] Signal received.
isn't shown anymore, and the remaining logs just repeat in a loop.
For what it's worth, stopping and restarting the producer solves the issue, but that's of course not the right solution (I have no control over the producer in the real case).
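As an aside, for readers who want to picture the producer side mentioned here, a rough sketch of a minimal signal emitter with zbus 3.x (the interface name, object path and timing are invented for illustration; the actual producer lives in its own codebase):

use std::time::Duration;
use zbus::{dbus_interface, ConnectionBuilder, SignalContext};

struct Fake;

#[dbus_interface(name = "org.example.Fake")]
impl Fake {
    // Declares the signal; zbus generates Fake::some_signal(ctxt) to emit it.
    #[dbus_interface(signal)]
    async fn some_signal(ctxt: &SignalContext<'_>) -> zbus::Result<()>;
}

#[tokio::main]
async fn main() -> zbus::Result<()> {
    let connection = ConnectionBuilder::session()?
        .name("org.example.Fake")?
        .serve_at("/org/example/Fake", Fake)?
        .build()
        .await?;

    let iface_ref = connection
        .object_server()
        .interface::<_, Fake>("/org/example/Fake")
        .await?;

    loop {
        // Emit the signal periodically so the listener has something to receive.
        Fake::some_signal(iface_ref.signal_context()).await?;
        tokio::time::sleep(Duration::from_millis(100)).await;
    }
}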
I hope you will be able to reproduce it too.
Do not hesitate to ask if you have any questions.
Thank you very much!
Edit: I reproduced it on Ubuntu 22.04.1 LTS.