dbus2 / zbus-old

Rust D-Bus crate.
https://gitlab.freedesktop.org/dbus/zbus

Sometimes SignalStream does not produce any value. #302

Closed zeenix closed 1 year ago

zeenix commented 1 year ago

In GitLab by @ludo-c on Nov 19, 2022, 24:55

Hello,

As discussed on Matrix, it sometimes happens that the SignalStream created from the receive_ method on the generated async Proxy does not produce any message.

That's a little bit rough, but this sample code reproduces it after a few minutes:

Steps to reproduce are in the listener readme: https://github.com/ludo-c/fake-listener/blob/main/README.md
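
For context, a minimal sketch of the listener pattern involved (the service name and signal argument names below are hypothetical stand-ins; the actual reproducer in the linked repo recreates the proxy and stream in a loop):

```rust
use futures_util::stream::StreamExt;
use zbus::{dbus_proxy, Connection, Result};

// Hypothetical proxy mirroring the producer seen in the logs below.
#[dbus_proxy(
    interface = "ludo_ic.daemon.producer",
    default_service = "ludo_ic.daemon.producer",
    default_path = "/ludo_ic/daemon/producer"
)]
trait Producer {
    // Signal with body signature "ii"; the argument names are made up.
    #[dbus_proxy(signal)]
    fn my_signal_event(&self, a: i32, b: i32) -> Result<()>;
}

fn main() -> Result<()> {
    async_std::task::block_on(async {
        let connection = Connection::session().await?;
        let proxy = ProducerProxy::new(&connection).await?;
        // `receive_my_signal_event` is generated by the macro and returns a SignalStream.
        let mut stream = proxy.receive_my_signal_event().await?;
        while stream.next().await.is_some() {
            eprintln!("Signal received.");
        }
        Ok(())
    })
}
```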

After a few minutes, the log ERROR fake_listener] Signal received. isn't shown anymore, and the only remaining logs are:

2022-11-18T22:49:15.203537Z TRACE receive_msg: 122: Message received on the socket: Msg { type: Signal, sender: UniqueName(Str(Borrowed(":1.18160"))), path: ObjectPath("/ludo_ic/daemon/producer"), iface: InterfaceName(Str(Borrowed("ludo_ic.daemon.producer"))), member: MemberName(Str(Borrowed("MySignalEvent"))), body: Signature("ii") }
2022-11-18T22:49:15.203669Z TRACE receive_msg: 131: Message broadcasted to all streams: Msg { type: Signal, sender: UniqueName(Str(Borrowed(":1.18160"))), path: ObjectPath("/ludo_ic/daemon/producer"), iface: InterfaceName(Str(Borrowed("ludo_ic.daemon.producer"))), member: MemberName(Str(Borrowed("MySignalEvent"))), body: Signature("ii") }
2022-11-18T22:49:15.203793Z TRACE receive_msg: 107: Waiting for message on the socket..

in a loop.

For what it's worth, stopping and restarting the producer solves the issue, but that's of course not a proper solution (I have no control over the producer in the real case).

I hope you will reproduce it too.

Do not hesitate to ask if you have any questions.

Thank you very much!

Edit: I reproduced it on Ubuntu 22.04.1 LTS.

zeenix commented 1 year ago

As I wrote on the Matrix channel: you drop the signal stream at the end of your loop, so you're bound to miss signals, as dropping the stream unsubscribes you from the signal.
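
Roughly, the pattern being described looks like this (a sketch reusing the hypothetical ProducerProxy from the earlier sketch):

```rust
use futures_util::stream::StreamExt;

// Sketch of the pitfall: a new stream (and signal subscription) is created
// on every iteration and dropped at the end of it.
async fn listen(proxy: &ProducerProxy<'_>) -> zbus::Result<()> {
    loop {
        let mut stream = proxy.receive_my_signal_event().await?;
        if stream.next().await.is_some() {
            eprintln!("Signal received.");
        }
        // `stream` is dropped here, which unsubscribes from the signal;
        // anything emitted before the next iteration resubscribes is missed.
    }
}
```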

zeenix commented 1 year ago

In GitLab by @ludo-c on Nov 19, 2022, 14:15

The point here is not missing signals between the end of the loop and the next subscription, but rather that the signal ends up unsubscribed.

Of course, in my app I do not create and delete the proxy all the time, but that's the only way I found to reproduce it.

Because the process creates the proxy once at launch and does not recreate it, I have no way to check that the signal is still properly subscribed.

zeenix commented 1 year ago

In GitLab by @plabedan on Nov 23, 2022, 11:58

I managed to reproduce the issue with the sample code, but:

zeenix commented 1 year ago

Thanks so much! The problem is that I wasn't able to reproduce it, even though I ran the reproducer apps for almost an hour. I hammered the CPU as much as I could (playing a video and running cargo build at the same time) but still couldn't reproduce it.

I think it's easier to reproduce with a low number of CPUs (I have 16 vCPUs on my machine). I'll create a low-end VM to try again.

From what I know so far, it looks most likely that a stream-consuming task is being starved. I'm not sure why async-std decided to fork their own task API/impl, but I wouldn't be surprised if this happens because async-std's tasks take away all the CPU time from async-task and async-io.

A workaround in the meantime could be not to use async-std but either smol's crates or tokio (don't forget to enable the tokio feature).
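
As a sketch, a tokio-based variant of the listener might look like this (assuming zbus is built with default-features = false and its "tokio" feature, plus tokio's rt and macros features; ProducerProxy is the hypothetical proxy from the earlier sketch):

```rust
use futures_util::stream::StreamExt;
use zbus::{Connection, Result};

// Assumes zbus's "tokio" feature is enabled in Cargo.toml instead of the
// default async-io/async-std based runtime support.
#[tokio::main]
async fn main() -> Result<()> {
    let connection = Connection::session().await?;
    let proxy = ProducerProxy::new(&connection).await?;
    let mut stream = proxy.receive_my_signal_event().await?;
    while stream.next().await.is_some() {
        eprintln!("Signal received.");
    }
    Ok(())
}
```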

zeenix commented 1 year ago

In GitLab by @ludo-c on Nov 23, 2022, 15:16

Thank you for your reply and for your advice!

I forgot to mention it, but I already tested with Tokio on the target, with no improvement.

Changing the async runtime from async-std to tokio in the sample app does not help; I'm still able to reproduce the issue.

I did not add any tick, as it seems it is not necessary anymore. Only the Cargo.toml file and the use lines have been changed.

Changes are here: https://github.com/ludo-c/fake-listener/commit/750ed41fa2251d364317ccdd23234f4aa2f7b88b

zeenix commented 1 year ago

@ludo-c Oh that's interesting. Thanks for this info.

Could you please also try with the multithreaded runtime?

#[tokio::main(flavor = "multi_thread", worker_threads = 2)]

Also with worker_threads = 10, if it's still reproducible with only 2?
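
For completeness, a sketch of the requested test setup (same hypothetical listener body as in the earlier sketches; only the attribute changes between the two runs):

```rust
// Run once with worker_threads = 2, then with worker_threads = 10 if needed.
#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
async fn main() -> zbus::Result<()> {
    // ... same connection / proxy / SignalStream loop as in the earlier sketch ...
    Ok(())
}
```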

zeenix commented 1 year ago

Since async-std uses a multithreaded runtime by default, I doubt this will help, but just to make sure.

zeenix commented 1 year ago

In GitLab by @ludo-c on Nov 23, 2022, 17:21

As you said, it doesn't improve things, even with #[tokio::main(flavor = "multi_thread", worker_threads = 10)]

zeenix commented 1 year ago

Thanks for trying and confirming.

zeenix commented 1 year ago

@ludo-c with tokio, you should be able to use tokio-console to see what all the tasks are up to. Can you do so when the issue is reproduced?
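
A sketch of what enabling tokio-console in the listener would involve (this assumes the console-subscriber crate, and the binary must be built with RUSTFLAGS="--cfg tokio_unstable"):

```rust
#[tokio::main]
async fn main() {
    // Starts the instrumentation that the `tokio-console` CLI connects to
    // (requires the `console-subscriber` crate and a tokio_unstable build).
    console_subscriber::init();

    // ... rest of the listener: connection, proxy, SignalStream loop ...
}
```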

From some of your logs, I see that the socket-reading task in zbus doesn't output this trace as the last one but rather this one, which means my initial assumption (that some message or signal stream isn't being read often enough and the broadcast channel gets full as a result) is not correct. If that assumption were correct, this call wouldn't return.

zeenix commented 1 year ago

In GitLab by @ludo-c on Nov 24, 2022, 15:04

I have cut the logs because they repeat in a loop. There is no "last one".

The broadcast does not block, and the log keeps running until I stop the app.

But the SignalStream in the app no longer produces anything.

zeenix commented 1 year ago

Ah, ok. If the broadcast never blocks, then it's not an issue of a blocked/starving stream after all. This is getting more and more mysterious.

zeenix commented 1 year ago

@ludo-c @plabedan if either or both of you could test !605 to see if it solves the problem, that'd be great.

zeenix commented 1 year ago

In GitLab by @ludo-c on Nov 25, 2022, 10:24

Thanks a lot for your quick response!

We have tested it on the test program shown earlier, and it indeed fixed the problem. Even with stress-ng -c 128 -m 64 -i 32.

We are currently testing it on the "real software" where we initially raised the issue, and we'll get back to you ASAP.

zeenix commented 1 year ago

Yay! I'll wait anxiously.

zeenix commented 1 year ago

In GitLab by @ludo-c on Nov 25, 2022, 16:49

We used to hit the issue about once every 15 tests; with these changes we have 100% OK on 230 tests so far. We'll leave an automated test running for 4 days.

Thank you very much!

zeenix commented 1 year ago

mentioned in commit e055ddd407c48efb5435c71a0a9b426f969a4d34

zeenix commented 1 year ago

Thanks so much. It's merged already. At this point, I'm sure the issue is resolved. If you still get a hang, it's unlikely to be the same issue, but if it is, feel free to re-open this issue.

I found some other issues in the same code, so I'll be fixing those soon. I'd humbly request that you test those changes as well, so the upcoming release works for you.

zeenix commented 1 year ago

In GitLab by @ludo-c on Nov 26, 2022, 09:29

Sure, I'll be glad to run that test on future releases.

Don't hesitate to ask me 😉