Race condition when client starts before server causing hanging method call

ids1024 commented 4 weeks ago

I've noticed a few issues in method calls and signals when a DBus service restarts, or isn't initially running. (Some probably being issues with C clients connecting to my zbus-based service.) So I wrote a simple test for this.

I've noticed an interesting behavior, which may or may not be related to issues I've seen before.

Server code:

use zbus::object_server::SignalEmitter;

struct FooBar;

#[zbus::interface(name = "com.ids1024.FooBar")]
impl FooBar {
    async fn ack_baz(&self) {
        println!("ack");
    }
    #[zbus(signal)]
    async fn baz(emitter: &SignalEmitter<'_>, value: u32) -> zbus::Result<()>;
}

#[tokio::main]
async fn main() -> zbus::Result<()> {
    let conn = zbus::Connection::session().await?;
    conn.request_name("com.ids1024.FooBar").await?;
    conn.object_server().at("/com/ids1024/FooBar", FooBar).await?;
    let emitter = SignalEmitter::new(&conn, "/com/ids1024/FooBar")?;
    let mut count = 0;
    loop {
        FooBar::baz(&emitter, count).await?;
        tokio::time::sleep(std::time::Duration::from_secs(1)).await;
        count = count.wrapping_add(1);
    }
}

Client:

use futures_lite::stream::StreamExt;

#[zbus::proxy(name = "com.ids1024.FooBar")]
trait FooBar {
    async fn ack_baz(&self) -> zbus::Result<()>;
    #[zbus(signal)]
    async fn baz(value: u32) -> zbus::Result<()>;
}

#[tokio::main]
async fn main() -> zbus::Result<()> {
    let conn = zbus::Connection::session().await?;
    let foobar = FooBarProxy::new(&conn, "com.ids1024.FooBar", "/com/ids1024/FooBar").await?;
    let mut stream = foobar.receive_baz().await?;
    while let Some(event) = stream.next().await {
        let value = event.args()?.value;
        // tokio::time::sleep(std::time::Duration::from_millis(100)).await;
        println!("ack...");
        if let Err(err) = foobar.ack_baz().await {
            println!("err: {err}");
        }
        println!("baz: {value}");
    }
    Ok(())
}

The server sends the Baz signal every second, and the client simply listens for this and replies with AckBaz.

If I start the client, then the server, it often (but not always) hangs at ack..., calling the AckBaz method but not getting the reply. If it doesn't hang, restarting the server likely will make it do so.

dbus-monitor --session shows the method call but not response, so I guess this is a bug in the z us server? Logging receive_message shows it does seem to get the method call message.

The issue is inconsistent, and uncommenting the sleep can fix it, so it seems to be some sort of race condition? Not sure how.

zeenix commented 4 weeks ago

@ids1024 Thanks so much for filing this. I think a colleague of mine also noticed something like this a month back but then we forgot about it. :) I'll look into this soon. Likely (hopefully) it's something small.

zeenix commented 4 weeks ago

btw, in zbus 5, you don't need SignalEmitter outside of method handlers. You can just do:

conn
    .object_server()
    // You can cache the interface ref from this call.
    .interface("/com/ids1024/FooBar")
    .await
    .baz(count)
    .await?

and also from within the method handlers, you can directly call the methods on the SignalEmitter instance you can get through the special arg.

Oh and it's almost always best to make use of connection::Builder for setting up connection and object server. It's unlikely to make any difference here though but just wanted a general suggestion.

ids1024 commented 4 weeks ago

Yeah, I know it's also possible to create the object in the connection builder (but I think should behave the same?). Good to know about that for signal emitting.

At some point I'll need to do more testing with the clients where I've seen issues related to restarting the server. It also seems like gdbus handles this differently than zbus, and proxies don't work when the server has restarted (g_bus_watch_name should be used to destroy and create them as the name changes owner.)

zeenix commented 4 weeks ago

Yeah, I know it's also possible to create the object in the connection builder (but I think should behave the same?).

Not really. If the service is launched because of a method call, the method call will be sent immediately after the service registers its name, so if the object server/interface has not yet been set up, the call will go unanswered. I actually first thought that's what is happening in your case but then I saw that the client actually waits for a signal from the service first and service only sends that out after it's all set up. Hence why I said, it's better in general to always use the connection builder to avoid such issues.

At some point I'll need to do more testing with the clients where I've seen issues related to restarting the server.

You mean other issues with zbus clients, that are different from this issue? :open_mouth:

ids1024 commented 4 weeks ago

You mean other issues with zbus clients, that are different from this issue?

I mean that I'm not sure this is the same issue I've seen in actual use. With luck it will have the same cause and be easy to fix.

zeenix commented 4 weeks ago

I was able to reproduce the issue here. I did some tracing and the issue is that the object server task is only getting launched after the method call has already been received and hence dropped (cause object server isn't there to act on it):

2024-10-30T17:21:28.450326Z TRACE socket reader: zbus::connection::socket_reader: Message received on the socket: Msg { type: MethodCall, serial: 5, sender: UniqueName(":1.659287"), path: ObjectPath("/com/ids1024/FooBar"), iface: InterfaceName("com.ids1024.FooBar"), member: MemberName("AckBaz"), fds: [] }
2024-10-30T17:21:28.450470Z TRACE start_object_server{started_event=None}:{}{obj_server_task_name="ObjectServer task"}: zbus::connection: waiting for incoming method call messages..

I'll check how this can be helped..

zeenix commented 4 weeks ago

I'll check how this can be helped..

So turns out my first hunch was right on the money. The Connection::object_server launched the server asynchronously and doesn't wait for it to be running. While connection::Builder completely eliminates this issue, I think this is still a footgun so I'll look into making Connection::object_server wait for the server setup. If we're lucky, it'd be a super trivial fix.

In the meantime, please use connection::Builder in your service code. It's always better. :)

ids1024 commented 4 weeks ago

I think I've mostly used Builder::serve_at, but good to know it makes a difference here. ObjectServer::at is necessary for dynamic/temporary objects, but at least in that case the object server should already have been created.

Looks like another issue I've seen is just due to PropertyStream getting an initial property value if a service is already running, but not if a service (that isn't socket activated) starts some time later. But it does get changes to the property that occur after start.

Seems somewhat unexpected, but not too broken. Should be possible to work-around by manually dealing with NameOwnerChangedStream.

zeenix commented 4 weeks ago

Looks like another issue I've seen is just due to PropertyStream getting an initial property value if a service is already running, but not if a service (that isn't socket activated) starts some time later. But it does get changes to the property that occur after start.

Ah ok. Could you please create another issue for that? :pray:

Seems somewhat unexpected, but not too broken. Should be possible to work-around by manually dealing with NameOwnerChangedStream.

yeah or OwnerChangedStream.

zeenix commented 3 weeks ago

Two issue with making Connection::object_server wait for the ObjectServer to be up and running before returning:

It'll require it to block, which is very unexpected (there is a reason why blocking::ObjectServer exists) and could cause issues. We could make it async but that would require another API break.
It's going to be a lot more difficult to implement than I thought.

So given that and the fact that the builder pattern should be preferred (in general but especially when timing of the object server is important ) anyway, I think it's best we don't do this and instead document the strong recommendation to prefer the builder, better. Any advice on when and how to document this, would be appreciated. Perhaps a question in our FAQ?

dbus2 / zbus

Race condition when client starts before server causing hanging method call #1109