Unknown object on service activation #215

Closed zeenix closed 1 year ago

In GitLab by @elmarco on Sep 20, 2021, 22:44

I observe an activation bug (somewhat reminiscent of #68), with a system activated service on fc34:

#[async_std::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let connection = Connection::system()
        .await?
        .request_name("org.freedesktop.usbredir1")
        .await?;

    connection.object_server_mut().await.at(
        "/org/freedesktop/usbredir1",
        Interface::new(&connection).await?,
    )?;

    loop {
        let listener = connection.monitor_activity();
        if !listener.wait_timeout(Duration::from_secs(10)) {
            break;
        }
    }

    Ok(())
}

I get a reliable error on activation (subsequent calls ok while the service is running):

$ busctl call org.freedesktop.usbredir1 /org/freedesktop/usbredir1 org.freedesktop.usbredir1 OpenBusDev yy 1 59
Call failed: Unknown object '/org/freedesktop/usbredir1'

Interestingly, if I modify zbus to request the name after the server is setup, I don't get the race or the issue or the timeout from #68.

hmm.. could you check if making ObjectServer not on-demand helps?

In GitLab by @elmarco on Sep 20, 2021, 23:07

Since the server task is started immediately, even before we register a path (or a service name, with the modification I tried), there are always chances that we loose messages there. (with the sync API, I didn't have this problem, because the ordering was more in control, and we have/had queues iirc).

Maybe we need to a way to start/stop the server task...

Since the server task is started immediately, even before we register a path (or a service name, with the modification I tried), there are always chances that we loose messages there.

Hmm.. true.

with the sync API, I didn't have this problem, because the ordering was more in control, and we have/had queues iirc).

You would definitely have this problem if I hadn't insisted on having the request_name API on ObjectServer instead of Connection.

Maybe we need to a way to start/stop the server task...

maybe and I'm actually thinking about that due to other reasons, but here maybe the solution is to go back to request_name on ObjectServer.

In GitLab by @elmarco on Sep 21, 2021, 10:39

with the sync API, I didn't have this problem, because the ordering was more in control, and we have/had queues iirc).

You would definitely have this problem if I hadn't insisted on having the request_name API on ObjectServer instead of Connection.

There is no point denying it was simpler to get it working reliably when everything was sync. The code has evolved since then, you are more familiar than I am with the async stuff. I don't recall you had to ~~insist~~ convince me much on that request_name change anyway.

Since the receiver task starts early, chances of losing messages are still high, regardless of when request_name or object server is setup.

Either everything must be setup before starting the task, or we need some queuing somewhere. I think I would rather have 1.

Maybe we need to a way to start/stop the server task...

maybe and I'm actually thinking about that due to other reasons, but here maybe the solution is to go back to request_name on ObjectServer.

Please explain, what race are you solving that way.

There is no point denying it was simpler to get it working reliably when everything was sync

68 was filed long long before the recent async changes, you know. This recent regression was not caused by the changes you really want to blame. This was caused by dac50765a6d1d02b360e2c7a3746d1308ff216d0, and (as you can read in the commit log) I had mistakenly assumed that it'll be fixed back again after change to run ObjectServer internally in a task (i-e within the same MR).

So while you may be right about how things used to be simpler in your days, I don't see how the async changes are the culprit here. If you say those changes make it harder to solve this, maybe I could agree there but they're not the cause.

Since the receiver task starts early, chances of losing messages are still high, regardless of when request_name or object server is setup.

If name is requested after the service API is setup, there is no chance of missing messages. We even got rid of the overflow to ensure that in general. The only thing left that can still miss messages is property caching code, that I'm trying to also fix (!383).

I don't recall you had to ~~insist~~ convince me much on that request_name change anyway.

https://gitlab.freedesktop.org/dbus/zbus/-/merge_requests/332#note_974621

That was much much longer discussion than I wanted.

In GitLab by @elmarco on Sep 21, 2021, 14:16

https://gitlab.freedesktop.org/dbus/zbus/-/merge_requests/332#note_974621

That was much much longer discussion than I wanted.

Ok, but please don't "insist" but demonstrate: we didn't nail it apparently. So the discussion goes on.

In GitLab by @danieldg on Sep 21, 2021, 15:51

I would consider using request_name before setting up the ObjectServer to always be incorrect; request_name even references #68 in its code, though I don't see the referenced warning in the docs. I think this means that the prototype of that function should be &self and not self, to discourage its incorrect use in a builder pattern.

@danieldg Yeah, agreed. Also moving it back to the ObjectServer, we can panic (or emit warnings once we've logging) if request_name is used before any at calls from the user (will need to be careful about excluding ObjectServer's own at usage though).

In GitLab by @elmarco on Sep 21, 2021, 17:40

You could still have such warning, even if the request/release name methods remains on Connection. I still don't agree that should be ObjectServer methods, and you have demonstrated that it fits in Connection so far.

Furthermore, it's completely valid to have at() being called after request name is called, as you may want to have more objects added during runtime.

Instead, I think we should have logging/warning for dropping incoming messages, yes!

You could still have such warning, even if the request/release name methods remains on Connection. I still don't agree that should be ObjectServer methods, and you have demonstrated that it fits in Connection so far.

I guess if the methods don't implement the builder pattern (as @danieldg pointed out) and we change the book's server chapter to not mention requesting name before showing how to expose their interfaces, it'd already help avoid this issue.

Furthermore, it's completely valid to have at() being called after request name is called, as you may want to have more objects added during runtime.

Right, it's valid. I was mostly just thinking out loud about the panic part. :)

In GitLab by @elmarco on Sep 21, 2021, 21:15

Another painful similar case (while converting code that used to work, ..)

In p2p (not taking service name, of course), the msg receiver task starts immediately after build(), before I can setup the object server.

In GitLab by @danieldg on Sep 21, 2021, 21:30

Sounds like having the object server available in the builder would be a good idea.

Another painful similar case (while converting code that used to work, ..)

perhaps you should add good test cases for those so I'd know when I'm about to cause you pain?

In p2p (not taking service name, of course), the msg receiver task starts immediately after build(), before I can setup the object server.

The task only starts after the first time you access the ObjectServer but of course currently you can only add one interface at a time and the task doesn't exactly wait for the first interface to be added either.

Sounds like having the object server available in the builder would be a good idea.

I think it's best what @elmarco suggested earlier here: explicit API to start (and maybe also stop) the task. Just thinking out loud: maybe give the task handle to the user so they can either detach() it (which will give them mostly the current behavior), .await it or do other things like join'ing it or select'ing it with other futures (think timeout).

mentioned in commit 9f32190e83f51ecc9c9b33ddac00e532d4633557

How about this plan:

blocking::Task
- async_task::Task wrapper
- cancel()
- detach()
- wait()
- impl Deref
- impl From and Into async_task::Task
ObjectServer::start() -> async_task::Task
blocking::ObjectServer::start() -> blocking::Task

I think that'd be a good starting point and will allow us to add future API later as needed w/o breaking API. e.g wait_timeout() when we've the mechanism in place to monitor ObjectServer traffic.

dbus2 / zbus-old

Unknown object on service activation #215