cockpit-project / cockpit


TestAccounts.testUserPasswords is flaky on arch #20609

Open martinpitt opened 4 months ago

martinpitt commented 4 months ago

The weather report shows this very prominently on arch.

On CI infra, only one additional test runs before it:

TEST_OS=arch test/verify/check-users TestAccounts.testCustomUserProperties TestAccounts.testUserPasswords -stv $RUNC

But that's not the cause. Locally I can reproduce the failure in a loop of 10 runs, but only rarely.
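For anyone who wants to repeat the "loop of 10 runs" locally, here is a minimal sketch of such a loop. The helper name `run_until_failure` and its parameters are made up for illustration; the command to run is whatever invocation you use (for example the check-users line above, with TEST_OS set through `env`).

```python
import os
import subprocess
import sys


def run_until_failure(cmd, max_runs=10, env=None):
    """Run cmd repeatedly and stop at the first failure.

    Returns the 1-based number of the first failing run, or None if all
    max_runs runs exited with status 0. (Hypothetical helper, not part of
    the Cockpit test suite.)
    """
    # Merge extra variables (e.g. {"TEST_OS": "arch"}) into the current environment.
    full_env = dict(os.environ, **(env or {}))
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd, env=full_env)
        if result.returncode != 0:
            print(f"run {attempt} failed with exit code {result.returncode}",
                  file=sys.stderr)
            return attempt
    return None
```

Stopping at the first failure keeps the failed VM state and logs from being overwritten by later runs.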

With an extra debug commit I get debug logs with passing and failing tests on CI.

In the passing log, there is a single query to logind for which users are currently active, and it gives the expected result (a "closing" state):

> debug: dbus: {"type":"ss","flags":"","timeout":5000,"call":["/org/freedesktop/login1/user/_1000","org.freedesktop.DBus.Properties","Get",["org.freedesktop.login1.User","State"]],"id":"19"}
> debug: dbus: {
  "reply": [
    [
      {
        "t": "s",
        "v": "closing"
      }
    ]
  ],
  "id": "19",
  "flags": "<",
  "type": "v"
}
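When comparing many such log entries, it can help to pull the state out of each reply mechanically. A small sketch, assuming the reply has exactly the shape shown above (a "reply" list holding one argument list with a single variant of type "s"); the function name `user_state_from_reply` is invented for this example.

```python
import json


def user_state_from_reply(message: str) -> str:
    """Extract the string value from a Properties.Get reply as logged above."""
    payload = json.loads(message)
    # "reply" holds one list of out-arguments; Get returns a single variant.
    variant = payload["reply"][0][0]
    if variant.get("t") != "s":
        raise ValueError(f"expected a string variant, got type {variant.get('t')!r}")
    return variant["v"]


# The reply from the passing log, copied from above.
passing_reply = """
{
  "reply": [[{"t": "s", "v": "closing"}]],
  "id": "19",
  "flags": "<",
  "type": "v"
}
"""
state = user_state_from_reply(passing_reply)
```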

But in the failing log there are four of these calls, presumably because the test waits much longer, and all of them still report session state "active". In other words, the loginctl terminate-user admin somehow didn't work.

In the passing log, there are countless PropertiesChanged events for /org/freedesktop/systemd1/unit/user_401000_2eservice, going from "active" to "deactivating" (multiple substates like stop-sigterm and final-sigkill to stop), and finally a UnitRemoved for it. We also get a SessionRemoved signal (which our event handler reacts to), but somehow querying logind at that point still gives the old "active" value.

Originally posted by @martinpitt in https://github.com/cockpit-project/cockpit/issues/20579#issuecomment-2172527892

martinpitt commented 4 months ago

FTR: It's not the debouncing; running the event handler every time still fails. I also checked the docs, and debounce() should run one last time after a series of events.
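To illustrate the "runs one last time after a series of events" behaviour, here is a minimal trailing-edge debounce sketch in Python. It is only a model of the semantics described above, not the debounce() implementation the page actually uses; the names `debounce` and `on_event` are chosen for this example.

```python
import threading
import time


def debounce(delay, func):
    """Return a wrapper that calls func once, delay seconds after the last call."""
    timer = None
    lock = threading.Lock()

    def wrapper(*args, **kwargs):
        nonlocal timer
        with lock:
            # A new event cancels the pending call and restarts the wait.
            if timer is not None:
                timer.cancel()
            timer = threading.Timer(delay, func, args, kwargs)
            timer.daemon = True
            timer.start()

    return wrapper


# A burst of five events in quick succession, then a quiet period.
calls = []
on_event = debounce(0.05, calls.append)
for event_number in range(5):
    on_event(event_number)
time.sleep(0.3)  # long enough for the trailing call to fire
print(calls)
```

With these semantics the handler always sees the final event of a burst, so a stale result cannot come from the last event being dropped. That matches the observation that running the handler on every event fails as well.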

But it's not a "static" failure -- after this fails, I ssh in and loginctl list-users shows

1000 admin no closing

so it's something about events.