eclipse-bluechi / bluechi

Eclipse BlueChi is a systemd service controller intended for multi-node environments with a predefined number of nodes and with a focus on highly regulated ecosystems such as those requiring functional safety.
https://bluechi.readthedocs.io/en/latest/
GNU Lesser General Public License v2.1
130 stars 37 forks source link

controller: Proactively disconnect node based on heartbeat #870

Closed ueno closed 2 months ago

ueno commented 4 months ago

This adds a couple of new options to controller: HeartbeatInterval and NodeHeartbeatThreshold, which can be used to actively disconnect node based on the last sent heartbeat. The controller periodically checks the last seen timestamp of nodes, and if it was sent before NodeHeartbeatThreshold, the controller treats it as disconnected.

Fixes: #857

ueno commented 4 months ago

This is an early draft of #857, where I have a couple of questions to clarify before writing the integration tests:

Sorry if these are too newbie questions, but I would appreciate any feedback.

engelmi commented 4 months ago

Sorry for the late reply @ueno

Sorry if these are too newbie questions, but I would appreciate any feedback.

No worries about that. We are always happy to help :)

Do we want to send D-Bus signal from the controller as well?

When we force node disconnect, we'll emit signals on the controller and the agent that it disconnected. I think these are sufficient. We could add a reason for disconnecting, but I'd consider this out of scope for this task (maybe a nice future enhancement).

In the Disconnected handler (node_disconnected), there is a conditional depending on node->name with additional logic to controller_remove_node. Can't this be consolidated with controller_remove_node?

Maybe it could be consolidated, but that could become a quite tricky change. Since it is not necessary for the issue this PR tries to implement, I'd rather evaluate and tackle it in a different issue.

The last_seen timestamp was previously recorded as seconds; now we need to record it as at least milliseconds for the new options. Should we unify the representation to struct timespec?

Makes sense to me. @mkemel What do you think?

coveralls commented 4 months ago

Coverage Status

coverage: 84.218% (-0.6%) from 84.834% when pulling e6a0f1bfc20b49675363b9185749cc5178ed58bc on ueno:wip/controller-heartbeat into 81463c4ae1a2b12090430a16f1d1aac32b7e4a43 on eclipse-bluechi:main.

engelmi commented 2 months ago

The RFE this PR targets has been implemented in #911. Closing.