monitoring components for segmentation/CPU faults

As described on the mailing list [0], I'm investigating a way to detect when a component crashes due to a segmentation or CPU fault.

The test in test-fault_detection already demonstrates how this can be done for PD sessions (segmentation faults).

My first approach [1] was to integrate this feature into init, so that all components can be monitored without any changes to them. This approach fits nicely with our scenario, in which we already have heartbeat_monitors (for some inits) that can restart unresponsive components in an init by observing its state report and updating its configuration accordingly.

@chelmuth rightfully raised the concern of whether it is worth increasing the complexity of such a vital component of the system, as most faulty components are probably ported POSIX applications. I very much like his idea of integrating such detection into the libc.

My initial failing component was a faulty VFS plugin that caused a segmentation fault in a non-entrypoint thread. As the entrypoint of the VFS component was still alive, init couldn't detect that the component was no longer working. This case would require another solution. I could imagine a slimmed-down gdb_monitor that only monitors the CPU and PD sessions for faults and passes through everything else. The main disadvantage I see in this approach is that an additional component is needed for each component that needs monitoring. On the other hand, this should rarely be needed, so I would deem it a suitable compromise.
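
To make the difference between the two observation points concrete, here is a minimal, self-contained sketch in plain C++. It is only a model of the scenario and does not use the Genode API: `Thread_state`, `Component`, `entrypoint_responsive`, `faulted_threads`, and `example_vfs` are hypothetical names made up for illustration. It assumes that a heartbeat-style check can only see whether the entrypoint still responds, whereas a monitor sitting between the component and its CPU and PD sessions could see a fault of any thread.

```cpp
#include <string>
#include <vector>

// Hypothetical model (not Genode API): one thread as a monitor might see it.
struct Thread_state {
    std::string name;
    bool        entrypoint;  // true for the component's entrypoint thread
    bool        faulted;     // true once the thread raised a CPU or page fault
};

// Hypothetical model of a monitored component and its threads.
struct Component {
    std::string               name;
    std::vector<Thread_state> threads;
};

// View of a heartbeat-style check: only the entrypoint is observed.
bool entrypoint_responsive(Component const &c)
{
    for (Thread_state const &t : c.threads)
        if (t.entrypoint)
            return !t.faulted;
    return false;  // no entrypoint found, treat as unresponsive
}

// View of a monitor that interposes the CPU and PD sessions:
// it can report the fault of any thread, entrypoint or not.
std::vector<std::string> faulted_threads(Component const &c)
{
    std::vector<std::string> result;
    for (Thread_state const &t : c.threads)
        if (t.faulted)
            result.push_back(t.name);
    return result;
}

// The scenario from above: the entrypoint is fine, a plugin thread faulted.
Component example_vfs()
{
    return Component { "vfs", { { "ep",            true,  false },
                                { "plugin_worker", false, true  } } };
}
```

For `example_vfs()`, `entrypoint_responsive` still reports the component as alive, while `faulted_threads` returns the name of the faulted plugin thread. That is the gap a pass-through monitor would have to close.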

[0] https://lists.genode.org/pipermail/users/2022-July/008029.html
[1] https://github.com/trimpim/genode/tree/sandbox-fault_detection

genodelabs / genode

monitoring components for segmentation/CPU faults #4571