As described on the mailing list [0], I'm investigating a way to detect whether a component has crashed due to a segmentation or CPU fault.
The test in `test-fault_detection` already demonstrates how this can be done for PD sessions (segmentation faults).
My first approach [1] was to integrate this feature into `init`, so that all components can be monitored without any changes to them. This approach fits nicely with our scenario, in which we already have heartbeat_monitors (for some `init` instances) that can restart unresponsive components in an `init` by observing its state report and updating its configuration accordingly.
@chelmuth rightfully raised the concern of whether it is worth increasing the complexity of such a vital component of the system, as most faulty components are probably ported POSIX applications. I very much like his idea of integrating such detection into the `libc`.
My initial failing component was a faulty VFS plugin that caused a segmentation fault in a non-entrypoint thread. As the entrypoint of the VFS component was still alive, `init` couldn't detect that the component was no longer functional.
For this, another solution would be required. I could imagine a slimmed-down `gdb_monitor` that only monitors the CPU and PD sessions for faults and passes through everything else. The main disadvantage I see in this approach is that an additional component is needed for each component that needs monitoring. On the other hand, this should rarely be needed; therefore, I would deem it a suitable compromise.
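
To make the non-entrypoint case concrete, here is a minimal sketch in plain C++ of why an entrypoint-based liveness check misses a fault in another thread, while a check over all threads catches it. It is only a model under stated assumptions: `Thread_state`, `entrypoint_responsive`, `any_thread_faulted`, and `vfs_scenario` are hypothetical names invented for illustration and do not correspond to Genode's API or to the code in [1].

```cpp
#include <string>
#include <vector>

// Hypothetical model of one thread inside a component (not a Genode type).
struct Thread_state
{
    std::string name;
    bool        entrypoint;  // true for the component's entrypoint thread
    bool        faulted;     // true once the thread hit an unhandled fault
};

// Heartbeat-style check: only the entrypoint's state is visible.
bool entrypoint_responsive(std::vector<Thread_state> const &threads)
{
    for (Thread_state const &t : threads)
        if (t.entrypoint)
            return !t.faulted;
    return false;  // no entrypoint found
}

// Fault-aware check: a fault in any thread marks the component as failed.
bool any_thread_faulted(std::vector<Thread_state> const &threads)
{
    for (Thread_state const &t : threads)
        if (t.faulted)
            return true;
    return false;
}

// Scenario from the text: the entrypoint is fine, a plugin thread faulted.
std::vector<Thread_state> vfs_scenario()
{
    return { { "ep",            true,  false },
             { "plugin_worker", false, true  } };
}
```

In this model, `entrypoint_responsive(vfs_scenario())` still reports the component as healthy, whereas `any_thread_faulted(vfs_scenario())` flags it. That is the gap a monitor watching the CPU and PD sessions would have to close.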

[0] https://lists.genode.org/pipermail/users/2022-July/008029.html
[1] https://github.com/trimpim/genode/tree/sandbox-fault_detection