DiamondLightSource / mx-bluesky

Bluesky plans, plan stubs, and utilities for MX beamlines

https://diamondlightsource.github.io/mx-bluesky/

Apache License 2.0


Stop Hyperion if external interactions fail #298

Open DominicOram opened 9 months ago

DominicOram commented 9 months ago

Now that interaction with external services is moving into a separate process, we need to make sure we still handle failure modes correctly. Ideally we would queue up the non-urgent jobs and run them once the services come back online, but that is complicated. As a first step, we should stop Hyperion if any of the following fail:

- Nexus writing
- ISPyB deposition
- Triggering analysis
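
A minimal sketch of the "stop if any of these fail" behaviour, not the actual Hyperion implementation. `ExternalInteractionError` and `run_external_steps` are hypothetical names, and each external interaction is assumed to be a plain callable that raises on failure:

```python
from typing import Callable


class ExternalInteractionError(Exception):
    """Hypothetical exception signalling that Hyperion should stop."""


def run_external_steps(steps: dict[str, Callable[[], None]]) -> None:
    """Run each external interaction in order and stop at the first failure.

    `steps` maps a human-readable name (e.g. "Nexus writing") to a callable.
    Any exception is re-raised as ExternalInteractionError so the caller
    can stop instead of carrying on with potentially corrupted data.
    """
    for name, step in steps.items():
        try:
            step()
        except Exception as exc:
            # Keep the original exception as the cause for debugging.
            raise ExternalInteractionError(f"{name} failed: {exc}") from exc
```

The caller would catch `ExternalInteractionError` at the top level and abort the collection; steps after the failing one are never run.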

Acceptance Criteria

- Hyperion stops if any of these things fail
- There are tests for this
- There is an issue about how to handle it more gracefully

dperl-dls commented 9 months ago

We should, at the start of an experiment:

check if the ZMQ connection to the data service is active
- if it isn't, try to restart the service
- if that fails, stop completely
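
A minimal sketch of this start-of-experiment check, assuming that "stop completely" applies when a restart does not bring the service back. The names `ensure_data_service`, `is_alive` and `restart` are hypothetical; the real ZMQ liveness probe and restart mechanism are not shown and are passed in as callables:

```python
import time
from typing import Callable


class DataServiceUnavailable(Exception):
    """Hypothetical exception: the data service could not be brought up."""


def ensure_data_service(
    is_alive: Callable[[], bool],
    restart: Callable[[], None],
    retries: int = 1,
    wait_s: float = 0.0,
) -> None:
    """Check the data service; try to restart it; otherwise give up.

    `is_alive` stands in for whatever probes the ZMQ connection and
    `restart` for whatever restarts the service (both are assumptions).
    """
    if is_alive():
        return
    for _ in range(retries):
        restart()
        time.sleep(wait_s)  # give the service time to come up
        if is_alive():
            return
    raise DataServiceUnavailable(
        f"data service still down after {retries} restart attempt(s)"
    )
```

Injecting the probe and the restart as callables keeps the check testable without a running service.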

This should be fairly easy to handle in __main__.py using something derived from the monitoring code at https://github.com/DiamondLightSource/hyperion/blob/947_run_callbacks_in_separate_process/tests/system_tests/external_interaction/callbacks/test_external_callbacks.py

DominicOram commented 9 months ago

Is checking at the start enough? If we only check at the start, it might not be obvious that it is the data from the previous run that is potentially corrupted.

dperl-dls commented 9 months ago

Ah yeah, that's not enough. I think the simplest way to get the info back to Hyperion is for the data service to remember if it failed something and refuse to start if the last run went wrong. Otherwise we need some kind of DataserviceLivenessDevice and I don't like where that would be going...
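
One way the "remember a failure and refuse to start" idea could look, as a sketch only. It assumes the failure is persisted as a small JSON marker file; the file-based approach and the function names are hypothetical and not taken from Hyperion:

```python
import json
from pathlib import Path


class PreviousRunFailed(Exception):
    """Hypothetical exception: the last run left a failure marker behind."""


def record_failure(marker: Path, reason: str) -> None:
    """Persist the reason for a failed external interaction."""
    marker.write_text(json.dumps({"reason": reason}))


def check_can_start(marker: Path) -> None:
    """Refuse to start if the previous run recorded a failure."""
    if marker.exists():
        reason = json.loads(marker.read_text()).get("reason", "unknown")
        raise PreviousRunFailed(f"last run failed: {reason}")


def clear_failure(marker: Path) -> None:
    """Remove the marker once someone has dealt with the failure."""
    marker.unlink(missing_ok=True)
```

Because the marker outlives the process, the refusal at the next start points at the previous run's data and not at the new run.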