Beamdump caused end of list execution due to error

jilavsky commented 2 years ago

We had beam dump which caught likely BS at bad moment and this ended all runs:


D Sat-08:52:20 - gain=10000000.0  rate: 109400.0  max: 950000  converged=[True, True]
D Sat-08:52:20 - gain=100000.0  rate: 169760.0  max: 950000  converged=[True, True, True, True]
Suspending....To get prompt hit Ctrl-C twice to pause.                                                                                                                        
Suspension occurred at 2021-10-16 08:52:23.
Justification for this suspension:
Signal usaxs_CheckBeamStandard is low
Suspending....To get prompt hit Ctrl-C twice to pause.                                                                                                                        
Suspension occurred at 2021-10-16 08:52:23.
Justification for this suspension:
Signal white_beam_ready_available is low
Suspender SuspendBoolLow(Signal(name='white_beam_ready_available', parent='white_beam_ready', value=True, timestamp=1634394077.5701447), sleep=100, pre_plan=None, post_plan=None,tripped_message=) reports a return to nominal conditions. Will sleep for 100 seconds and then release suspension at 2021-10-16 09:22:57.
Suspender SuspendBoolLow(EpicsSignal(read_pv='9idcLAX:blCalc:userCalc1', name='usaxs_CheckBeamStandard', value=1.0, timestamp=1634394077.607699, tolerance=1e-05, auto_monitor=False, string=False, write_pv='9idcLAX:blCalc:userCalc1', limits=False, put_complete=False), sleep=0, pre_plan=None, post_plan=None,tripped_message=) reports a return to nominal conditions. Will sleep for 0 seconds and then release suspension at 2021-10-16 09:21:17.
E Sat-09:22:57 - Exception IllegalMessageSequence during attempt 1 of 1 of command '('USAXSscan', ['35.2', '43', '4', 'Pyush_Aged_1pct_bulk'], 22, '      USAXSscan        35.2      43      4      "Pyush_Aged_1pct_bulk"')''
subject: IllegalMessageSequence during attempt 1 of 1 of command '('USAXSscan', ['35.2', '43', '4', 'Pyush_Aged_1pct_bulk'], 22, '      USAXSscan        35.2      43      4      "Pyush_Aged_1pct_bulk"')''

date: 2021-10-16 09:22:57.605781
command file: /share1/USAXS_data/2021-10/usaxs.mac
line number: 22
command: ('USAXSscan', ['35.2', '43', '4', 'Pyush_Aged_1pct_bulk'], 22, '      USAXSscan        35.2      43      4      "Pyush_Aged_1pct_bulk"')
raw command:       USAXSscan        35.2      43      4      "Pyush_Aged_1pct_bulk"
attempt: 1 of 1
exception: Cannot bundle readings without an open run. That is, 'create' must be preceded by 'open_run'.
Stopping further processing of this command list.

Out[4]: :22:57 - memory report: pmem(rss=359682048, vms=4692082688, shared=65298432, text=2506752, lib=0, data=4062855168, dirty=0)
('f114e949-80a2-4e24-abfb-081a5773a176',
 '0cacd308-0eb5-4d8b-82ab-7db0bc2ae6fc',
 '3e7d297e-2dca-4602-9bc7-a49d605f6f55',
 'c58cd97a-9c42-4df1-bbd6-15917e6471d4',
 '2dc9bcf2-5bd8-436d-963a-700135d57a28')

this also sent this e-mail:

subject: IllegalMessageSequence during attempt 1 of 1 of command '('USAXSscan', ['35.2', '43', '4', 'Pyush_Aged_1pct_bulk'], 22, '      USAXSscan        35.2      43      4      "Pyush_Aged_1pct_bulk"')''

date: 2021-10-16 09:22:57.605781
command file: /share1/USAXS_data/2021-10/usaxs.mac
line number: 22
command: ('USAXSscan', ['35.2', '43', '4', 'Pyush_Aged_1pct_bulk'], 22, '      USAXSscan        35.2      43      4      "Pyush_Aged_1pct_bulk"')
raw command:       USAXSscan        35.2      43      4      "Pyush_Aged_1pct_bulk"
attempt: 1 of 1
exception: Cannot bundle readings without an open run. That is, 'create' must be preceded by 'open_run'.
Stopping further processing of this command list.

Looks to me that may be we have issue with suspendor suspending at wrong time?

jilavsky commented 2 years ago

ANother beam dump - we got to test this really well today. This also does not look good for continuing with data collection:

I Sat-09:35:32 - sample image file: /share1/USAXS_data/2021-10/10_16_Pyush2/10_16_Pyush2_usaxs/Pyush_Aged_1pct_top_0115.jpg                                                   

Transient Scan ID: 27     Time: 2021-10-16 09:35:32
Persistent Unique Scan ID: 'c875114d-fdf7-4ca4-a902-c0709e1ecbce'
New stream: 'baseline'
New stream: 'aps_current_monitor'
I Sat-09:35:32 - HDF5 config: /share1/AreaDetectorConfig/FlyScan_config/saveFlyData.xml
I Sat-09:35:32 - HDF5 output: /share1/USAXS_data/2021-10/10_16_Pyush2/10_16_Pyush2_usaxs/Pyush_Aged_1pct_top_0115.h5
I Sat-09:35:32 - truncating long status message: FlyScanning: Pyush_Aged_1pct_top_0115.h5
D Sat-09:35:32 - progress_reporting has arrived
I Sat-09:35:32 - flying, s    ar, deg      ay, mm       dy, mm       channel      elapsed, s 
D Sat-09:35:37 - 5.00         8.7429294    0.00713      13.15394     0            0.00       
D Sat-09:35:42 - 10.01        8.7417444    0.00207      13.13393     42           3.68       
D Sat-09:35:47 - 15.01        8.7407596    -0.00180     13.11860     79           8.70       
D Sat-09:35:52 - 20.02        8.7378478    -0.01404     13.06992     178          13.70      
D Sat-09:35:57 - 25.02        8.7304743    -0.04407     12.94925     378          18.70      
D Sat-09:36:02 - 30.03        8.7172640    -0.09842     12.73538     670          23.70      
D Sat-09:36:07 - 35.03        8.6959354    -0.18631     12.38989     1038         28.72      
D Sat-09:36:12 - 40.03        8.6657630    -0.31082     11.89987     1463         33.72      
D Sat-09:36:17 - 45.04        8.6237471    -0.47570     11.24097     1930         38.72      
D Sat-09:36:22 - 50.05        8.5707036    -0.69560     10.36091     2431         43.72      
D Sat-09:36:27 - 55.05        8.5028882    -0.97277     9.28952      2958         48.72      
D Sat-09:36:32 - 60.05        8.4191995    -1.30581     7.95428      3504         53.72      
D Sat-09:36:37 - 65.06        8.3188559    -1.71730     6.31367      4069         58.75      
D Sat-09:36:42 - 70.06        8.1995525    -2.19109     4.44965      4640         63.72      
Suspending....To get prompt hit Ctrl-C twice to pause.
Suspension occurred at 2021-10-16 09:36:42.
Justification for this suspension:
Signal usaxs_CheckBeamStandard is low
Suspending....To get prompt hit Ctrl-C twice to pause.
Suspension occurred at 2021-10-16 09:36:43.
Justification for this suspension:
Signal white_beam_ready_available is low
D Sat-09:36:47 - 75.07        8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:36:52 - 80.08        8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:36:57 - 85.09        8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:37:02 - 90.09        8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:37:07 - 95.10        8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:37:12 - 100.10       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:37:17 - 105.11       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:37:22 - 110.12       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:37:27 - 115.13       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:37:32 - 120.13       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:37:37 - 125.14       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:37:42 - 130.14       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:37:47 - 135.15       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:37:52 - 140.15       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:37:57 - 145.16       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:38:02 - 150.17       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:38:07 - 155.17       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:38:12 - 160.17       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:38:17 - 165.17       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:38:22 - 170.17       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:38:27 - 175.18       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:38:32 - 180.18       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:38:37 - 185.18       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:38:42 - 190.19       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:38:47 - 195.19       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:38:52 - 200.20       8.1844312    -2.23394     4.28169      4701         65.53      
D Sat-09:38:57 - 205.21       8.1844312    -2.23394     4.28169      4701         65.53      
I Sat-09:39:02 - 210.02       8.1844312    -2.23394     4.28169      4701         65.53      
E Sat-09:39:02 - 210.02007913589478s - progress_reporting timeout!!

This is really weird - why did the stages stop moving? Ar, Dy, and Ay both are standing still after suspension. BUT, this is epics scan, BS should not be controlling position of these stages at this time. These stages are following epcis defined trajectory and no control software should try to control them until they report done. No surprise this will fail. Expectation: When beam dump happens during Flyscan, scan is finished, data are collected and BS will pause before trying to collect next data set. But everything before we try to collect next data set should be finished, all the "decorations", cleanup etc. At this moment instrument is in unknown state at weird positions and recovery can be difficult.

Conclusion: we may need to implement our own suspender if we cannot improve on existing one.

jilavsky commented 2 years ago

Marking high priority as this will be complicated to fix, but with current APS operations this is critically needed.

jilavsky commented 2 years ago

TODO: Suspender needs to enable feedback for the100sec recovery and then set back to its prior value.

jilavsky commented 2 years ago

NOTE - suspender seems smarter than expected. After beam came back, the flyscan was re run and data were collected. And they look good. This is surprising that suspender will retry to data collection??? Weird note: dy controller had error after suspender recovered. I had to reenable it manually, even though it might be reset by BS at some point also. Not sure what caused the controller error in first place. I disabled an enabled momentarily the motor and error went away.

prjemian commented 2 years ago

That's how it should work. There is a yield from bps.checkpoint() between scans. RE rewinds after resume to the last checkpoint.

On Sat, Oct 16, 2021, 10:19 AM Jan Ilavsky @.***> wrote:

NOTE - suspender seems smarter than expected. After beam came back, the flyscan was re run and data were collected. And they look good. This is surprising that suspender will retry to data collection??? Weird note: dy controller had error after suspender recovered. I had to reenable it manually, even though it might be reset by BS at some point also. Not sure what caused the controller error in first place. I disabled an enabled momentarily the motor and error went away.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/APS-USAXS/ipython-usaxs/issues/523#issuecomment-944931800, or unsubscribe https://github.com/notifications/unsubscribe-auth/AARMUMFSX73UDB5EOHFFRHTUHGJZHANCNFSM5GDWN3SA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

jilavsky commented 2 years ago

Well, that is actually smarter than I expected and well designed. I stand corrected. We are back to only the first issue - why did the first time fail and stopped running the scans and can we prevent that next time.

prjemian commented 2 years ago

The RunEngine will stop any positioners it is running. This is a design. I need to look up the exact rule on this. If the RE was controlling the motor for a scan, then its stop() will be called.

Questions I have:

When is stop() called, exactly?
Was the RE controlling the motors that stopped at the time? I don't believe it was since these were configured in an EPICS string sequence record. Not certain about that.
Will a RE interrupter provoke this? Not certain about that. Answer: yes, a suspend will pass through this code

            # During suspend, all motors should be stopped. Call stop() on
            # every object we ever set().
            self._stop_movable_objects(success=True)

Here's the RE's _stop_movable_objects() code:

    def _stop_movable_objects(self, *, success=True):
        "Call obj.stop() for all objects we have moved. Log any exceptions."
        for obj in self._movable_objs_touched:
            try:
                stop = obj.stop
            except AttributeError:
                self.log.debug("No 'stop' method available on %r", obj)
            else:
                try:
                    stop(success=success)
                except Exception:
                    self.log.exception("Failed to stop %r.", obj)

So, we need an answer about item 2 above.

prjemian commented 2 years ago

The RE's rewind feature (via checkpoints) is a principle reason that we use bluesky plans (Python's generator functions and the yield from plan() syntax) rather than straight Python functions.

prjemian commented 2 years ago

So it is not just any movable object, but rather the membership in self._movable_objs_touched (where self refers to the RE).

prjemian commented 2 years ago

A movable is added to that list when the RE receives a MSG("set", movable, position) message. Seems inevitable.

    async def _set(self, msg):
        """
        Set a device and cache the returned status object.
        Also, note that the device has been touched so it can be stopped upon
        exit.
        Expected message object is
            Msg('set', obj, *args, **kwargs)
        where arguments are passed through to `obj.set(*args, **kwargs)`.
        """
        kwargs = dict(msg.kwargs)
        group = kwargs.pop('group', None)
        self._movable_objs_touched.add(msg.obj)
...

prjemian commented 2 years ago

Time to scratch our heads and scheme.

prjemian commented 2 years ago

The list is cleared (same file) during this code:

    def _clear_call_cache(self):
        "Clean up for a new __call__ (which may encompass multiple runs)."
        self._metadata_per_call.clear()
        self._staged.clear()
        self._objs_seen.clear()
        self._movable_objs_touched.clear()
...

and that code (_clear_call_cache()) is called when a plan is executed. That is, when calling RE(plan()) from the command line.

prjemian commented 2 years ago

At the risk of shooting ourselves in the foot (note the method starts with _ which, by convention, means it is internal and should not be called by others), we could check that list for certain movables and remove them. Since it is a Python set (not a list that can be edited in place), we'd need to change that with a revised set without the certain movables. Something like this:

RE._movable_objs_touched = set(the movables in `RE._movable_objs_touched` without ar, ay, and dy)

APS-USAXS / usaxs-bluesky-ended-2023

Beamdump caused end of list execution due to error #523