Bluesky Crash: Scaler PV Readout Problem

qzhang234 commented 3 years ago

Bluesky crashed at 05:48 am on 01/21 after ~ 12 hours of continuous operation (11,155 out of 20,000 measurements).

The crash is caused by 8idi:scaler1.CNT. This PV is not used by DM workflow to the best of my knowledge.

The error message from the terminal is attached.

Error message from Bluesky

``` HDF5 workflow file name: /home/8ididata/2021-1/demo202101/F107_11154_att00_Test/F107_11154_att00_Test_0001-100000.hdf F107_11155_att00_Test HDF5 workflow file name: /home/8ididata/2021-1/demo202101/F107_11155_att00_Test/F107_11155_att00_Test_0001-100000.hdf F107_11156_att00_Test --------------------------------------------------------------------------- FailedStatus Traceback (most recent call last) in ----> 1 RE(multi_capil()) ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/run_engine.py in __call__(self, *args, **metadata_kw) 805 self._task_fut.add_done_callback(set_blocking_event) 806 --> 807 self._resume_task(init_func=_build_task) 808 809 if self._interrupted: ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/run_engine.py in _resume_task(self, init_func) 929 if (exc is not None 930 and not isinstance(exc, _RunEnginePanic)): --> 931 raise exc 932 933 def install_suspender(self, suspender): ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/run_engine.py in _run(self) 1498 exit_reason = str(err) 1499 self.log.exception("Run aborted") -> 1500 raise err 1501 finally: 1502 if not exit_reason: ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/run_engine.py in _run(self) 1363 else: 1364 try: -> 1365 msg = self._plan_stack[-1].send(resp) 1366 # We have exhausted the top generator 1367 except StopIteration: ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in __call__(self, plan) 1305 plan = monitor_during_wrapper(plan, self.monitors) 1306 plan = baseline_wrapper(plan, self.baseline) -> 1307 return (yield from plan) 1308 1309 ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in baseline_wrapper(plan, devices, name) 1160 return (yield from plan) 1161 else: -> 1162 return (yield from plan_mutator(plan, insert_baseline)) 1163 1164 ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 168 continue 169 else: --> 170 raise ex 171 # if inserting / mutating, put new generator on the stack 172 # and replace the current msg with the first element from the ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 121 ret = result_stack.pop() 122 try: --> 123 msg = plan_stack[-1].send(ret) 124 except StopIteration as e: 125 # discard the exhausted generator ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in monitor_during_wrapper(plan, signals) 801 plan1 = plan_mutator(plan, insert_after_open) 802 plan2 = plan_mutator(plan1, insert_before_close) --> 803 return (yield from plan2) 804 805 ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 168 continue 169 else: --> 170 raise ex 171 # if inserting / mutating, put new generator on the stack 172 # and replace the current msg with the first element from the ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 121 ret = result_stack.pop() 122 try: --> 123 msg = plan_stack[-1].send(ret) 124 except StopIteration as e: 125 # discard the exhausted generator ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 168 continue 169 else: --> 170 raise ex 171 # if inserting / mutating, put new generator on the stack 172 # and replace the current msg with the first element from the ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 121 ret = result_stack.pop() 122 try: --> 123 msg = plan_stack[-1].send(ret) 124 except StopIteration as e: 125 # discard the exhausted generator ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in fly_during_wrapper(plan, flyers) 859 plan1 = plan_mutator(plan, insert_after_open) 860 plan2 = plan_mutator(plan1, insert_before_close) --> 861 return (yield from plan2) 862 863 ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 168 continue 169 else: --> 170 raise ex 171 # if inserting / mutating, put new generator on the stack 172 # and replace the current msg with the first element from the ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 121 ret = result_stack.pop() 122 try: --> 123 msg = plan_stack[-1].send(ret) 124 except StopIteration as e: 125 # discard the exhausted generator ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 168 continue 169 else: --> 170 raise ex 171 # if inserting / mutating, put new generator on the stack 172 # and replace the current msg with the first element from the ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 121 ret = result_stack.pop() 122 try: --> 123 msg = plan_stack[-1].send(ret) 124 except StopIteration as e: 125 # discard the exhausted generator /home/beams10/8IDIUSER/bluesky_data/2021/2021-1/bp_Demo_Rigaku.py in multi_capil() 93 z0 = 0 94 ---> 95 yield from repeat_acq(sample_name='Test',att_value=0,acq_rep=20000,samx_cen=0,samz_cen=0) 96 /home/beams10/8IDIUSER/bluesky_data/2021/2021-1/bp_Demo_Rigaku.py in repeat_acq(areadet, file_header, acq_rep, sample_name, Temp_list, att_value, FLY_YES_NO, samx_cen, samz_cen) 69 print(acq_name) 70 ---> 71 yield from AD_Acquire( 72 rigaku, 73 acquire_time=19.00e-6, ~/.ipython-bluesky/profile_bluesky/startup/instrument/plans/xpcs_acquire.py in AD_Acquire(areadet, file_name, acquire_time, acquire_period, num_images, path, submit_xpcs_job, atten, md) 317 318 logger.info("calling full_acquire_procedure()") --> 319 return (yield from full_acquire_procedure(md=md)) ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/utils.py in dec_inner(*inner_args, **inner_kwargs) 1125 plan = gen_func(*inner_args, **inner_kwargs) 1126 plan = wrapper(plan, *args, **kwargs) -> 1127 return (yield from plan) 1128 return dec_inner 1129 return dec ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in stage_wrapper(plan, devices) 950 return (yield from plan) 951 --> 952 return (yield from finalize_wrapper(inner(), unstage_devices())) 953 954 ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in finalize_wrapper(plan, final_plan, pause_for_debug) 507 cleanup = True 508 try: --> 509 ret = yield from plan 510 except GeneratorExit: 511 cleanup = False ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in inner() 948 def inner(): 949 yield from stage_devices() --> 950 return (yield from plan) 951 952 return (yield from finalize_wrapper(inner(), unstage_devices())) ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/utils.py in dec_inner(*inner_args, **inner_kwargs) 1125 plan = gen_func(*inner_args, **inner_kwargs) 1126 plan = wrapper(plan, *args, **kwargs) -> 1127 return (yield from plan) 1128 return dec_inner 1129 return dec ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in monitor_during_wrapper(plan, signals) 801 plan1 = plan_mutator(plan, insert_after_open) 802 plan2 = plan_mutator(plan1, insert_before_close) --> 803 return (yield from plan2) 804 805 ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 76 # if we have a stashed exception, pass it along 77 try: ---> 78 msg = plan_stack[-1].throw(exception) 79 except StopIteration as e: 80 # discard the exhausted generator ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 76 # if we have a stashed exception, pass it along 77 try: ---> 78 msg = plan_stack[-1].throw(exception) 79 except StopIteration as e: 80 # discard the exhausted generator ~/.ipython-bluesky/profile_bluesky/startup/instrument/plans/xpcs_acquire.py in full_acquire_procedure(md) 279 logger.debug("metadata = %s", _md) 280 # start autocount on the scaler --> 281 yield from bps.mv(scaler1.count, "Count") 282 logger.info("scaler should be autocounting now") 283 ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/plan_stubs.py in mv(group, *args, **kwargs) 256 ret = yield Msg('set', obj, val, group=group, **kwargs) 257 status_objects.append(ret) --> 258 yield Msg('wait', None, group=group) 259 return tuple(status_objects) 260 ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 193 try: 194 # yield out the 'current message' and collect the return --> 195 inner_ret = yield msg 196 except GeneratorExit: 197 # special case GeneratorExit. We must clean up all of our plans ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 193 try: 194 # yield out the 'current message' and collect the return --> 195 inner_ret = yield msg 196 except GeneratorExit: 197 # special case GeneratorExit. We must clean up all of our plans ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 193 try: 194 # yield out the 'current message' and collect the return --> 195 inner_ret = yield msg 196 except GeneratorExit: 197 # special case GeneratorExit. We must clean up all of our plans ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 193 try: 194 # yield out the 'current message' and collect the return --> 195 inner_ret = yield msg 196 except GeneratorExit: 197 # special case GeneratorExit. We must clean up all of our plans ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 193 try: 194 # yield out the 'current message' and collect the return --> 195 inner_ret = yield msg 196 except GeneratorExit: 197 # special case GeneratorExit. We must clean up all of our plans ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 193 try: 194 # yield out the 'current message' and collect the return --> 195 inner_ret = yield msg 196 except GeneratorExit: 197 # special case GeneratorExit. We must clean up all of our plans ~/.conda/envs/bluesky_2021_1/lib/python3.8/site-packages/bluesky/preprocessors.py in plan_mutator(plan, msg_proc) 193 try: 194 # yield out the 'current message' and collect the return --> 195 inner_ret = yield msg 196 except GeneratorExit: 197 # special case GeneratorExit. We must clean up all of our plans FailedStatus: Status(obj=EpicsSignal(read_pv='8idi:scaler1.CNT', name='scaler1_count', parent='scaler1', value=0, timestamp=1611229690.105904, auto_monitor=True, string=False, write_pv='8idi:scaler1.CNT', limits=False, put_complete=False), done=True, success=False) In [6]: ```

Also attached is a screenshot from the DM workflow monitor:

qzhang234 commented 3 years ago

I have a question: is it possible to introduce an additional attribute in EpicsSignal class called 'Essential'? If a PV is listed as non-essential, that PV's disconnection/malfunction WILL NOT crash a scan. As useful as scalars are, I don't think not being able to ping a scale PV 1 out of 10,000 times should be the cause of a long-term stability problem.

This will also allow staff scientists at different facilities to customize their own Bluesky install so that it hits everyone's own sweet spot on the entire spectrum, anywhere from 'Crash Early Crash Hard' to 'Run Even If the Sky Falls'.

What do you think, Pete @prjemian ?

prjemian commented 3 years ago

8idi:scaler1.CNT is used to tell the scaler to count and detect when the scaler has finished counting. If this is not needed in the scan, why is it being called at all?

prjemian commented 3 years ago

Adding such an attribute to EpicsSignal is possible on a local level. Getting that to be used by all possible Devices that use EpicsSignal is monumental. The standard response is a question of Why is this PV causing a FailedStatus exception? That is the problem to be fixed instead of spending weeks of time implementing an essential attribute.

So, I do not believe that adding such an attribute is the proper way to fix this problem.

qzhang234 commented 3 years ago

I totally agree. I'm starting to ask the question myself as WHY is the scaler even being called in the first place. I believe we (or more like I) need to go over AD_Acquire line by line to make sure I know EXACTLY what PV's are called and remove the ones that are not necessary.

That should provide a more fundamental solution and also improve my Bluesky debugging skills.

qzhang234 commented 3 years ago

This is the line that caused the crash. It's consistent with the error in the screen output: https://github.com/aps-8id-dys/ipython-8idiuser/blob/ad96dd733f2fbb478bbb1156cb683fea92475731/profile_bluesky/startup/instrument/plans/xpcs_acquire.py#L281 And this is the PV in the crash line. It's literally the 'Count' button on the scaler. None of the PVs on scaler1 is essential for the analysis nor grabbed by DM Workflow.

qzhang234 commented 3 years ago

For future debugging, I would like to propose the following:

Run Bluesky;
Identify PV that causes the crash;
If PV is non-essential / non-changing for analysis, remove from monitor list / copy from pre-made json list at the beginning of measurement;
Keep iterating till the plan does not crash / the plan crashes on an essential PV.

I think I now know Bluesky well enough to identify which line causes the crash, but I don't know Bluesky well enough to remove the corresponding PV in a clean manner. @prjemian Could you teach me?

prjemian commented 3 years ago

Sent you a Teams invite for the morning at 10 am.

qzhang234 commented 3 years ago

Close this for now. Will wait for AD_Acquire for Rigaku to complete and resume Bluesky stability test. If the problem reoccurs then will submit another issue.

aps-8id-dys / ipython-8idiuser

Bluesky Crash: Scaler PV Readout Problem #250