Data collection stopped with FCCD area detector and repeated error messages after abort

ambarb commented 6 years ago

Scanning the detector at 0.1s exposure with total acquire period of 0.3. 20 images per point

After the 3rd point, the data collection paused at 20%. After aborting, pressing enter gives multiple lines of the same (or similiar) errors. FCCD when back to normal state after run aborted.

Had to quit bsui.

On restarting, could not reproduce the error. But noticed that when I made the number of images per point = 1, that there were a lot of dropped points. This is okay. But is a big clue to the issue reported below. Maybe bluesky lost count at 20% for point 3 and then could never recover. Is there a way to make this progress bar more robust? Below is not highly reproducible, but it is certainly not a one off.

In [414]: fccd.hints = {'fields': ['fccd_stats4_total']}

In [415]: dets = [fccd]

In [416]: RE(count(dets))
Transient Scan ID: 82994 @ 2017/09/10 12:52:48
Persistent Unique Scan ID: '6dd81297-df82-4805-9c99-511dce1978d8'
+-----------+------------+-------------------+███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:05<00:00,  3.39images/s]
|   seq_num |       time | fccd_stats4_total |                                                                                                                                                           
+-----------+------------+-------------------+
|         1 | 12:52:54.8 |             50051 |
+-----------+------------+-------------------+
generator count ['6dd812'] (scan num: 82994)
Out[416]: ['6dd81297-df82-4805-9c99-511dce1978d8']

In [417]: RE(count(dets,num=None))
Transient Scan ID: 82995 @ 2017/09/10 12:53:36
Persistent Unique Scan ID: '092ac1d1-456a-4fb6-be27-a37030781352'
+-----------+------------+-------------------+                                                                                                                                                           
|   seq_num |       time | fccd_stats4_total |
+-----------+------------+-------------------+
|         1 | 12:53:43.4 |            191957 |
|         2 | 12:53:49.3 |             88205 |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:05<00:00,  3.39images/s]
|         3 | 12:53:55.2 |            406275 |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:05<00:00,  3.40images/s]
fccd 20%|███████████████████████████████▍                                                                                                                             | 4/20 [00:01<00:04,  3.70images/s]
  A 'deferred pause' has been requested. The RunEngine will pause at the next checkpoint. To pause immediately, hit Ctrl+C again in the next 10 seconds.
Deferred pause acknowledged. Continuing to checkpoint.
^C
Your RunEngine is entering a paused state. These are your options for changing
the state of the RunEngine:

RE.resume()    Resume the plan.
RE.abort()     Perform cleanup, then kill plan. Mark exit_stats='aborted'.
RE.stop()      Perform cleanup, then kill plan. Mark exit_status='success'.
RE.halt()      Emergency Stop: Do not perform cleanup --- just stop.

Pausing...
Out[417]: ['092ac1d1-456a-4fb6-be27-a37030781352']

In [418]: RE.abort()
Aborting: running cleanup and marking exit_status as 'abort'...
+-----------+------------+-------------------+                                                                                                                                                           
generator count ['092ac1'] (scan num: 82995)

In [419]: ERROR:ophyd.ophydobj:Subscription value callback exception (EpicsSignalWithRBV(read_pv='XF:23ID1-ES{FCCD}cam1:ArrayCounter_RBV', name='fccd_cam_array_counter', parent='fccd_cam', value=82186)
Traceback (most recent call last):
  File "/home/xf23id1/conda_envs/dev/lib/python3.6/site-packages/ophyd/ophydobj.py", line 104, in _run_sub
    cb(*args, **kwargs)
  File "/home/xf23id1/conda_envs/dev/lib/python3.6/site-packages/ophyd/areadetector/trigger_mixins.py", line 74, in _notify_watchers
    time_remaining=time_remaining)
  File "/home/xf23id1/conda_envs/dev/lib/python3.6/site-packages/bluesky/utils.py", line 1005, in update
    self.meters[pos] = meter
IndexError: list assignment index out of range
ERROR:ophyd.ophydobj:Subscription value callback exception (EpicsSignalWithRBV(read_pv='XF:23ID1-ES{FCCD}cam1:ArrayCounter_RBV', name='fccd_cam_array_counter', parent='fccd_cam', value=821870, timesta)
Traceback (most recent call last):
  File "/home/xf23id1/conda_envs/dev/lib/python3.6/site-packages/ophyd/ophydobj.py", line 104, in _run_sub
    cb(*args, **kwargs)
  File "/home/xf23id1/conda_envs/dev/lib/python3.6/site-packages/ophyd/areadetector/trigger_mixins.py", line 74, in _notify_watchers

danielballan commented 6 years ago

In this case, the progress bar is watching the ArrayCounter PV. I'm not sure how to make that more robust.

Can you clarify what you meant by, "But noticed that when I made the number of images per point = 1, that there were a lot of dropped points"?

ambarb commented 6 years ago

@danielballan RE: "But noticed that when I made the number of images per point = 1, that there were a lot of dropped points"? This was because the DONE TRIGGERING thing happened immediately. See https://github.com/NSLS-II/Bug-Reports/issues/178

ambarb commented 6 years ago

The original occurrence at the top happened two more times. On these occasiotins we did not need to restart bluesky.

Here is the plan we are running:

def single_ph_ct(sets):
    yield from ssh_close()
    yield from bp.sleep(2)
    yield from count(dets)
    olog('ScanNo{} Darks at {:.2f} eV'.format(db[-1].start['scan_id'], pgm_en.readback.value))
    yield from ssh_open()
    yield from bp.sleep(2)
    yield from count(dets,num=sets)
    olog('ScanNo{} Lights at {:.2f} eV'.format(db[-1].start['scan_id'], pgm_en.readback.value))
    yield from ssh_close()
    yield from bp.sleep(2)
    yield from count(dets)
    olog('ScanNo{} Darks at {:.2f} eV'.format(db[-1].start['scan_id'], pgm_en.readback.value))

Here is one example of the stopped scan (with no keyboard interupt)

In [37]: 

In [37]: 

In [37]: RE(single_ph_ct(20))
Transient Scan ID: 83014 @ 2017/09/10 18:43:47                                                                                                                                                           
Persistent Unique Scan ID: 'dc3b9b8d-2aec-42f1-9b95-ade00e042678'
+-----------+------------+-------------------+███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5000/5000 [03:20<00:00, 24.99images/s]
|   seq_num |       time | fccd_stats4_total |                                                                                                                                                           
+-----------+------------+-------------------+
|         1 | 18:47:07.9 |             81754 |
+-----------+------------+-------------------+
generator count ['dc3b9b'] (scan num: 83014)
Transient Scan ID: 83015 @ 2017/09/10 18:47:11                                                                                                                                                           
Persistent Unique Scan ID: '77d2332f-8ab2-4208-a0e6-29ec85f4690c'
+-----------+------------+-------------------+                                                                                                                                                           
|   seq_num |       time | fccd_stats4_total |
+-----------+------------+-------------------+
|         1 | 18:50:32.5 |            222849 |
|         2 | 18:53:52.6 |            273249 |                                                                                                                                                           
|         3 | 18:57:12.7 |            176688 |                                                                                                                                                           
|         4 | 19:00:32.7 |            208191 |                                                                                                                                                           
|         5 | 19:03:53.0 |            255491 |                                                                                                                                                           
fcA 'deferred pause' has been requested. The RunEngine will pause at the next checkpoint. To pause immediately, hit Ctrl+C again in the next 10 seconds.█████    | 4872/5000 [03:14<00:05, 25.00images/s]
Deferred pause acknowledged. Continuing to checkpoint.
^C
Your RunEngine is entering a paused state. These are your options for changing
the state of the RunEngine:

RE.resume()    Resume the plan.
RE.abort()     Perform cleanup, then kill plan. Mark exit_stats='aborted'.
RE.stop()      Perform cleanup, then kill plan. Mark exit_status='success'.
RE.halt()      Emergency Stop: Do not perform cleanup --- just stop.

Pausing...
Out[37]: 
['dc3b9b8d-2aec-42f1-9b95-ade00e042678',
 '77d2332f-8ab2-4208-a0e6-29ec85f4690c']

In [38]: RE.stop()
Stopping: running cleanup and marking exit_status as 'success'...
+-----------+------------+-------------------+                                                                                                                                                           
generator count ['77d233'] (scan num: 83015)

Both PVs monitored seperately indicated that the camera was stuck in an acquire mode: XF:23ID1-ES{FCCD}HDF1:NumCaptured_RBV XF:23ID1-ES{FCCD}cam1:ArrayCounter_RBV

Will try to capture graphically again if this happens. The attached plot of the image PVs show that the camera actually stopped. @stuwilkins do you think there is something on our side?

ambarb commented 6 years ago

@danielballan The root cause is likely something to do with our detector. However, if there is a timeout on motors, why is there not a timeout on detectors if they are not updating? During this failure mode, the FCCD is essentially frozen until we ^C^C and abort or stop. After which, the detector behaves as expected. If we were to exit the RE because of a time out, I assume that the detector would be allowed to self-trigger again (as if aborting or stopping). Am i correct?

danielballan commented 6 years ago

Note for anyone following along: we resolved this locally. Also see https://github.com/NSLS-II/ophyd/pull/444

NSLS-II / Bug-Reports

Data collection stopped with FCCD area detector and repeated error messages after abort #179