NSLS-II / Bug-Reports

Unified issue-tracker for bugs in the data acquisition, management, and analysis software at NSLS-II
BSD 3-Clause "New" or "Revised" License
2 stars 5 forks source link

Cannot retrieve scan data due to forced abort() #121

Closed ambarb closed 8 years ago

ambarb commented 8 years ago

@tacaswell @danielballan A user initiated a scan to collect 7200. They then decided to stop the scan. When the command prompt returned, they typed stop() but the camera was still writing data. They decided to abort(). Now they cannot access the data. The user is here until Tuesday morning. Can someone please look into this today - for them to get the data, and then why this happened at your convenience. I will try to trace back through the terminal in a bit.

Scan info:

6/19/16, 11:03 am generator ['86afec'](scan num: 65634)

danielballan commented 8 years ago

I'll stop by today to look at this.

danielballan commented 8 years ago

A user initiated a scan to collect 7200.

What kind of scan? Does 7200 refer to fccd.num_images?

I will try to trace back through the terminal in a bit.

If you can do that, I think it would help.

'86afec'

There seems to be no RunStart document in your metadatastore with this uid. Did the users kill the scan before the RunStart even came out? (Did you get an olog entry?)

ambarb commented 8 years ago

Yes, fccd.num_images. The ^C during the scan when they reach 5000 images collected. The record is in olog Aquistion. That is why I know the unique id string. I see I forgot to copy the SCAN ID # (65634)

Tried '86afec', and same errors. See below from jhub notebook

bgnd8 = DataBroker[65505] #[59111]
bgnd2 = None #DataBroker[59112] #DataBroker['165e034b-83de-4769-b80f-447194e701c5'] #None #DataBroker[59112]
bgnd1 = None # DataBroker[''] #[]
data = DataBroker['86afec'] #[59110]

tbl=get_table(data,fill='False')

tbl  #prints a table with all the readable detectors (motor positions at time of measurement)

OUTPUT is a square

stack = get_fastccd_images(data, (bgnd8, bgnd2, bgnd1), tag='fccd_image')  #fccd_image_lightfield
images = stack[0]

OUTPUT

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-e2261f830d28> in <module>()
----> 1 stack = get_fastccd_images(data, (bgnd8, bgnd2, bgnd1), tag='fccd_image')  #fccd_image_lightfield
      2 images = stack[0]

/opt/conda_envs/analysis/lib/python3.4/site-packages/csxtools/utils.py in get_fastccd_images(light_header, dark_headers, flat, gain, tag, roi, handler_override)
    127 
    128     events = _get_images(light_header, tag, roi,
--> 129                          handler_override=handler_override)
    130 
    131     # Ok, so lets return a pims pipeline which does the image conversion

/opt/conda_envs/analysis/lib/python3.4/site-packages/csxtools/utils.py in _get_images(header, tag, roi, handler_override)
    187 def _get_images(header, tag, roi=None, handler_override=None):
    188     t = ttime.time()
--> 189     images = get_images(header, tag, handler_override=handler_override)
    190     t = ttime.time() - t
    191     logger.info("Took %.3f seconds to read data using get_images", t)

/opt/conda_envs/analysis/lib/python3.4/site-packages/databroker/pims_readers.py in get_images(headers, name, handler_registry, handler_override)
     29             # do something
     30     """
---> 31     return Images(headers, name, handler_registry, handler_override)
     32 
     33 

/opt/conda_envs/analysis/lib/python3.4/site-packages/databroker/pims_readers.py in __init__(self, headers, name, handler_registry, handler_override)
     56         """
     57         events = get_events(headers, [name], fill=False)
---> 58         self._datum_uids = [event.data[name] for event in events
     59                             if name in event.data]
     60         self._len = len(self._datum_uids)

/opt/conda_envs/analysis/lib/python3.4/site-packages/databroker/pims_readers.py in <listcomp>(.0)
     56         """
     57         events = get_events(headers, [name], fill=False)
---> 58         self._datum_uids = [event.data[name] for event in events
     59                             if name in event.data]
     60         self._len = len(self._datum_uids)

/opt/conda_envs/analysis/lib/python3.4/site-packages/databroker/databroker.py in get_events(headers, fields, fill, handler_registry, handler_overrides)
    364         fields = []
    365     fields = set(fields)
--> 366     _check_fields_exist(fields, headers)
    367 
    368     for header in headers:

/opt/conda_envs/analysis/lib/python3.4/site-packages/databroker/databroker.py in _check_fields_exist(fields, headers)
    641     missing = fields - all_fields
    642     if missing:
--> 643         raise ValueError("The fields %r were not found." % missing)

ValueError: The fields {'fccd_image'} were not found.
ambarb commented 8 years ago

here is the transcript from the bluesky terminal. Their definition calls RE(ct()). The stopped it when 5000 images were taken. They were worried that they would have trouble with 7200 images because they were running out of memory and were not sure that we could solve the problem for them. That was fixed, but users will often need to stop a scan like this.

In [69]: fccd_set(1,7200)
Current fCCD config: 7200.0 images @ 1.00 Hz = 7200.0 s

In [70]: %run -i ~/Users/2016_06_Porro/2016_06_19_overnight.py
In [71]: mag_setup()

In [72]: mag_reset_meas(1,7200)
Current fCCD config: 7200.0 images @ 1.00 Hz = 7200.0 s
Transient Scan ID: 65634
Persistent Unique Scan ID: '86afecc1-d751-4010-9acb-c4dde3b0570c'
^CA 'deferred pause' has been requested. The RunEngine will pause at the next checkpoint. To pause immediately, hit Ctrl+C again in the next 10 seconds.
Deferred pause acknowledged. Continuing to checkpoint.
stop()
^C^C
Your RunEngine is entering a paused state. These are your options for changing
the state of the RunEngine:

resume()  --> will resume the scan
 abort()  --> will kill the scan with an 'aborted' state to indicate
              the scan was interrupted
  stop()  --> will kill the scan with a 'finished' state to indicate
              the scan stopped normally

Pro Tip: Next time, if you want to abort, tap Ctrl+C three times quickly.

Pausing...
OK

In [73]: stop()
Stopping...
^CA 'deferred pause' has been requested. The RunEngine will pause at the next checkpoint. To pause immediately, hit Ctrl+C again in the next 10 seconds.
Deferred pause acknowledged. Continuing to checkpoint.
^C
Your RunEngine is entering a paused state. These are your options for changing
the state of the RunEngine:

resume()  --> will resume the scan
 abort()  --> will kill the scan with an 'aborted' state to indicate
              the scan was interrupted
  stop()  --> will kill the scan with a 'finished' state to indicate
              the scan stopped normally

Pro Tip: Next time, if you want to abort, tap Ctrl+C three times quickly.

Pausing...

In [74]: abort()
Aborting....
generator ct ['86afec'] (scan num: 65634)

In [75]: olog('Measurement scan 34 aborted, temperature changed to 300K')

In [76]: fccd_set(1,60)
Current fCCD config: 60.0 images @ 1.00 Hz = 60.0 s

In [77]: ct_all_dark()

Starting procedure to acquire camera gains at 1.0Hz

Camera stopped..

Gain bit set to 2

Camera restarted..
Transient Scan ID: 65635
Persistent Unique Scan ID: '1c775526-f717-448a-9e86-d54070316cbf'
/home/xf23id1/conda_envs/collection/lib/python3.4/site-packages/bluesky/run_engine.py:1821: UserWarning: A ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='xf23id1-log.cs.nsls2.local', port=8181): Read timed out. (read timeout=30)",),) was raised during the processing of a start Document. The error will be ignored to avoid interrupting data collection. To investigate, set RunEngine.ignore_callback_exceptions = False and run again.
  "and run again." % (exc, name.name))
+-----------+------------+-------------------+-------------------+-------------------+------------+------------+------------+------------+------------+
|   seq_num |       time | fccd_stats1_total | fccd_stats2_total | fccd_stats3_total | mono_tempa | mono_tempb |  grt1_temp |  grt2_temp |    uw_temp |
+-----------+------------+-------------------+-------------------+-------------------+------------+------------+------------+------------+------------+
|         1 | 13:37:22.3 |        1112157899 |       14795198239 |       14800676775 |    302.749 |    302.749 |       31.6 |       28.8 |      85.12 |
+-----------+------------+-------------------+-------------------+-------------------+------------+------------+------------+------------+------------+
generator ct ['1c7755'] (scan num: 65635)
danielballan commented 8 years ago

OK, I understand what happened.

First of all, ignore my earlier comment, "There seems to be no RunStart document in your metadatastore with this uid." That was an unrelated problem I was having.

What Occurred

  1. The run was started. A RunStart document was emitted. It was saved in metadatastore, and we can load it from the databroker. It also triggered the creation of an olog entry. So far, so good.
  2. The detector was staged. This set a fresh HDF5 filename in the area detector and prepared it for writing.
  3. The detector was triggered. A multi-acquisition was begun. Data was progressively written into the HDF5 file.
  4. Normally, once the detector is done acquiring, an Event document is emitted and saved in metadatastore. The plan was aborted before this happened.

How this explains what you see now

This explains why tbl is an empty DataFrame. (This displays as an empty square; using print we can see a little more clearly what it is....)

In [5]: print(tbl)
Empty DataFrame
Columns: []
Index: []

The raw get_events function confirms that there is no data from this run in metadatastore --- just the RunStart document.

In [6]: len(list(get_events(h_data)))
Out[6]: 0

What next?

The data is likely salvageable. HDF5 file exists, but no record of it was created in metadatastore. I can help you retrieve it tomorrow morning before my group meeting at 10.

To avoid this problem in the future, we should eliminate the long delay between when the detector starts saving data into a file and when we emit the Event document that provides a reference to that data. This was on our radar, but this is the first time I've heard of this particular failure mode.

ambarb commented 8 years ago

Sounds great. I reproduced it just now. They have to quit stop() because that seems ineffective. Thus abort() is the only option. If it happens again, we will try selecting the done capture button while stuck in the loop that handles stop().

Transient Scan ID: 65692
Persistent Unique Scan ID: '94a4d26e-cf2c-45c7-98d9-9b4f7cfecbac'
+-----------+------------+-------------------+-------------------+-------------------+------------+------------+------------+------------+------------+
|   seq_num |       time | fccd_stats1_total | fccd_stats2_total | fccd_stats3_total | mono_tempa | mono_tempb |  grt1_temp |  grt2_temp |    uw_temp |
+-----------+------------+-------------------+-------------------+-------------------+------------+------------+------------+------------+------------+
|         1 | 15:23:54.3 |           -140604 |          -1790229 |          -1561333 |    302.756 |    302.755 |       29.2 |       28.8 |      85.38 |
+-----------+------------+-------------------+-------------------+-------------------+------------+------------+------------+------------+------------+
generator ct ['94a4d2'] (scan num: 65692)

Returned to initial gain setting and restarting acquisition
Current fCCD config: 7200.0 images @ 1.00 Hz = 7200.0 s
^C---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
/home/xf23id1/Users/2016_06_Porro/2016_06_20_overnight.py in <module>()
----> 1 multi_T(1,7200)

/home/xf23id1/Users/2016_06_Porro/2016_06_20_overnight.py in multi_T(freq, tot_time_1)
     90         #mov(temp_sp,tt)
     91         caput('XF:23ID1-ES{TCtrl:1-Out:1}T-SP',tt)
---> 92         sleep(3600)
     93         RE(ct())
     94         olog('Measurement, #{} @ {:.1f} eV, {:.2f} Hz at {:.1f} K for {} images over {:.1f} s'.format((db[-1].get('start').get('scan_id')), pgm_en.get()[1], 1./fccd.acquire_time.get(), temp.a.value,fccd.plugin_num_images.get(),fccd.plugin_num_images.get()*fccd.acquire_time.get()))

KeyboardInterrupt: 

In [184]: 
KeyboardInterrupt

In [184]: 
KeyboardInterrupt

In [184]: stop()
---------------------------------------------------------------------------
TransitionError                           Traceback (most recent call last)
/home/xf23id1/Users/2016_06_Porro/2016_06_20_overnight.py in <module>()
----> 1 stop()

/home/xf23id1/conda_envs/collection/lib/python3.4/site-packages/bluesky/global_state.py in stop()
    166     This function closes over the global RunEngine instance, gs.RE.
    167     """
--> 168     return gs.RE.stop()
    169 
    170 

/home/xf23id1/conda_envs/collection/lib/python3.4/site-packages/bluesky/run_engine.py in stop(self)
    756         """
    757         if self.state.is_idle:
--> 758             raise TransitionError("RunEngine is already idle.")
    759         print("Stopping...")
    760         self._interrupted = True

TransitionError: RunEngine is already idle.

In [185]: grabit
Select area of screen to grab .........

In [186]: fccd_set(1,7200)
Current fCCD config: 7200.0 images @ 1.00 Hz = 7200.0 s

In [187]: %run -i ~/Users/2016_06_Porro/startup.py

In [188]: RE(ct())
Transient Scan ID: 65693
Persistent Unique Scan ID: '44407fa8-9acf-4b8a-bd27-568161906628'
^CA 'deferred pause' has been requested. The RunEngine will pause at the next checkpoint. To pause immediately, hit Ctrl+C again in the next 10 seconds.
Deferred pause acknowledged. Continuing to checkpoint.
^C
Your RunEngine is entering a paused state. These are your options for changing
the state of the RunEngine:

resume()  --> will resume the scan
 abort()  --> will kill the scan with an 'aborted' state to indicate
              the scan was interrupted
  stop()  --> will kill the scan with a 'finished' state to indicate
              the scan stopped normally

Pro Tip: Next time, if you want to abort, tap Ctrl+C three times quickly.

Pausing...
Out[188]: ['44407fa8-9acf-4b8a-bd27-568161906628']

In [189]: 

In [189]: stop()
Stopping...

1 minute later
^CA 'deferred pause' has been requested. The RunEngine will pause at the next checkpoint. To pause immediately, hit Ctrl+C again in the next 10 seconds.
Deferred pause acknowledged. Continuing to checkpoint.
^C
Your RunEngine is entering a paused state. These are your options for changing
the state of the RunEngine:

resume()  --> will resume the scan
 abort()  --> will kill the scan with an 'aborted' state to indicate
              the scan was interrupted
  stop()  --> will kill the scan with a 'finished' state to indicate
              the scan stopped normally

Pro Tip: Next time, if you want to abort, tap Ctrl+C three times quickly.

Pausing...

In [190]: stop()
Stopping...

1 minute later

abort()
^CA 'deferred pause' has been requested. The RunEngine will pause at the next checkpoint. To pause immediately, hit Ctrl+C again in the next 10 seconds.
Deferred pause acknowledged. Continuing to checkpoint.
^C
Your RunEngine is entering a paused state. These are your options for changing
the state of the RunEngine:

resume()  --> will resume the scan
 abort()  --> will kill the scan with an 'aborted' state to indicate
              the scan was interrupted
  stop()  --> will kill the scan with a 'finished' state to indicate
              the scan stopped normally

Pro Tip: Next time, if you want to abort, tap Ctrl+C three times quickly.

Pausing...

In [191]: 
KeyboardInterrupt

In [191]: abort()
Aborting....
generator ct ['44407f'] (scan num: 65693)

In [192]: 
danielballan commented 8 years ago

We are working on the underlying issue with stop() as well. Should straighten all this out by the end of the week. Thanks as always for your patience.

Get Outlook for iOShttps://aka.ms/o0ukef

On Mon, Jun 20, 2016 at 5:32 PM -0400, "ambarb" notifications@github.com<mailto:notifications@github.com> wrote:

Sounds great. I reproduced it just now. They have to quit stop() because that seems ineffective. Thus abort() is the only option. If it happens again, we will try selecting the done capture button while stuck in the loop that handles stop().

Transient Scan ID: 65692 Persistent Unique Scan ID: '94a4d26e-cf2c-45c7-98d9-9b4f7cfecbac' +-----------+------------+-------------------+-------------------+-------------------+------------+------------+------------+------------+------------+ | seq_num | time | fccd_stats1_total | fccd_stats2_total | fccd_stats3_total | mono_tempa | mono_tempb | grt1_temp | grt2_temp | uw_temp | +-----------+------------+-------------------+-------------------+-------------------+------------+------------+------------+------------+------------+ | 1 | 15:23:54.3 | -140604 | -1790229 | -1561333 | 302.756 | 302.755 | 29.2 | 28.8 | 85.38 | +-----------+------------+-------------------+-------------------+-------------------+------------+------------+------------+------------+------------+ generator ct ['94a4d2'](scan num: 65692)

Returned to initial gain setting and restarting acquisition Current fCCD config: 7200.0 images @ 1.00 Hz = 7200.0 s ^C--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) /home/xf23id1/Users/2016_06_Porro/2016_06_20_overnight.py in () ----> 1 multi_T(1,7200)

/home/xf23id1/Users/2016_06_Porro/2016_06_20_overnight.py in multi_T(freq, tot_time_1) 90 #mov(temp_sp,tt) 91 caput('XF:23ID1-ES{TCtrl:1-Out:1}T-SP',tt) ---> 92 sleep(3600) 93 RE(ct()) 94 olog('Measurement, #{} @ {:.1f} eV, {:.2f} Hz at {:.1f} K for {} images over {:.1f} s'.format((db[-1].get('start').get('scan_id')), pgm_en.get()[1], 1./fccd.acquire_time.get(), temp.a.value,fccd.plugin_num_images.get(),fccd.plugin_num_images.get()*fccd.acquire_time.get()))

KeyboardInterrupt:

In [184]: KeyboardInterrupt

In [184]: KeyboardInterrupt

In [184]: stop()

TransitionError Traceback (most recent call last) /home/xf23id1/Users/2016_06_Porro/2016_06_20_overnight.py in () ----> 1 stop()

/home/xf23id1/conda_envs/collection/lib/python3.4/site-packages/bluesky/global_state.py in stop() 166 This function closes over the global RunEngine instance, gs.RE. 167 """ --> 168 return gs.RE.stop() 169 170

/home/xf23id1/conda_envs/collection/lib/python3.4/site-packages/bluesky/run_engine.py in stop(self) 756 """ 757 if self.state.is_idle: --> 758 raise TransitionError("RunEngine is already idle.") 759 print("Stopping...") 760 self._interrupted = True

TransitionError: RunEngine is already idle.

In [185]: grabit Select area of screen to grab .........

In [186]: fccd_set(1,7200) Current fCCD config: 7200.0 images @ 1.00 Hz = 7200.0 s

In [187]: %run -i ~/Users/2016_06_Porro/startup.py

In [188]: RE(ct()) Transient Scan ID: 65693 Persistent Unique Scan ID: '44407fa8-9acf-4b8a-bd27-568161906628' ^CA 'deferred pause' has been requested. The RunEngine will pause at the next checkpoint. To pause immediately, hit Ctrl+C again in the next 10 seconds. Deferred pause acknowledged. Continuing to checkpoint. ^C Your RunEngine is entering a paused state. These are your options for changing the state of the RunEngine:

resume() --> will resume the scan abort() --> will kill the scan with an 'aborted' state to indicate the scan was interrupted stop() --> will kill the scan with a 'finished' state to indicate the scan stopped normally

Pro Tip: Next time, if you want to abort, tap Ctrl+C three times quickly.

Pausing... Out[188]: ['44407fa8-9acf-4b8a-bd27-568161906628']

In [189]:

In [189]: stop() Stopping...

1 minute later ^CA 'deferred pause' has been requested. The RunEngine will pause at the next checkpoint. To pause immediately, hit Ctrl+C again in the next 10 seconds. Deferred pause acknowledged. Continuing to checkpoint. ^C Your RunEngine is entering a paused state. These are your options for changing the state of the RunEngine:

resume() --> will resume the scan abort() --> will kill the scan with an 'aborted' state to indicate the scan was interrupted stop() --> will kill the scan with a 'finished' state to indicate the scan stopped normally

Pro Tip: Next time, if you want to abort, tap Ctrl+C three times quickly.

Pausing...

In [190]: stop() Stopping...

1 minute later

abort() ^CA 'deferred pause' has been requested. The RunEngine will pause at the next checkpoint. To pause immediately, hit Ctrl+C again in the next 10 seconds. Deferred pause acknowledged. Continuing to checkpoint. ^C Your RunEngine is entering a paused state. These are your options for changing the state of the RunEngine:

resume() --> will resume the scan abort() --> will kill the scan with an 'aborted' state to indicate the scan was interrupted stop() --> will kill the scan with a 'finished' state to indicate the scan stopped normally

Pro Tip: Next time, if you want to abort, tap Ctrl+C three times quickly.

Pausing...

In [191]: KeyboardInterrupt

In [191]: abort() Aborting.... generator ct ['44407f'](scan num: 65693)

In [192]:

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/NSLS-II/Bug-Reports/issues/121#issuecomment-227276046, or mute the threadhttps://github.com/notifications/unsubscribe/ACLIrow8HGOfLTOV2q2KzJ9CvSbsKHHsks5qNwbOgaJpZM4I5yJG.

tacaswell commented 8 years ago

This has been addressed by