aps-8id-dys / ipython-8idiuser

8-ID-I ipython configuration for bluesky (and other)

Bluesky Hung after ~ 750 Rigaku Measurements #173

Closed: qzhang234 closed this issue 4 years ago

qzhang234 commented 4 years ago

The Bluesky plan hung at 02:25 am. I checked the status and everything looks fine, so I can't figure out why it hung. @prjemian Would you mind taking a quick look at the record from that time to see what happened?

Also, @sureshnaps, the large number of job failures on the DM Workflow is because the .bin files are so large that DM timed out waiting for them to show up on /home/8-id-i-stage. The files are still transferred by robocopy; it just takes longer. However, this looks like a performance bottleneck: if sparsification runs much faster than robocopy and the files keep piling up, they will eventually fill up the SSD and crash the workflow. Is there any way we could improve that transfer speed, or get NX web to run on the Rigaku PC so that it can save directly to /home/8-id-i? (A rough transfer-check sketch follows the log below.)

Thanks,

Error message:

```
folder ============= K048_D20_att00_35p00C_0749
fname ============= K048_D20_att00_35p00C_0749_00001-100000.bin
K048_D20_att00_35p00C_0749/K048_D20_att00_35p00C_0749_00001-100000.bin
I Mon-02:25:11 - DM workflow kickoff done
I Mon-02:25:11 - b'id=6da31dc0-17eb-4c69-b26a-fe45b6d698bb owner=8idiuser status=pending startTime=1592810711.08 startTimestamp=2020/06/22 02:25:11 CDT \n'
New stream: 'primary'
+-----------+------------+
|   seq_num |       time |
+-----------+------------+
|         1 | 02:25:17.6 |
D Mon-02:25:17 - after count()
HDF5 workflow file name: /home/8-id-i/2020-2/comm202006/K048_D20_att00_35p00C_0749/K048_D20_att00_35p00C_0749_0001-100000.hdf
+-----------+------------+
generator raster_acq ['3d3e7c05'] (scan num: 8045)
I Mon-02:25:17 - DM workflow kickoff starting: analysis:True file:/home/8-id-i/2020-2/comm202006/K048_D20_att00_35p00C_0749/K048_D20_att00_35p00C_0749_0001-100000.hdf
K048_D20_att00_35p00C_0750
I Mon-02:25:17 - AD_Acquire starting
I Mon-02:25:17 - file_path = /home/8-id-i/2020-2/comm202006/K048_D20_att00_35p00C_0750/
I Mon-02:25:17 - calling full_acquire_procedure()
D Mon-02:25:18 - before update_metadata_prescan()
I Mon-02:25:18 - detNum=46, det_pars={'ccdHardwareColSize': 512, 'ccdHardwareRowSize': 1024, 'ccdxsense': 1, 'ccdzsense': 1, 'harmonic': 1, 'dpix': 0.076, 'saturation': 3, 'adupphot': 1.0, 'efficiency': 1.0, 'gain': 1, 'blemish': 1, 'flatfield': 0, 'distortion': 0, 'parasitic': 0, 'lld': 0, 'manufacturer': 'RIGAKU500K_NoGap'}
D Mon-02:25:18 - after update_metadata_prescan()
D Mon-02:25:18 - supplied metadata = {'ARun_number': 'K048_D20_att00_35p00C_0750', 'plan_args': {'detector_name': 'rigaku', 'acquire_time': 0.1, 'acquire_period': 0.11, 'num_images': 100000, 'file_name': 'K048_D20_att00_35p00C_0750', 'submit_xpcs_job': 'True', 'path': '/home/8-id-i/2020-2/comm202006/'}}
D Mon-02:25:18 - file_name = K048_D20_att00_35p00C_0750
D Mon-02:25:18 - file_path = /home/8-id-i/2020-2/comm202006/K048_D20_att00_35p00C_0750/
D Mon-02:25:18 - metadata = {'file_name': 'K048_D20_att00_35p00C_0750', 'file_path': '/home/8-id-i/2020-2/comm202006/K048_D20_att00_35p00C_0750/', 'ARun_number': 'K048_D20_att00_35p00C_0750', 'plan_args': {'detector_name': 'rigaku', 'acquire_time': 0.1, 'acquire_period': 0.11, 'num_images': 100000, 'file_name': 'K048_D20_att00_35p00C_0750', 'submit_xpcs_job': 'True', 'path': '/home/8-id-i/2020-2/comm202006/'}}
I Mon-02:25:18 - scaler should be autocounting now
D Mon-02:25:18 - before count()
Transient Scan ID: 8046     Time: 2020-06-22 02:25:18
Persistent Unique Scan ID: 'e786fad0-0d4c-49e3-9cbe-56d8e9dbdd2a'
New stream: 'baseline'
New stream: 'aps_current_monitor'
New stream: 'Timebase_monitor'
New stream: 'pind1_monitor'
New stream: 'pind2_monitor'
New stream: 'Atten1_monitor'
New stream: 'Atten2_monitor'
New stream: 'lakeshore_loop3_temperature_monitor'
New stream: 'lakeshore_loop3_target_monitor'
folder ============= K048_D20_att00_35p00C_0750
fname ============= K048_D20_att00_35p00C_0750_00001-100000.bin
K048_D20_att00_35p00C_0750/K048_D20_att00_35p00C_0750_00001-100000.bin
I Mon-02:25:18 - DM workflow kickoff done
I Mon-02:25:18 - b'id=3512b91b-8d52-407a-811d-8a77ceac10d8 owner=8idiuser status=pending startTime=1592810718.15 startTimestamp=2020/06/22 02:25:18 CDT \n'
Suspending....To get prompt hit Ctrl-C twice to pause.
Suspension occurred at 2020-06-22 09:40:44.
Justification for this suspension:
Signal aps_current = -0.005005044270668009 is below 2
A 'deferred pause' has been requested. The RunEngine will pause at the next checkpoint. To pause immediately, hit Ctrl+C again in the next 10 seconds.
Deferred pause acknowledged. Continuing to checkpoint.
```
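For reference, one way to tolerate the slower robocopy transfer is to poll the staging directory until the .bin file exists and its size stops growing, rather than relying on a fixed short timeout. This is only a minimal sketch, not the actual DM workflow code; the helper name, path, timeout, and settle interval are hypothetical.

```python
# Hypothetical sketch (not the actual DM workflow code): poll for a large
# .bin file to finish transferring before kicking off sparsification.
# The path, timeout, and settle interval below are illustrative assumptions.
import os
import time

def wait_for_complete_file(path, timeout_s=1800, settle_s=5):
    """Return True once `path` exists and its size stops changing."""
    deadline = time.monotonic() + timeout_s
    last_size = -1
    while time.monotonic() < deadline:
        if os.path.exists(path):
            size = os.path.getsize(path)
            if size > 0 and size == last_size:
                return True      # size stable: robocopy likely finished
            last_size = size
        time.sleep(settle_s)
    return False                 # timed out, as DM apparently did here

# Example call (hypothetical staging path):
# ok = wait_for_complete_file(
#     "/home/8-id-i-stage/K048_D20_att00_35p00C_0750/"
#     "K048_D20_att00_35p00C_0750_00001-100000.bin"
# )
```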
prjemian commented 4 years ago

It says:

```
I Mon-02:25:18 - b'id=3512b91b-8d52-407a-811d-8a77ceac10d8 owner=8idiuser status=pending startTime=1592810718.15 startTimestamp=2020/06/22 02:25:18 CDT \n'
Suspending....To get prompt hit Ctrl-C twice to pause.
Suspension occurred at 2020-06-22 09:40:44.
Justification for this suspension:
Signal aps_current = -0.005005044270668009 is below 2
A 'deferred pause' has been requested. The RunEngine will pause at the next checkpoint. To pause immediately, hit Ctrl+C again in the next 10 seconds.
Deferred pause acknowledged. Continuing to checkpoint.
```

When I looked at the Storage Ring History (which included the time of the report), the chart did not indicate the storage ring current had dropped to zero (or slightly below). Did the scan continue when the reported current rose above the restart threshold (I believe 10 mA)?
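For context, the quoted messages are the normal output of a Bluesky floor suspender watching the storage-ring current. Below is a minimal sketch of how such a suspender is typically installed; the suspend threshold of 2, resume threshold of 10, and the extra sleep are assumptions taken from this thread, not necessarily the instrument's actual configuration.

```python
# Sketch of a Bluesky suspender on the APS storage-ring current.
# Thresholds (suspend below 2, resume above 10) are assumptions from this
# thread; the beamline's actual startup configuration may differ.
from bluesky import RunEngine
from bluesky.suspenders import SuspendFloor
from ophyd import EpicsSignalRO

RE = RunEngine({})
aps_current = EpicsSignalRO("S:SRcurrentAI", name="aps_current")

suspender = SuspendFloor(
    aps_current,
    2,                  # suspend when the signal drops below 2
    resume_thresh=10,   # resume only after it recovers above 10
    sleep=600,          # optional extra wait after recovery, in seconds
)
RE.install_suspender(suspender)
```

With `resume_thresh` set, the RunEngine resumes on its own once the signal recovers above the higher threshold, which is why it matters whether the scan continued after the reading came back up.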

prjemian commented 4 years ago

The report about the large number of job failures on the DM Workflow belongs in a separate issue.

prjemian commented 4 years ago

Since it does not look as if the beam dumped at 2:25 AM on Monday morning, I wonder whether this was a one-time transient occurrence or whether it will happen again. If it happens again, we must follow up with the group that provides the S:SRcurrentAI PV, which supplies our ophyd signal aps_current. A rough monitoring sketch is below.
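If it does recur, a standalone monitor on that PV would give us a timestamped record to show the PV's providers. Here is a minimal sketch using ophyd; the glitch threshold and log file name are illustrative assumptions, while the PV name and signal name come from the log above.

```python
# Sketch: record any transient dips reported by the S:SRcurrentAI PV so a
# repeat occurrence can be documented.  The threshold (2, matching the
# suspender message above) and the log file name are assumptions.
import logging
import time

from ophyd import EpicsSignalRO

logging.basicConfig(
    filename="aps_current_glitches.log",   # hypothetical log file
    format="%(asctime)s %(message)s",
    level=logging.INFO,
)

aps_current = EpicsSignalRO("S:SRcurrentAI", name="aps_current")

def note_glitch(value=None, old_value=None, **kwargs):
    """Log any reading below 2 (the suspend threshold seen in the log)."""
    if value is not None and value < 2:
        logging.info("aps_current dipped to %s (was %s)", value, old_value)

aps_current.subscribe(note_glitch)   # called on every value update

if __name__ == "__main__":
    while True:
        time.sleep(60)   # keep the process alive so monitors keep firing
```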

Until then, I will close this, assuming it was a freak occurrence.