Closed by qzhang234 2 years ago
Bluesky crashed at 3 pm on 05/22. Error message attached.
Same crash message again at 5 pm.
Same error message at 6 pm.
Something systematic is going on here, but it happens at random times.
Also, the Logger_Output.txt files are actually SPEC data files. The logger output files are in a subfolder.
@prjemian My apologies, you showed me the directory before. Is this the one?
Yes
[image] https://user-images.githubusercontent.com/48140482/169721381-0a474111-fa73-4d85-9220-1b374926576c.png
AutoCount may be the problem. Thinking this should be "OneShot" instead. Agree?
@prjemian I agree. I have modified the code accordingly (See code attached)
Also attached is a screenshot of the current code behavior.
Hopefully this looks good?
@prjemian Bluesky crashed again at measurement 387/20000 with the same error, despite fixing the count mode to OneShot in 8idi:scaler1. Any suggestions?
Hmmm. Let's add some diagnostics just before this line:
File "/home/beams10/8IDIUSER/.ipython-bluesky/profile_bluesky/startup/instrument/plans/xpcs_acquire.py", line 282, in full_acquire_procedure
yield from bps.mv(scaler1.count, "Count")
File "/home/beams/8IDIUSER/.conda/envs/bluesky_2022_2/lib/python3.9/site-packages/bluesky/plan_stubs.py", line 258, in mv
Here are a couple of things to print (edited):
print(f"DIAGNOSTIC ({__name__},full_acquire_procedure): scaler1.stage_sigs={scaler1.stage_sigs}")
print(f"DIAGNOSTIC ({__name__},full_acquire_procedure): scaler1._staged={scaler1._staged}")
print(f"DIAGNOSTIC ({__name__},full_acquire_procedure): scaler1.count={scaler1.count}")
print(f"DIAGNOSTIC ({__name__},full_acquire_procedure): scaler1.count_mode={scaler1.count_mode}")
@prjemian I added the diagnostics you recommended. Here's the terminal output. Does this look good?
Yes, that looks good. It's not a fix yet. Hopefully, we have better information when this happens again.
DIAGNOSTIC (instrument.plans.xpcs_acquire,full_acquire_procedure): scaler1.stage_sigs=OrderedDict([('count_mode', 'OneShot'), ('auto_count_time', 0.1)])
DIAGNOSTIC (instrument.plans.xpcs_acquire,full_acquire_procedure): scaler1._staged=Staged.yes
DIAGNOSTIC (instrument.plans.xpcs_acquire,full_acquire_procedure): scaler1.count=EpicsSignal(read_pv='8idi:scaler1.CNT', name='scaler1_count', parent='scaler1', value=1, timestamp=1653335993.658818, auto_monitor=True, string=False, write_pv='8idi:scaler1.CNT', limits=False, put_complete=False)
DIAGNOSTIC (instrument.plans.xpcs_acquire,full_acquire_procedure): scaler1.count_mode=EpicsSignal(read_pv='8idi:scaler1.CONT', name='scaler1_count_mode', parent='scaler1', value='OneShot', timestamp=1653335951.275486, auto_monitor=True, string=True, write_pv='8idi:scaler1.CONT', limits=False, put_complete=False)
@prjemian Bluesky crashed after 863 measurements with the diagnostics and the OneShot staging in place. The error message is attached.
Let's work on this scaler configuration issue without the AD_Acquire() plan.
Something like this:
```python
import time

def issue287(num=1):
    """Repeatedly count scaler1 and report the average time per count."""
    t0 = time.time()
    for i in range(num):
        print(f"iteration {i+1} of {num}")
        yield from bp.count([scaler1])
    dt = time.time() - t0
    print(f"elapsed {dt:.3f} s ({dt/num:.4f} s/iteration)")

RE(issue287(100))
```
Also, we could try not using the CA monitors when counting this scaler.
Here is an example when I run the issue287 plan above (on pearl):
In [53]: def issue287(num=1):
...: t0 = time.time()
...: for i in range(num):
...: yield from bp.count([scaler1])
...: dt = time.time() - t0
...: print(f"iteration {i+1} of {num}: elapsed {dt:.2f} ({dt/(i+1):.6f} s/iteration)")
...:
In [54]: uids = RE(issue287(1_500))
iteration 1 of 1500: elapsed 0.14 (0.143082 s/iteration)
iteration 2 of 1500: elapsed 0.28 (0.140344 s/iteration)
iteration 3 of 1500: elapsed 0.44 (0.147266 s/iteration)
iteration 4 of 1500: elapsed 0.58 (0.144830 s/iteration)
iteration 5 of 1500: elapsed 0.74 (0.148361 s/iteration)
iteration 6 of 1500: elapsed 0.88 (0.146338 s/iteration)
iteration 7 of 1500: elapsed 1.04 (0.148956 s/iteration)
iteration 8 of 1500: elapsed 1.18 (0.147759 s/iteration)
iteration 9 of 1500: elapsed 1.34 (0.148888 s/iteration)
iteration 10 of 1500: elapsed 1.48 (0.147937 s/iteration)
iteration 11 of 1500: elapsed 1.64 (0.149289 s/iteration)
iteration 12 of 1500: elapsed 1.78 (0.148284 s/iteration)
iteration 13 of 1500: elapsed 2.12 (0.163308 s/iteration)
iteration 14 of 1500: elapsed 2.26 (0.161399 s/iteration)
iteration 15 of 1500: elapsed 2.42 (0.161523 s/iteration)
iteration 16 of 1500: elapsed 2.56 (0.159936 s/iteration)
iteration 17 of 1500: elapsed 3.24 (0.190487 s/iteration)
iteration 18 of 1500: elapsed 3.38 (0.187550 s/iteration)
iteration 19 of 1500: elapsed 3.54 (0.186348 s/iteration)
iteration 20 of 1500: elapsed 3.71 (0.185709 s/iteration)
iteration 21 of 1500: elapsed 3.85 (0.183464 s/iteration)
iteration 22 of 1500: elapsed 4.02 (0.182556 s/iteration)
iteration 23 of 1500: elapsed 4.16 (0.180680 s/iteration)
iteration 24 of 1500: elapsed 4.32 (0.179806 s/iteration)
iteration 25 of 1500: elapsed 4.45 (0.177991 s/iteration)
iteration 26 of 1500: elapsed 4.58 (0.176266 s/iteration)
iteration 27 of 1500: elapsed 4.74 (0.175527 s/iteration)
iteration 28 of 1500: elapsed 4.87 (0.174056 s/iteration)
iteration 29 of 1500: elapsed 5.13 (0.177067 s/iteration)
If you notice, the average time per iteration mostly decreases, but jumps upward every so often (like a sawtooth wave). Not sure I can explain that, but I wonder if that is some kind of indication.
Here's a plot of the average time per iteration for the first 200 iterations:
Another view (with a new set of measurements) of the time between successive calls to bp.count():
This (random occurrence of taking a lot of extra time), I believe, is an indication of the problem that bluesky must be seeing. The status object fails (repeatedly on PV 8idi:scaler1.CNT) for some reason we do not yet know. There are no other EPICS PVs or signals involved. Not even any custom staging. Just the bare, out-of-the-box experience with bluesky, ophyd, and 8idi:scaler1.
By the way, I'm running an ipython session on pearl that has a minimal bluesky setup (not the full account, no databroker, no data saving either, since I do not believe that to be related to the problem we are seeing).
```python
from bluesky import plans as bp
from bluesky import plan_stubs as bps
from bluesky import RunEngine
from ophyd.scaler import ScalerCH

scaler1 = ScalerCH("8idi:scaler1", name="scaler1")
RE = RunEngine({})
```
So the scaler supplies metadata to the XPCS measurement? When it is triggered from AD_Acquire(), the values from the scaler channels are synchronous with the image. Could we completely ignore triggering the scaler and leave it in auto-count mode? This would provide (reasonably) prompt measures of "scalers and devices such as temperature" without AD_Acquire() being bothered by the status timeouts it is suffering now.
It does not solve the problem, but I wonder if we need this to be a problem at all.
@prjemian Yes, I agree that ignoring scaler1 during the staging is probably more time-efficient than getting to the bottom of this. How should we execute this idea? By commenting out this line?
Regarding the diagnostics on the counter: looking at the figure, I noticed that none of them show a readout time > 60 s, which is the limit for the timeout. Maybe the failure is more fundamental than the readout-time fluctuation shown here?
I'm thinking a bit different, that AD_Acquire should configure the scaler once (before it starts its acquisition loop) and start the scaler in auto count mode. So, whatever steps are necessary should replace that line you identified. We should also add the named channels from the scaler to the "detectors" we want to read (if they are not already there).
AD_Acquire is nested inside a for loop and is called on the order of 20,000 times. Maybe we can add the scaler staging step in the Bluesky plan, outside the for loop?
For example, right after line 22 might be a good location (please refer to the bluesky plan attached; I had to change it to .txt, otherwise it wouldn't let me upload). The loop doesn't start until line 37.
Now I see this online; it needs some cleanup: https://github.com/aps-8id-dys/ipython-8idiuser/blob/2728b4fcf63ef83544c4640cfbb0a701228e9bcc/profile_bluesky/startup/instrument/plans/xpcs_acquire.py#L280-L281
@prjemian Regarding line 281 in xpcs_acquire.py: if we staged the scaler in OneShot mode, it shouldn't matter if we count it every measurement, right?
Also, what should we do with line 264? Since it's an empty decorator, maybe comment out the entire line?
@qzhang234 Thanks for the reminder that AD_Acquire() is the inner loop, called once for each acquisition. So, I propose that the only references to the scaler within that inner loop be read-type: no configuration or puts. The caller will be responsible for configuring the scaler to update the channels of interest.
Since that's a lot of responsibility to place in the caller code (none of which is here on GitHub?), why don't you make a plan that does this (configure the scaler so that it acquires all the info it should via auto count, and start the auto-count)? This plan could be tested completely independent of AD_Acquire(). The user code would run this plan before it runs AD_Acquire() a gazillion times. A user could even copy and customize their own version should needs change.
> Regarding line 281 in xpcs_acquire.py: if we staged the scaler in OneShot mode, it shouldn't matter if we count it every measurement, right?

This is the line that fails on us (by starting a count whose status object has a timeout).

> Also, what should we do with line 264? Since it's an empty decorator, maybe comment out the entire line?

Exactly. All staging & triggering of the scaler in AD_Acquire() must be removed.
@prjemian Here are the changes I made:

1. scaler1 is now moved outside the for loop in AD_Acquire;
2. scaler1 references are removed from xpcs_acquire.py, including the staging and the triggering.

The short test scan was successful, so I pushed the current code to the GitHub branch and restarted the long scan.
Hi Pete:
A strange question: How come I don't see the recent two comments from you on the GitHub issue?
Thanks,
QZ
From: Pete R Jemian @.> Sent: Tuesday, May 24, 2022 3:21 PM To: aps-8id-dys/ipython-8idiuser @.> Cc: Zhang, Qingteng @.>; Mention @.> Subject: Re: [aps-8id-dys/ipython-8idiuser] Bluesky crash on 8idi:scaler1.CNT (Issue #287)
Measurement aborted at 13659/20000. Closing this issue and proceeding to the Lambda2M definition.
Pete:
Also those lines were commented out on branch 150-Lambda2M, which is up-to-date with local.
Thanks,
QZ
Bluesky crashed after 895 repeats on 05/22 at 4 am.
The error message from the IPython terminal and the last section of the logger file are attached.
20220522-020044_4am.txt Terminal_05_22_4am.txt