Closed by qzhang234 2 years ago
Bluesky crashed at 3 pm on 05/22. Error message attached.
Same crash message again at 5 pm.
Same error message at 6 pm.
Something systematic is going on here, but it happens at random times.
Also, the Logger_Output.txt files are actually SPEC data files. The logger output files are in a subfolder.
@prjemian My apologies, you showed me the directory before. Is this the one?
Yes
[image] https://user-images.githubusercontent.com/48140482/169721381-0a474111-fa73-4d85-9220-1b374926576c.png
AutoCount may be the problem. Thinking this should be "OneShot" instead. Agree?
@prjemian I agree. I have modified the code accordingly (See code attached)
Also attached is a screenshot of the current code behavior.
Hopefully this looks good?
@prjemian Bluesky crashed again at measurement 387/20000 with the same error, despite fixing the count mode to OneShot in 8idi:scaler1. Any suggestions?
Hmmm. Let's add some diagnostics just before this line:
File "/home/beams10/8IDIUSER/.ipython-bluesky/profile_bluesky/startup/instrument/plans/xpcs_acquire.py", line 282, in full_acquire_procedure
yield from bps.mv(scaler1.count, "Count")
File "/home/beams/8IDIUSER/.conda/envs/bluesky_2022_2/lib/python3.9/site-packages/bluesky/plan_stubs.py", line 258, in mv
Here are a couple of things to print (edited):
print(f"DIAGNOSTIC ({__name__},full_acquire_procedure): scaler1.stage_sigs={scaler1.stage_sigs}")
print(f"DIAGNOSTIC ({__name__},full_acquire_procedure): scaler1._staged={scaler1._staged}")
print(f"DIAGNOSTIC ({__name__},full_acquire_procedure): scaler1.count={scaler1.count}")
print(f"DIAGNOSTIC ({__name__},full_acquire_procedure): scaler1.count_mode={scaler1.count_mode}")
@prjemian I added the diagnostics you recommended. Here's the terminal output. Does this look good?
Yes, that looks good. It's not a fix yet. Hopefully, we have better information when this happens again.
DIAGNOSTIC (instrument.plans.xpcs_acquire,full_acquire_procedure): scaler1.stage_sigs=OrderedDict([('count_mode', 'OneShot'), ('auto_count_time', 0.1)])
DIAGNOSTIC (instrument.plans.xpcs_acquire,full_acquire_procedure): scaler1._staged=Staged.yes
DIAGNOSTIC (instrument.plans.xpcs_acquire,full_acquire_procedure): scaler1.count=EpicsSignal(read_pv='8idi:scaler1.CNT', name='scaler1_count', parent='scaler1', value=1, timestamp=1653335993.658818, auto_monitor=True, string=False, write_pv='8idi:scaler1.CNT', limits=False, put_complete=False)
DIAGNOSTIC (instrument.plans.xpcs_acquire,full_acquire_procedure): scaler1.count_mode=EpicsSignal(read_pv='8idi:scaler1.CONT', name='scaler1_count_mode', parent='scaler1', value='OneShot', timestamp=1653335951.275486, auto_monitor=True, string=True, write_pv='8idi:scaler1.CONT', limits=False, put_complete=False)
@prjemian Bluesky crashed after 863 measurements with the diagnostics and the OneShot staging in place. The error message is attached.
Let's work on this scaler configuration issue without the AD_Acquire() plan.
Something like this:
```python
import time

def issue287(num=1):
    """Repeatedly count scaler1 and report the average time per count."""
    t0 = time.time()
    for i in range(num):
        print(f"iteration {i+1} of {num}")
        yield from bp.count([scaler1])
    dt = time.time() - t0
    print(f"elapsed {dt:.3f} s ({dt/num:.4f} s/iteration)")

RE(issue287(100))
```
Also, we could try not using the CA monitors when counting this scaler.
Here is an example when I run the issue287 plan above (on pearl):
In [53]: def issue287(num=1):
...: t0 = time.time()
...: for i in range(num):
...: yield from bp.count([scaler1])
...: dt = time.time() - t0
...: print(f"iteration {i+1} of {num}: elapsed {dt:.2f} ({dt/(i+1):.6f} s/iteration)")
...:
In [54]: uids = RE(issue287(1_500))
iteration 1 of 1500: elapsed 0.14 (0.143082 s/iteration)
iteration 2 of 1500: elapsed 0.28 (0.140344 s/iteration)
iteration 3 of 1500: elapsed 0.44 (0.147266 s/iteration)
iteration 4 of 1500: elapsed 0.58 (0.144830 s/iteration)
iteration 5 of 1500: elapsed 0.74 (0.148361 s/iteration)
iteration 6 of 1500: elapsed 0.88 (0.146338 s/iteration)
iteration 7 of 1500: elapsed 1.04 (0.148956 s/iteration)
iteration 8 of 1500: elapsed 1.18 (0.147759 s/iteration)
iteration 9 of 1500: elapsed 1.34 (0.148888 s/iteration)
iteration 10 of 1500: elapsed 1.48 (0.147937 s/iteration)
iteration 11 of 1500: elapsed 1.64 (0.149289 s/iteration)
iteration 12 of 1500: elapsed 1.78 (0.148284 s/iteration)
iteration 13 of 1500: elapsed 2.12 (0.163308 s/iteration)
iteration 14 of 1500: elapsed 2.26 (0.161399 s/iteration)
iteration 15 of 1500: elapsed 2.42 (0.161523 s/iteration)
iteration 16 of 1500: elapsed 2.56 (0.159936 s/iteration)
iteration 17 of 1500: elapsed 3.24 (0.190487 s/iteration)
iteration 18 of 1500: elapsed 3.38 (0.187550 s/iteration)
iteration 19 of 1500: elapsed 3.54 (0.186348 s/iteration)
iteration 20 of 1500: elapsed 3.71 (0.185709 s/iteration)
iteration 21 of 1500: elapsed 3.85 (0.183464 s/iteration)
iteration 22 of 1500: elapsed 4.02 (0.182556 s/iteration)
iteration 23 of 1500: elapsed 4.16 (0.180680 s/iteration)
iteration 24 of 1500: elapsed 4.32 (0.179806 s/iteration)
iteration 25 of 1500: elapsed 4.45 (0.177991 s/iteration)
iteration 26 of 1500: elapsed 4.58 (0.176266 s/iteration)
iteration 27 of 1500: elapsed 4.74 (0.175527 s/iteration)
iteration 28 of 1500: elapsed 4.87 (0.174056 s/iteration)
iteration 29 of 1500: elapsed 5.13 (0.177067 s/iteration)
If you notice, the average time per iteration mostly decreases, but jumps upward every so often (like a sawtooth wave). Not sure I can explain that, but I wonder if that is some kind of indication.
Here's a plot of the average time per iteration for the first 200 iterations:
Another view (with a new set of measurements) of the time between successive calls to bp.count():
This (random occurrence of taking a lot of extra time), I believe, is an indication of the problem that bluesky must be seeing. The status object fails (repeatedly on PV 8idi:scaler1.CNT) for some reason we do not yet know. There are no other EPICS PVs or signals involved. Not even any custom staging. Just the bare, out-of-the-box experience with bluesky, ophyd, and 8idi:scaler1.
By the way, I'm running an ipython session on pearl that has a minimal bluesky setup (not the full account, no databroker, no data saving either, since I do not believe that to be related to the problem we are seeing).
```python
from bluesky import plans as bp
from bluesky import plan_stubs as bps
from bluesky import RunEngine
from ophyd.scaler import ScalerCH

scaler1 = ScalerCH("8idi:scaler1", name="scaler1")
RE = RunEngine({})
```
So the scaler supplies metadata to the XPCS measurement? When it is triggered from AD_Acquire(), the values from the scaler channels are synchronous with the image. Could we completely ignore triggering the scaler and leave it in auto-count mode? This would provide (reasonably) prompt measures of "scalers and devices such as temperature" without AD_Acquire() being bothered by the status timeouts it is suffering now.
It does not solve the problem, but I wonder if we need this to be a problem at all.
@prjemian Yes, I agree that ignoring scaler1 during the staging is probably more time-efficient than getting to the bottom of this. How should we execute this idea? By commenting out this line?
Regarding the diagnostics on the counter: looking at the figure, I noticed that none of them show a readout time > 60 s, which is the limit for the timeout. Maybe the failure is more fundamental than the readout-time fluctuation shown here?
I'm thinking a bit different, that AD_Acquire should configure the scaler once (before it starts its acquisition loop) and start the scaler in auto count mode. So, whatever steps are necessary should replace that line you identified. We should also add the named channels from the scaler to the "detectors" we want to read (if they are not already there).
AD_Acquire is nested inside a for loop and is called on the order of 20,000 times. Maybe we can add the scaler staging step in the Bluesky plan, outside the for loop?
For example, right after line 22 might be a good location (please refer to the bluesky plan attached; I had to change it to .txt, otherwise it wouldn't let me upload). The loop doesn't start until line 37.
Now I see this online; it needs some cleanup: https://github.com/aps-8id-dys/ipython-8idiuser/blob/2728b4fcf63ef83544c4640cfbb0a701228e9bcc/profile_bluesky/startup/instrument/plans/xpcs_acquire.py#L280-L281
@prjemian Regarding line 281 in xpcs_acquire.py: if we staged the scaler in OneShot mode, it shouldn't matter if we count it every measurement, right?
Also, what should we do with line 264? Since it's an empty decorator, maybe comment out the entire line?
@qzhang234 Thanks for the reminder that AD_Acquire() is the inner loop, called once for each acquisition. So, I propose that the only references to the scaler within that inner loop be read-type: no configuration or puts. The caller will be responsible for configuring the scaler to update the channels of interest.
Since that's a lot of responsibility to place in the caller code (none of which is here on GitHub?), why don't you make a plan that does this (configure the scaler so that it acquires all the info it should via auto count, and start the auto-count)? This plan could be tested completely independent of AD_Acquire(). The user code would run this plan before it runs AD_Acquire() a gazillion times. A user could even copy and customize their own version should needs change.
> Regarding line 281 in xpcs_acquire.py: if we staged the scaler in OneShot mode, it shouldn't matter if we count it every measurement, right?

This is the line that fails on us (by starting a count whose status object has a timeout).

> Also, what should we do with line 264? Since it's an empty decorator, maybe comment out the entire line?

Exactly. All staging & triggering of the scaler in AD_Acquire() must be removed.
@prjemian Here are the changes I made:

1. scaler1 is now moved outside the for loop in AD_Acquire;
2. scaler1 references are removed from xpcs_acquire.py, including the staging and the triggering.

The short test scan was successful, so I pushed the current code to the GitHub branch and restarted the long scan.
Hi Pete:
A strange question: How come I don't see the recent two comments from you on the GitHub issue?
Thanks,
QZ
From: Pete R Jemian @.> Sent: Tuesday, May 24, 2022 3:21 PM To: aps-8id-dys/ipython-8idiuser @.> Cc: Zhang, Qingteng @.>; Mention @.> Subject: Re: [aps-8id-dys/ipython-8idiuser] Bluesky crash on 8idi:scaler1.CNT (Issue #287)
Measurement aborted at 13659/20000. Closing this issue and proceeding to the Lambda2M definition.
Pete:
Also those lines were commented out on branch 150-Lambda2M, which is up-to-date with local.
Thanks,
QZ
Bluesky crashed after 895 repeats on 05/22 at 4 am.
The error message from the IPython terminal and the last section of the logger file are attached.
20220522-020044_4am.txt Terminal_05_22_4am.txt