Skipping jammed Rigaku measurement and restart run engine?

aps-8id-dys / ipython-8idiuser

8-ID-I ipython configuration for bluesky (and other)


Skipping jammed Rigaku measurement and restart run engine? #230

Closed qzhang234 closed 4 years ago

qzhang234 commented 4 years ago

I noticed the same error on the LabView panel when I ran XSPA with Spec. Guess the jamming we saw in #223 two weeks ago is not a Bluesky problem.

This raises a question: Is there a way to modify Bluesky so that it automatically moves on to the next measurement if XSPA does not respond for more than 60 s (each measurement should take no more than 5 s)?

qzhang234 commented 4 years ago

One more thought: is it possible to send an email from Bluesky when a scan crashes or hangs?

prjemian commented 4 years ago

Yes. See EmailNotifications(). This example is now in the docs:

from apstools.utils import EmailNotifications

SENDER_EMAIL = "8idiuser@aps.anl.gov"
email_notices = EmailNotifications(SENDER_EMAIL)
email_notices.add_addresses(
    "joe.user@anl.gov",
    "instrument_team@aps.anl.gov",
    # others?
)

# then, when some condition occurs
if feedback_limits_approached:
    subject = "Feedback problem"
    message = "Feedback is very close to its limits."
    email_notices.send(subject, message)

prjemian commented 4 years ago

> Is there a way to modify Bluesky so that it automatically moves on to the next measurement if XSPA does not respond for more than 60 s (each measurement should take no more than 5 s)?

If we can catch the timeout, for sure we can do this.
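A minimal sketch of the "catch the timeout and move on" idea, in plain Python so it runs without the beamline installed. The names `wait_for`, `run_all`, `start` and `is_done` are hypothetical stand-ins, not part of Bluesky or of this repository. In a real plan the RunEngine would do the waiting rather than a blocking loop.

```python
import time


def wait_for(is_done, timeout_s=60.0, poll_s=0.5):
    """Poll is_done() until it returns True, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while not is_done():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no response within {timeout_s} s")
        time.sleep(poll_s)


def run_all(measurements, timeout_s=60.0, poll_s=0.5):
    """Run each measurement in turn; skip any that time out.

    measurements: iterable of (label, start, is_done) tuples, where
    start() triggers the detector and is_done() reports completion.
    Both callables are assumptions made for this sketch.
    Returns the labels of the measurements that were skipped.
    """
    skipped = []
    for label, start, is_done in measurements:
        start()
        try:
            wait_for(is_done, timeout_s, poll_s)
        except TimeoutError:
            # Detector jammed: record it and go on to the next one.
            skipped.append(label)
    return skipped
```

The list of skipped labels could then go into the email notification, so the jammed measurements can be repeated later.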

prjemian commented 4 years ago

We might want to catch a sequence of n consecutive jams to make sure we do not retry a hopeless situation.
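Counting consecutive jams could look something like the sketch below. `JamCounter` is a hypothetical helper written for illustration, not existing code. The count resets on any successful measurement, so only an unbroken run of jams trips the limit.

```python
class JamCounter:
    """Track consecutive jams; hypothetical helper for illustration."""

    def __init__(self, limit):
        self.limit = limit      # n consecutive jams that mean "give up"
        self.consecutive = 0    # current unbroken run of jams

    def record(self, jammed):
        """Record one measurement outcome.

        Returns True once `limit` jams have occurred in a row.
        """
        if jammed:
            self.consecutive += 1
        else:
            self.consecutive = 0  # a success breaks the run
        return self.consecutive >= self.limit


counter = JamCounter(limit=3)
outcomes = [True, True, False, True, True, True]  # True means jammed
give_up = [counter.record(jammed) for jammed in outcomes]
```

With these outcomes only the final entry of `give_up` is True, because the one success in the middle resets the count.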

qzhang234 commented 4 years ago

Yes. I would say 3 retries would be enough.

This Rigaku timeout bug has occurred twice with Bluesky in #223 and once with Spec this week. It appears to be a recurring and reproducible problem. Nakaye doesn't know the source of the bug so we'll have to fix it from our end. Hopefully implementing the Bluesky re-throw will permanently fix this bug.

Also, now that I think about it, most of the jams or crashes when operating Rigaku/Bluesky can be fixed by simply pressing Ctrl+C and restarting the plan. Maybe this implementation is the last step towards our milestone of one week of continuous Bluesky user operation.

The beam will be down from next Monday (09/28) at 8 am until Thursday (10/01) at 8 am. This is a great opportunity, so let's get this done before the beam is back up. @prjemian Please let me know if there's anything I can help with.

Thanks!

qzhang234 commented 4 years ago

The same bug occurred again while running with Spec on 09/27, 11:35 pm. I'm therefore changing the label to 'high priority'.

It looks like our best chance is to run Bluesky for the week of 10/01 - 10/12 with the rethrow capability implemented.

@prjemian There's no beam till 10/01. Please advise on how we should start working on this. Thanks!

prjemian commented 4 years ago

So, we want to implement a timeout around a call to yield from AD_Acquire().

If timeout, then:

- wait for ~5 minutes
- retry up to n times
- if retries exhausted:
  - send email
  - the detector is not responding, so abort the scan
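The steps above can be sketched as a generator wrapper. This is only an outline under assumptions: `AcquireTimeout`, `acquire_with_retries` and the `notify` callback are hypothetical names, and `acquire_plan` stands in for the real `AD_Acquire()` plan. It uses plain Python generators so it runs without a detector attached.

```python
import time


class AcquireTimeout(Exception):
    """Hypothetical exception: the detector did not respond in time."""


def acquire_with_retries(acquire_plan, max_retries=3, wait_s=300.0,
                         notify=None, sleep=time.sleep):
    """Run acquire_plan(), pausing and retrying whenever it times out.

    acquire_plan: zero-argument generator function (stand-in for AD_Acquire).
    notify: optional callable(subject, message), e.g. email_notices.send.
    sleep: blocking sleep by default; a real plan would more likely
           `yield from bluesky.plan_stubs.sleep(wait_s)` at that point.
    """
    attempts = 1 + max_retries  # the first try plus the allowed retries
    for attempt in range(1, attempts + 1):
        try:
            yield from acquire_plan()
            return  # success, nothing more to do
        except AcquireTimeout:
            if attempt < attempts:
                sleep(wait_s)  # wait (~5 minutes by default), then retry
    # Retries exhausted: send the email, then abort the scan.
    message = f"Acquisition timed out {attempts} times in a row; aborting."
    if notify is not None:
        notify("Detector not responding", message)
    raise RuntimeError(message)
```

Usage in a plan would be something like `yield from acquire_with_retries(lambda: AD_Acquire(), notify=email_notices.send)`, assuming the timeout reaches the plan as an exception that can be caught there.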

qzhang234 commented 4 years ago

Just a note that the LabView hang occurred again at 23:33 on 09/29.

prjemian commented 4 years ago

Commit 2d2d672 should also handle the ReadTimeout problems affecting #233