bluesky / bluesky-queueserver

Server for queueing plans
https://blueskyproject.io/bluesky-queueserver/
BSD 3-Clause "New" or "Revised" License

Intentionality in (not) bypassing the queueserver #292

Open padraic-shafer opened 9 months ago

padraic-shafer commented 9 months ago

This is not a package issue per se; it’s a bluesky integration consideration that converges at the queueserver. I want to capture a discussion in progress, and we could move this to a more appropriate venue if someone prefers.

TL;DR: Bypassing the queueserver to control hardware should be discouraged OR an intentional, transparent decision that provides some safeguards.

---

I've been discussing with @dylanmcreynolds and @danielballan the importance of having only one gateway for controlling the state of beamline/endstation during beamtimes. Queueserver naturally fills this role. I don't think we currently have a way to enforce this restriction. Rather than passing plans to queueserver, someone could concurrently access the RunEngine directly at a beamline terminal, command a movable ophyd device, set an EPICS PV, etc. There are potentially valid reasons for having these multiple pathways, but left unchecked this is a recipe for chaos in multi-user/multi-client interactions with the beamline.

I don’t propose we try to solve that larger problem here, but there are a few thoughts on the interplay between ophyd, RunEngine, and queueserver that I would like to share and get feedback on.

  1. Even simple motor moves should be plans sent to queueserver.
     a. This ensures that motor moves do not bypass the queue or preempt the intentions of the experimenters.
     b. In the context of a bluesky run, ophyd is effectively a middle layer—an implementation detail that I think we would not want users to fiddle with directly.
     c. This concern has particular relevance for controls GUIs, where the designer of the client application might be tempted to run scans through queueserver and provide a motor widget that directly accesses ophyd objects.

  2. What if I intentionally want to jog motors while a plan is running?
     a. @danielballan noted that it's a common paradigm to set a detector in continuous capture mode while one or more motors are jogged. If the detector is started as a bluesky run, then the motor must be moved by ophyd (or EPICS, or something outside of bluesky).
     b. This mode should be supported. If it ends up being the only notable use case, then perhaps there should be a dedicated escape hatch built into the RunEngine (and the queueserver by extension) to handle this.
     c. Maybe this is just a special case of a flyer plan?

  3. Auto-generated GUIs (e.g. Typhos) instantiate ophyd objects.
     a. We should not destroy the functionality of packages that create GUI widgets for moving motors and setting signals, which do so by inspecting a database of ophyd objects and connecting to them.
     b. We could consider a proxy object that supports this use while still enqueuing the motor moves. Example: an ophyd interface that delegates motor moves to an ophyd motor indirectly by adding the move as a queued plan, and perhaps inspects the queue as part of the readback status (see the sketch after this list).

  4. Alignment/troubleshooting vs. data collection
     a. There are times when using a terminal that directly communicates with hardware or with the RunEngine is simply more convenient. This might be for troubleshooting or for rapid alignment before starting the collection of publication-quality data.
     b. To enable this sensibly, there should probably be some mechanism for putting the beamline in a "manual" state—pause the queueserver and pass control to an admin user.
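
To make 3b concrete, here is a minimal sketch of such a proxy, assuming the bluesky-queueserver-api package; the class name and the readback logic are hypothetical, not an existing feature:

```python
# Hypothetical sketch: route ophyd-style set() calls through the queueserver.
from bluesky_queueserver_api import BPlan
from bluesky_queueserver_api.zmq import REManagerAPI


class QueuedMotorProxy:
    """Movable-like facade: set() enqueues a 'mv' plan instead of moving."""

    def __init__(self, motor_name, rm=None):
        self.name = motor_name
        self._rm = rm or REManagerAPI()

    def set(self, position):
        # Delegate the move to the queue as a single-item "mv" plan.
        # A production version would return an ophyd Status object that
        # completes when the queued item finishes.
        return self._rm.item_add(BPlan("mv", self.name, position))

    def read(self):
        # Inspect the queue as part of the readback status (idea 3b):
        # report how many moves of this motor are still pending.
        items = self._rm.queue_get()["items"]
        pending = [
            it for it in items
            if it.get("name") == "mv" and self.name in it.get("args", [])
        ]
        return {self.name: {"pending_moves": len(pending)}}
```

GUI widgets that expect a Movable could then be pointed at such a proxy without knowing the queueserver is behind it.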

padraic-shafer commented 9 months ago

Tagging several people here for opinions. If there is interest, we could consider scheduling a community call on this topic.

@tacaswell @mrakitin @dmgav @taxe10 @coretl @callumforrester @rodolakis @prjemian @ZLLentz @klauer @whs92 @clintonroy @ksunden @untzag

whs92 commented 9 months ago

We are very interested in this topic and would welcome a call.

coretl commented 9 months ago
  2. What if I intentionally want to jog motors while a plan is running?

We are explicitly separating the scanning layer, which should only be able to run one plan at a time, from the controls layer, which is where the GUI generation, PV live updates, and archiving happen. The majority of these jogging moves will be done by beamline staff, who will have read-write access to PVs, so they can use the controls GUIs during a scan. However, we intend to make the controls GUIs read-only for users, and they occasionally need to jog motors during a plan. For this case we could either turn specific PVs read-write for users, or create a second run engine with some motors in it for this live access. I'm not sure which we will go with at the moment...

dmgav commented 9 months ago

I just wanted to note that the features needed for (1), (2), and (4) already exist in the Queue Server. It was discussed earlier that there should be a separate service for monitoring and direct control of PVs. There was a project started by @untzag to implement the service, but I cannot find the repository.

untzag commented 9 months ago

You're thinking of @ksunden's project,

https://github.com/ksunden/bluesky-hwproxy

I only had some design input :smile:

For what it's worth, we've been using bluesky-hwproxy for more than a year now; it's just perfect for our small-potatoes workflow. It does exactly what @padraic-shafer suggests by making bluesky-protocol-compliant objects on the client side---you can just plug it into anything that expects such an API. We made it to do exactly what you are talking about---keep existing "fancy" GUI features alive in a queueserver world.

importance of having only one gateway for controlling the state of beamline/endstation during beamtimes

I'm not sure exactly why this is so important, but it seems to me that the proxy could be written to support set operations routed via queueserver.

One of the core design features of queueserver is the disposable worker process. Queueserver has no mechanism for interacting with the hardware between plan runs, as far as I know... maybe the newer support for arbitrary ipython stuff changes that story? The important thing about hwproxy, for us, is that it's always online and ready to serve fairly up-to-date hardware state information. It's a lot easier to write clients if you assume that there is always a reliable place to grab current hardware state.

I still think that message caching should be carefully considered as another approach that might work better for larger deployments. If every change to hardware state is captured through the runengine, you should be able to seek backwards through the documents and piece together the current state of the instrument.
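
A minimal sketch of that replay idea, assuming a vanilla RunEngine (`RE`) and standard bluesky documents; `StateCache` is a made-up name:

```python
# Hypothetical sketch: cache the latest hardware state from bluesky documents.
class StateCache:
    def __init__(self):
        self.state = {}  # maps signal name -> last observed value

    def __call__(self, name, doc):
        # Event documents carry {"data": {signal_name: value, ...}}; keeping
        # the most recent values approximates the current instrument state.
        if name == "event":
            self.state.update(doc["data"])


cache = StateCache()
RE.subscribe(cache)  # RE is assumed to be an existing RunEngine instance
```

The same logic could run offline over saved documents to "seek backwards" and reconstruct state after the fact.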

dmgav commented 9 months ago

My understanding is that item (3) refers to a web tool for monitoring and/or modifying PVs (a scaled-down version of CSS). It involves continuous monitoring of a large number of PVs and should not be done in the same process that runs plans. This is the reason for https://github.com/ksunden/bluesky-hwproxy. The existing features of the queue server make it possible to implement workflows that require occasional access to PVs (e.g. reading a single value), but it is not recommended to use it for continuous monitoring.
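
For reference, a sketch of that occasional-access path using `function_execute` from bluesky-queueserver-api; `read_single_pv` is a hypothetical function that would have to exist in the worker namespace and be listed as permitted:

```python
from bluesky_queueserver_api import BFunc
from bluesky_queueserver_api.zmq import REManagerAPI

RM = REManagerAPI()
# Execute a permitted function inside the worker environment...
reply = RM.function_execute(BFunc("read_single_pv", "motor1"))
# ...and poll for the returned value (reply structure may vary by version).
print(RM.task_result(reply["task_uid"]))
```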

dmgav commented 9 months ago

Jupyter Console cannot be used to interfere with a running plan started by the Queue Server, since the IPython kernel can execute only one cell at a time and the console stays unresponsive while the server is executing a plan or some other task. Jupyter Console works only when the server is idle or a plan is paused. Jupyter Console cannot connect to a worker running pure Python (which is the default mode).

ksunden commented 9 months ago

To be clear, hwproxy is currently explicitly limited to read-only access. Its goal was to communicate things like limits/current position to control software without having to run a plan (and therefore interrupt a currently running plan). It does not provide a mechanism to set values (though presumably it could; that would just require more careful consideration than we needed for our task).

Note that under the hood it just creates parallel connections to the hardware, so if a particular device requires that the bluesky object be in-process and is not controlled by EPICS/yaq/etc. (which allow multiple connections to the underlying device), then it won't work.

In practice, we have a program that knows nothing of bluesky at all that provides the "Engineering Interface" for non-scan/data-collection setup and initial tuning.

There is nothing actually stopping you from using that mid-scan, but it is recommended against. Though if, e.g., you notice your shutter is still closed 10 seconds into an hour-long scan (in a region without interesting data in the first place), then you can just open the shutter using the other program without interrupting the scan.

padraic-shafer commented 9 months ago

For this case we could either turn specific PVs read-write for users, or create a second run engine with some motors in it for this live access. I'm not sure which we will go with at the moment...

I think the second option (run engine + environment) has an advantage in that it could be generic: R/W access controlled at the ophyd level, because it might not be EPICS under the hood.

padraic-shafer commented 9 months ago

importance of having only one gateway for controlling the state of beamline/endstation during beamtimes

I'm not sure exactly why this is so important, but it seems to me that the proxy could be written to support set operations routed via queueserver.

I think that captures the essence. There could be multiple entry points for reading, but requests to mutate the state need to be serialized in some way. It might just be an agreement among a beamtime user group about how to coordinate while they are operating the beamline, but user_1 (at home) should probably be prevented from jogging the sample position or energy while user_2 (in a hotel room) is running scans. When everyone is in the same room this is a lot simpler, but we should not assume that is the case.

But, as mentioned above, there are cases where user_1 should be able to jog a motor while user_1 is running a scan. Maybe that's a useful cue: only one user is in charge at any time. It's a discussion for another day what to do when user_1 falls asleep during an overnight scan and user_2 needs to take charge. :) E.g., QS has a mechanism for unlocking the queue with a password that can be shared amongst the team.
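
A sketch of that lock mechanism via bluesky-queueserver-api; the exact signature here is an assumption on my part, so check the docs:

```python
from bluesky_queueserver_api.zmq import REManagerAPI

RM = REManagerAPI()
# Lock the queue and environment; only holders of the key can modify them.
RM.lock(lock_key="team-shared-secret", environment=True, queue=True)
# ...later, another team member takes over using the shared key.
RM.unlock(lock_key="team-shared-secret")
```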

padraic-shafer commented 9 months ago

I'm encouraged by the discussions of hwproxy above that multiple coexisting services could be a useful way to go. This suggests that a higher level coordinating service or gateway is then needed.

dmgav commented 9 months ago

https://github.com/ksunden/bluesky-hwproxy reads PVs using ophyd objects. I don't think wrapping each PV read into a plan and calling the Run Engine would provide any benefit.

dmgav commented 9 months ago

This suggests that a higher level coordinating service or gateway is then needed.

https://github.com/bluesky/bluesky-httpserver could be extended to coordinate multiple low-level services. Currently it simply forwards REST API requests to the Queue Server, but the original intention was that it would implement a higher-level API.

padraic-shafer commented 9 months ago

I'm glad to see that many of you are thinking about these same considerations. It sounds like it would be useful to have a live discussion / presentations ~~to bring more of us~~ so that we can collectively bring each other onto the same page.

I'll try to coordinate a time over mattermost that we could have a call...sometime after the ophyd-async call that is scheduled for next week.

padraic-shafer commented 9 months ago

In the meantime, I'll leave this "issue" open to collect more input and responses.

ZLLentz commented 9 months ago

At LCLS we've, for lack of a better term, "embraced the chaos" of having multiple pathways and haven't been burned by it yet.

In practice, the sort of multiple-user issues described here don't actually happen. Beamline staff know not to mess with the system during data collection (and if they do, it is at least logged). Routing most of our usage away from the standard ecosystem of EPICS client tools and through a custom proxy, just to protect/gate this, isn't something we have considered doing. For us, this would be more likely to interfere with our operations than to enhance them.

If we were going to implement something like this for our EPICS system, we wouldn't need any special consideration in queueserver at all: we'd figure out some system of dynamic access control where, during experiment runs, the IOCs would only accept writes from the queueserver and not from the operator consoles. But even this seems like a misallocation of effort to me.
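
For concreteness, EPICS can express this kind of rule with Channel Access security; a rough sketch of an access-security (.acf) file along those lines, with made-up host/account names:

```
# Hypothetical .acf sketch: only the queueserver account/host may write
# to PVs in the SCAN_PROTECTED group; everyone may read.
UAG(qserver_users) { qserver }
HAG(qserver_hosts) { bl-queueserver01 }

ASG(SCAN_PROTECTED) {
    RULE(1, READ)
    RULE(1, WRITE) {
        UAG(qserver_users)
        HAG(qserver_hosts)
    }
}
```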

In short, I'd prefer to configure the control system's user access control than rewrite any of the client-side software, but I also don't see a compelling reason to do this. Maybe the situation is different at other labs.

dmgav commented 9 months ago

My understanding is that we are discussing a single pathway for remote beamline operation, specifically for cases where the experiment is conducted/monitored by remote users. I have the impression that attempts to block write access to PVs for on-site staff may not go over very well at some beamlines.

prjemian commented 9 months ago

I agree.


ZLLentz commented 9 months ago

I think it makes more sense in the context of remote users (this wasn't previously mentioned in the discussion above), though I suspect the beamline staff can find ways to communicate remotely with the users of the ongoing experiment and work out any conflicts without needing a software solution.

LCLS doesn't have remote users so if that's the primary consideration it makes sense for me to have a differing perspective here.

dmgav commented 9 months ago

In this context, a remote user is anyone controlling the beamline from an off-site location, e.g. a beamline scientist working from home may be considered a remote user. I guess the goal is to develop configurable system components that cover all reasonable use cases.

padraic-shafer commented 9 months ago

If every change to hardware state is captured through the runengine, you should be able to seek backwards through the documents and piece together the current state of the instrument.

This is another notable piece of the puzzle. I suppose a question to consider is: do we adequately record the state of the system (in EPICS record databases, or bluesky baselines, or ...) to reproduce the experiment at a later time? Or do we need to recreate state through an event log or some other change-data-capture stream?
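
On the first option, bluesky's baseline mechanism already snapshots chosen signals at the start and end of every run; a minimal sketch (the device list is illustrative):

```python
from bluesky import RunEngine
from bluesky.preprocessors import SupplementalData

RE = RunEngine({})
sd = SupplementalData()
sd.baseline = [sample_x, sample_y, mono_energy]  # hypothetical ophyd devices
RE.preprocessors.append(sd)
# Every run now emits a "baseline" event stream with before/after readings,
# which goes a long way toward reproducing instrument state later.
```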

In the latter case, it's immensely helpful if the logs all end up in one place. So either every change request goes through some central coordinator, or each service needs to handle logging to a central log.

This line of thought is diverging from the original discussion, but I think it's useful to keep in mind.