Enhancing Hardware Command Responses with Detailed Asynchronous Status Updates

tiyash-basu-frequenz commented 3 months ago

What's needed?

We need a more informative way to get responses from hardware components after issuing commands to them.

Currently, the response is merely an acknowledgement and does not reflect the actual state or events at the component. Ideally, responses should include the statuses of all activities our stack performs on the component.

For example, the backend might increase power to 100kW in a 3-step ramp of 10kW, 50kW, and 100kW. Currently, the response would only acknowledge the 100kW command. If the 50kW command fails when sent to the inverter (e.g., due to sudden bounds change or communication errors), the client remains unaware and has to rely solely on the power value in the component data stream. This is suboptimal.

This may involve exposing more low-level information to the clients, so we need to make this optional. Clients who do not want this information should not be required to receive it.

Proposed solution

There are two ways of addressing this:

Proposal 1

Update the set-power command RPCs to return a stream of responses.

Proposal 2

Have a separate RPC that streams command response statuses. In this case, each command should be provided a unique ID from the service, and the streamed response status should contain this ID. (although one client should not get response for commands from another client).

Both proposals are technically feasible from the service's perspective. However, we need feedback from current users to determine their preferences and reasons.

Use cases

With the proposed changes, the client would receive a stream of real-time status updates for each command.

For example, after issuing the command to increase power, the client would receive updates indicating the progress and success or failure of each step in the process. If the command to write to the Modbus register succeeds but the component fails to respond within the expected timeframe, the client would be immediately informed of the issue, allowing for timely corrective actions such as retrying the command, diagnosing the communication problem, or adjusting their operational strategy based on the current status of the component.

Alternatives and workarounds

No response

Additional context

No response

llucax commented 3 months ago

FYI @shsms

llucax commented 3 months ago

So seem to be leaning more towards option 2. Some advantages over option 1:

It should be less resources intensive
Not that we need to right now, but it is more backwards compatible if we keep the old rpc intact and users that need more/better replies can subscribe to the new stream.
If it is more convenient for the client to provide a reply stream per every request, we can still do it at the client level and create a python channel for every request instead of a rpc stream for every request.

shsms commented 3 months ago

leaning more towards option 2.

That sounds good to me.

If it is more convenient for the client to provide a reply stream per every request, we can still do it at the client level and create a python channel for every request instead of a rpc stream for every request.

This is not required for the SDK use cases. A single stream is easier.

tiyash-basu-frequenz commented 3 months ago

Thanks @shsms We will go for option 2 then.

tiyash-basu-frequenz commented 2 months ago

In the past weeks, I was working on similar problem in our proprietary stack, and I discovered that having a global response stream with request IDs leads to a few issues.

The first issue is about complications on the server side - it needs to maintain a map of output streams to client, and response objects from for one client should not go into the output stream for another. This is additional overhead that could be avoided if each request just returns its own response stream.

The second issue is with request ID tracking on the client side - managing request IDs for multiple requests is not trivial. If the client has to care about the responses, then it needs code to track request IDs through timeouts and mix-ups, which is evidently complicated. With a response stream per request, clients have the option to use them straightaway, or to stream them to a global response stream if they need to.

So, I am inclining towards option 1.

shsms commented 2 months ago

I can't think of why the SDK would need to track the request ID. I think we just need the component ID for which a request failed, and the SDK can take action with just that. So from the SDK side, I think option 2 is easier to use.

With option 1 we'd probably need more code, but I don't think there would be any performance cost. So AFAICS, going with option 1 shouldn't be too bad either.

tiyash-basu-frequenz commented 2 months ago

I think we just need the component ID for which a request failed, and the SDK can take action with just that.

The SDK would need more, I imagine. Say discharge commands were sent for 5 inverters. It would need to track the request IDs for each of these 5 discharge commands, otherwise it cannot know for which inverter the command failed.

shsms commented 2 months ago

Say discharge commands were sent for 5 inverters. It would need to track the request IDs for each of these 5 discharge commands, otherwise it cannot know for which inverter the command failed.

Not if the status update contains the component ID of the inverter for which something failed. That should always be there, right?

shsms commented 2 months ago

But never mind, option 1 does sound cleaner, modular, etc. So we can do that.

tiyash-basu-frequenz commented 2 months ago

Not if the status update contains the component ID of the inverter for which something failed. That should always be there, right?

It will, but the caller won't know for which command until the request ID is used. E.g., it could be that the response the caller just got was a response to an older command, which it had stopped caring about after a timeout. So the request ID cannot be ignored in general. We have seen similar usage patterns elsewhere as well. This is also where things start getting more rabbit-hole-ish and complicated with option2.

But never mind, option 1 does sound cleaner, modular, etc. So we can do that.

Will submit a PR soon!

shsms commented 2 months ago

If an inverter had an issue, it shouldn't matter which request it was for, it should be the same action - block the inverter temporarily. This would be the implementation if we use option 1 as well because otherwise, option 1 would also lead to the same problem on the client side, of tracking whether to take action for the responses for a certain request ID.

llucax commented 2 months ago

Well not if you receive first a successful telephone from a latest command and then a failure for an old final, I guess in that case we can consider the component is ok, but it would be a very weird edge case I guess

frequenz-floss / frequenz-api-microgrid