Event driven KERIA operations

iFergal commented 2 months ago

After an async Signify operation is successfully accepted by KERIA, a long running operation is stored and the oid returned to the client.

The client can then use this oid to poll the status of the operation until done=True and process any additional metadata.
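For reference, a minimal sketch of the polling pattern described above. The `fetch` callable and the shape of the returned dict (a `done` flag) are assumptions for illustration, not the actual Signify client API.

```python
import time
from typing import Any, Callable, Dict


def wait_operation(
    fetch: Callable[[str], Dict[str, Any]],  # hypothetical: returns the operation as a dict
    oid: str,
    interval: float = 1.0,
    max_interval: float = 30.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll one operation until it reports done, doubling the delay each round."""
    deadline = time.monotonic() + timeout
    delay = interval
    while True:
        op = fetch(oid)  # one HTTP request per poll in a real client
        if op.get("done"):
            return op
        if time.monotonic() >= deadline:
            raise TimeoutError(f"operation {oid} not done after {timeout}s")
        sleep(delay)
        delay = min(delay * 2, max_interval)
```

Every pending operation needs its own loop like this, which is where the request volume comes from.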

I don't think this is scalable long term, because even a single client may be polling many operations at once with separate HTTP requests. And in general, if those clients were other microservices in the same infrastructure (as opposed to, e.g., an edge mobile wallet), it would be much nicer to react to events rather than tracking and polling in the background.

Not sure yet on the best approach here. A nice first step might be an interface to dump signed messages (maybe that could be plugged into standard message brokers?). A longer-term solution would need KERI authentication to listen for events.

edeykholt commented 2 months ago

Use of websockets is a decent alternative design to polling or having a callback endpoint. See https://indicio.tech/handling-high-volume-verification-with-socketdock/

iFergal commented 2 months ago

Yeah, websockets are an option. Message brokers can use things like AMQP and MQTT, but to avoid complicating the architecture there could be a built-in websocket endpoint that serves signed events as a default, along with a way to acknowledge events.

The trouble of course is it's too much complication to make that as reliable as a message broker but it would be a starting place.

Would have to think about how authentication is done for websockets though. Acknowledging events would need to be signed...

kentbull commented 1 month ago

I agree, this is not scalable long-term. Mailbox implementations are something we should consider here as well because setting up a websocket communication endpoint makes sense for both KERIA agents and mailbox agents. At some point I imagine KERIA will support mailbox agents.

All we have to do to make KERIA compatible with WebSockets is to make a Doer or DoDoer that services websockets in a nonblocking way like KERIA currently services TCP sockets in HIO.

And maybe we should look at other transport mechanisms in light of today's discussion of how ESSR could make HTTP unnecessary. Yet, starting with something people are familiar with, HTTP and WebSockets, seems like a good idea.

iFergal commented 1 month ago

One concern that @lenkan raised is that serving websockets straight out of KERIA makes horizontally scaling KERIA more complex, and I tend to agree.

Most likely, I might do a stop-gap solution where I publish to some internal broker for other microservices to pick up, but that doesn't work for zero-trust networking, and in general the acknowledgements should be authenticated somehow.

kentbull commented 1 month ago

As long as your internal broker makes sure that the websockets messages get to the right agent and keeps trying until they get delivered then I think that would work. I've tended to think that websockets should go from agent to browser with message acknowledgment built into the agent itself. Maybe I'm wrong there. That is a bit of complexity, though you end up having to deal with it no matter what you do if you want event-driven KERIA, either in your own custom internal broker or in KERIA itself.

It's a common enough need that at least a basic implementation in KERIA seems warranted.

iFergal commented 1 month ago

By internal broker, I didn't mean internal to KERIA - I just meant internal to my infrastructure and without proper authentication as a stop gap solution. So not necessarily websockets at all.

But do agree that a basic solution within KERIA makes sense too as a default to not have KERIA rely on other infrastructure, so long as it's easy to integrate different brokers etc in a uniform way for more complex deployments.

For example, Spring Cloud Stream binds to a bunch of different broker types - https://spring.io/projects/spring-cloud-stream - the tricky part is KERI based authentication.
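A rough sketch of what "uniform" broker integration could look like: one small publisher interface, with an in-memory default that needs no other infrastructure. All names here (`EventPublisher`, `InMemoryPublisher`, the topic string) are hypothetical, and KERI based authentication is deliberately not addressed.

```python
import json
from typing import Callable, Dict, List, Protocol

Handler = Callable[[bytes], None]


class EventPublisher(Protocol):
    """Anything that can publish bytes to a topic (AMQP, MQTT, in-memory, ...)."""

    def publish(self, topic: str, payload: bytes) -> None: ...


class InMemoryPublisher:
    """Default publisher with no external dependencies."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, payload: bytes) -> None:
        for handler in self._handlers.get(topic, []):
            handler(payload)


def notify_operation_done(publisher: EventPublisher, oid: str) -> None:
    """Emit a completion event; the topic name is an assumption for illustration."""
    body = json.dumps({"oid": oid, "done": True}).encode("utf-8")
    publisher.publish("keria.operations.done", body)
```

A broker-backed publisher would implement the same `publish` method, so the code that emits events would not need to change between deployments.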

SmithSamuelM commented 1 month ago

This might be helpful. In distributed systems with layered protocols, transactions can live at different layers in the stack. For critical systems, the only place where the absolutely highest level of reliability and durability of a transaction can be achieved is the application layer. These are often called end-to-end application layer transactions. This means that the two ends of the transaction live at the application layers of the two protocol stacks (or of all protocol stacks in a multi-party transaction). If the transaction must be ensured at least once, i.e., the transaction must complete no matter what faults may occur, then the transaction state must be durable, as in it lives forever until it either completes or the user explicitly cancels the transaction (which is a type of completion). This usually means the transaction state is persisted to a durable database, and the transaction itself never times out. It lives forever with indefinite retries, usually with an exponential back-off.
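To make the idea concrete, here is a minimal sketch of a durable transaction record with indefinite retries and capped exponential back-off. It uses SQLite purely as a stand-in for "a durable database"; the table layout and function names are assumptions for illustration, not anything that exists in KERIA.

```python
import sqlite3
from typing import Callable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the transaction store; use a file path so state survives a reboot."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS txn ("
        "tid TEXT PRIMARY KEY, state TEXT NOT NULL, "
        "attempts INTEGER NOT NULL, next_try REAL NOT NULL)"
    )
    db.commit()
    return db


def begin(db: sqlite3.Connection, tid: str, now: float) -> None:
    """Record a new transaction; re-recording the same id is a no-op."""
    db.execute("INSERT OR IGNORE INTO txn VALUES (?, 'pending', 0, ?)", (tid, now))
    db.commit()


def backoff(attempts: int, base: float = 1.0, cap: float = 3600.0) -> float:
    """Exponential back-off, capped so the delay never grows without bound."""
    return min(cap, base * 2 ** min(attempts, 32))


def run_due(db: sqlite3.Connection, attempt: Callable[[str], bool], now: float) -> None:
    """Try each pending transaction that is due; there is no timeout state."""
    rows = db.execute(
        "SELECT tid, attempts FROM txn WHERE state = 'pending' AND next_try <= ?",
        (now,),
    ).fetchall()
    for tid, attempts in rows:
        if attempt(tid):
            db.execute("UPDATE txn SET state = 'complete' WHERE tid = ?", (tid,))
        else:
            db.execute(
                "UPDATE txn SET attempts = ?, next_try = ? WHERE tid = ?",
                (attempts + 1, now + backoff(attempts), tid),
            )
    db.commit()


def cancel(db: sqlite3.Connection, tid: str) -> None:
    """Explicit user cancellation, the only way out other than completion."""
    db.execute("UPDATE txn SET state = 'cancelled' WHERE tid = ? AND state = 'pending'", (tid,))
    db.commit()
```

The only terminal states are `complete` and `cancelled`; a failed attempt just pushes `next_try` further out.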

AFAIK, we don't have truly persistent end-to-end application layer transaction support in KERIA. It seems there is an expectation that somehow KERI core (which is not at the application layer relative to KERIA) is supposed to be responsible for ensuring application layer end-to-end reliability. It's not. Attempting to make the KERI escrows act that way is self-defeating.

The mailbox, on the other hand, could be modified to support true end-to-end reliable application layer transactions, but I believe it is missing some features for that to be true. And the mailbox is delivering notifications, which are not typically an application layer service; they are usually a lower layer service.

Usually an application layer transaction is some action that needs to happen in order for the application to execute a workflow. A mailbox might be an intermediary in helping or supporting that action to happen, but the persistence of that action and what resources it brings to bear to ensure that the action happens is usually more than that. And for it to be persistent it must survive the host OS being rebooted or crashing and losing in-memory state, which means a transaction record that durably persists transaction state is needed.

Now exchange messages could be augmented to support end-to-end application layer transactions by adding delivery, durability, and retry characteristics as well as application action specific data. There are different types of retries. There are unacknowledged repeats, there are retries with polled completion state, and there are acknowledged retries with timeouts when acknowledgments are not received, etc. All of these provide different optimizations and trade-offs for reliability and resource utilization, which in resource constrained environments can have positive or negative feedback loops that affect reliability (like cascading retries that swamp the network so no messages get through, etc.).

As people are putting KERI into production, they are building applications that are begging for reliability, and the only place where true reliability lives is at the application layer. Any lower layer does not have enough visibility into the goals of the application workflow to know how to be reliable without being counterproductive.

Lower layers can only be reliable with respect to the conditions they have visibility into. So for example, keri escrows are meant to soften the effect of asynchronous network faults, not remove them entirely since some faults have very long time horizons.

TCP is well known as a "reliable" service but only for a very limited set of faults. TCP connections fail catastrophically in that any in-stream data is lost and can NOT be restored by TCP itself. So a higher layer is responsible for restarting any data that was streamed over a broken TCP connection. Only some parts of HTTP have any additional reliability. Mostly, HTTP just notifies you that a problem has occurred. HTTP SSE events have a client side timeout and attempt to reestablish a broken connection, but the client application is responsible for telling the SSE server where to restart the events if events are lost in transit, and so on.

This is where these discussions should be headed, I think.

iFergal commented 1 month ago

Thanks @SmithSamuelM this is helpful. Though here I'm solely referring to the Signify client being notified of long running operations completing from its agent - these long running operations only live in KERIA and not even keripy AFAIK - longrunning.py. So separate to escrows and KERI core.

Currently Signify applications need to poll KERIA by operation ID to check for completion. Since Signify is on the edge it might not have an endpoint for exchange messages. I'm just trying to avoid polling several operations' statuses in parallel instead of simply waiting for a general completion event with a handler.

--

I mentioned a broker for the cases where Signify is actually part of a cloud microservice deployed alongside KERIA. Arguably those microservices could be agents too but there's no KERI Java. (right now Java->Signify-TS but hopefully Signify-Java in a while, as that's a much smaller undertaking than KERI Java)

But maybe that'd get a lot easier if we had tcp ESSR instead of REST and signed headers like we discussed yesterday.

SmithSamuelM commented 1 month ago

@iFergal It sounds like there might be a benefit from wrapping any long running workflows in KERIA with a workflow transaction state endpoint. The workflow transaction could, for example, refresh any exn messages needed to finalize a distributed multi-sig event. That way SignifyTS can be dumb and just check the progress of the long running transaction.

I believe @pfeairheller described it. The multi-sig group member who initiates a key event sends a notification with all the material needed for each group member to update their contributing AID key state and create the associated group AID event. They can then totally asynchronously sign that event and broadcast it to the other group members. The original notification from the initiator should work in idempotent fashion. A refresh retry of the same notification should result in the same signed event from each group member. The idempotent part is that each contributing key state may have already been updated, so it would not need to be updated again; the member would only recreate and re-sign the group key event and re-broadcast it.

So the initiator's KERIA agent could have a long running transaction that looks for completion of a threshold number of signatures. If the threshold is not met after some time, it could retry the initiation. This would refresh any timed out escrows and enable offline group members to catch up. The long running KERIA transaction could have two completion events: either the threshold is met or the user decides to cancel, but the transaction itself never times out. The Signify client could get transaction state updates by either polling or having an SSE push. The long running transaction would be instantiated as a transaction specific doer, and a DoDoer would keep a list of all such transactions in its doers list to keep them running until completion, whereupon the doers delete themselves from the DoDoer's list.
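The shape of such a transaction can be sketched in plain Python with generators. This is only an illustration in the spirit of a doer and DoDoer; it does not use the HIO API, and `multisig_txn`, `Scheduler`, and their parameters are hypothetical names.

```python
from typing import Callable, Generator, List, Set


def multisig_txn(
    needed: int,
    signatures: Set[str],             # a set, so a duplicate signature from a retry is harmless
    reinitiate: Callable[[], None],   # idempotent re-send of the original notification
    cancelled: Callable[[], bool],
    retry_every: int,
) -> Generator[None, None, str]:
    """Run until the threshold is met or the user cancels; never time out."""
    ticks = 0
    while True:
        if len(signatures) >= needed:
            return "complete"
        if cancelled():
            return "cancelled"
        ticks += 1
        if ticks % retry_every == 0:
            reinitiate()  # refresh timed out escrows, let offline members catch up
        yield  # hand control back to the scheduler


class Scheduler:
    """Keeps transactions running; finished ones are removed from the list."""

    def __init__(self) -> None:
        self.tasks: List[Generator[None, None, str]] = []
        self.results: List[str] = []

    def add(self, task: Generator[None, None, str]) -> None:
        self.tasks.append(task)

    def tick(self) -> None:
        for task in list(self.tasks):
            try:
                next(task)
            except StopIteration as done:
                self.results.append(done.value)
                self.tasks.remove(task)
```

A Signify client would then only need to observe the recorded result, by polling or via a push, rather than drive the retries itself.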

In general, every potentially long running workflow in KERIA could be architected this way so that it's just rinse and repeat.

iFergal commented 1 month ago

Expanding the API for refreshing sounds like a good idea @SmithSamuelM - and agreed it should be idempotent. Overall sounds like it'd be a lot more reliable.

Regarding the Signify client receiving updates: I haven't worked with SSE before but it sounds like it could work! Will think about how it might be impacted with horizontal scaling but could be OK.

pfeairheller commented 1 month ago

What you described @SmithSamuelM is pretty much how it works now. Some of the words in my explanation of it may be different, but this is the current architecture.

SmithSamuelM commented 1 month ago

SSE and chunking were introduced in HTTP 1.1 to replace the hackish long polling used for push notifications. SSE is true HTTP, which means it's supported natively by load balancers etc., whereas alternatives like websockets are not. So in general, if you want an event driven push to a client, then use an SSE connection. One still uses the regular ReST API to send stuff to the host via POST and PUT, but delayed responses due to long running background processes can come back via SSE. The Firebase ReST API IMHO is an example of how to do SSE well.

iFergal commented 1 month ago

OK great, thanks! Two other things that come to mind are:

1. If the Signify client is completely offline, we'd have to buffer those event updates if we want to deliver them without the client explicitly fetching all operations on start-up. (It's cleaner if the Signify application just has to react to SSE in its code.)
2. I'm curious about acknowledgements. From reading about SSE (but not testing yet), it seems there are just TCP acknowledgements of the message being received, and not explicit acknowledgements from, say, the Signify application that it has finished processing the event (which could be triggering other things to happen...). E.g. with something like MQTT you can get explicit acks.

Will check out how Firebase uses it.

SmithSamuelM commented 1 month ago

In an SSE + ReST architecture, the client talks to the host with explicit ReST API calls. So if the Host needs feedback from the client it is expecting it on an endpoint (put or post). So if you do it right it will operate effectively the same as a two way socket but without leaving http.

Clients have to establish all connections with http.

When the client reestablishes an SSE connection it can signal the server which event to resume on. So it doesn't have to use a separate ReST endpoint to synchronize events. But of course this means the client has to keep track of SSE events. If the client is depending on the server to keep track of events then it just takes a little more thought, but instead of jettisoning HTTP just to get push, you can do the push stuff with SSE and use ReST otherwise. You just have to wrap your mind around a bifurcated connection.
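A small sketch of the resume mechanism, assuming the server keeps an ordered event log. The `id:`, `event:`, and `data:` fields and the `Last-Event-ID` value a client sends on reconnect come from the SSE format; `EventLog`, `stream`, and the `operation` event name are hypothetical.

```python
from typing import Iterator, List, Optional, Tuple

Event = Tuple[int, str, str]  # (id, event name, data)


class EventLog:
    """Ordered, append-only log of events with increasing integer ids."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: str, data: str) -> int:
        eid = len(self._events) + 1
        self._events.append((eid, event, data))
        return eid

    def since(self, last_event_id: Optional[str]) -> Iterator[Event]:
        """Events after the id the client reported in Last-Event-ID, if any."""
        start = int(last_event_id) if last_event_id else 0
        return (e for e in self._events if e[0] > start)


def sse_frame(eid: int, event: str, data: str) -> str:
    """Serialize one event in the text/event-stream wire format."""
    lines = [f"id: {eid}", f"event: {event}"]
    lines += [f"data: {line}" for line in data.split("\n")]
    return "\n".join(lines) + "\n\n"  # a blank line terminates the event


def stream(log: EventLog, last_event_id: Optional[str] = None) -> str:
    """Body a server would write for a new or resumed SSE connection."""
    return "".join(sse_frame(*e) for e in log.since(last_event_id))
```

Because each frame carries an `id`, a reconnecting client only has to report the last id it saw to pick up where it left off.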

iFergal commented 1 month ago

Makes sense, so the acknowledgement is effectively on the REST side. It would be nice for acknowledgements to be event by event too, and not just resume from an event number, in case there was a particular event that couldn't be processed right then but other more recent events could. Thanks for the info!
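One way to picture per-event acknowledgement: the agent buffers each event until that specific event is acknowledged, with the ack arriving over a separate REST call. This is a sketch only; `Outbox` and its methods are hypothetical, and authenticating the acks is left out.

```python
from typing import Dict, List, Tuple


class Outbox:
    """Buffers events for a client until each one is individually acknowledged."""

    def __init__(self) -> None:
        self._pending: Dict[int, str] = {}
        self._next_id = 1

    def enqueue(self, data: str) -> int:
        """Store an event for delivery; works the same if the client is offline."""
        eid = self._next_id
        self._next_id += 1
        self._pending[eid] = data
        return eid

    def ack(self, eid: int) -> bool:
        """Handle an ack (e.g. from a REST call); repeating an ack is harmless."""
        return self._pending.pop(eid, None) is not None

    def unacked(self) -> List[Tuple[int, str]]:
        """Events to (re)deliver, oldest first, whenever the client connects."""
        return sorted(self._pending.items())
```

With this shape, a client can acknowledge a newer event while an older one stays pending and is redelivered later.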

WebOfTrust / keria

Event driven KERIA operations #290