NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a user, I want a messaging service that exposes evaluation status and statistics messages for an ongoing evaluation #204

Open epag opened 3 weeks ago

epag commented 3 weeks ago

Author Name: James (James)
Original Redmine Issue: 61930, https://vlab.noaa.gov/redmine/issues/61930
Original Date: 2019-04-02
Original Assignee: Evan


Expected behavior:

Given an evaluation task that produces status information for human consumption, such as warnings emitted and assumptions made, when I execute that evaluation, then I expect to see those messages in a friendly format, as they are emitted.

Additionally, the same format could be used to convey exceptions, which stop the application.

Additionally, the same format could be used to convey information about the progress of an evaluation.

Each of these separate flavors of user-focused messages could be associated with a separate channel.
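
As a rough illustration of the channel idea, here is a minimal Python sketch. The names `Channel`, `UserMessage` and `ChannelRouter` are hypothetical and do not correspond to anything in the WRES codebase. The only point is that a client subscribes to the flavors of message it cares about and receives nothing else.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Callable, DefaultDict, List


class Channel(Enum):
    """Hypothetical channels, one per flavor of user-focused message."""
    WARNING = "warning"
    ERROR = "error"
    PROGRESS = "progress"


@dataclass(frozen=True)
class UserMessage:
    evaluation_id: str  # every message belongs to exactly one evaluation
    channel: Channel
    text: str


Callback = Callable[[UserMessage], None]


class ChannelRouter:
    """Delivers each message only to the subscribers of its channel."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[Channel, List[Callback]] = defaultdict(list)

    def subscribe(self, channel: Channel, callback: Callback) -> None:
        self._subscribers[channel].append(callback)

    def publish(self, message: UserMessage) -> None:
        # Copy the list so a callback may subscribe others while we iterate.
        for callback in list(self._subscribers[message.channel]):
            callback(message)
```

A GUI that only displays warnings would call `subscribe(Channel.WARNING, ...)` and would never see progress messages.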

Actual behavior:

We currently rely on logging to convey information intended for both developers and users. However, logging is a tool for developers, not users; logging is not an appropriate mechanism to convey information to users about assumptions made by the system or warnings that a user should consider.

Separately, exceptions are propagated and displayed to a user with a string representation of the exception stack that contains superfluous information from a user's perspective (but is useful for developers).

We do not currently provide a service, with an associated API, that allows a user-facing client, such as a GUI, to subscribe to information (or multiple channels of information) about a particular evaluation task.

Implementation notes:

Implementation notes are not instructions, but items for discussion and opinions of the ticket author about what could make sense. See the ensuing discussion for more.

It is assumed that most - probably all - information intended for users is associated with a particular evaluation. An evaluation is a unique execution of a project declaration and its associated data. If a more general failure occurs that is independent of a particular evaluation, it is anticipated that the user would see a more generic message from the service (i.e. a 5xx HTTP error code with generic information), and detailed information would be conveyed to developers via logging and other monitoring.

Thus, the starting point for any user-facing messaging service is that each message is connected to an evaluation, which is uniquely identified as a computational instance (uniquely identified across all components/microservices).

The underlying assumption is that an asynchronous pub-sub architecture is the correct architecture for messaging, supported by a messaging API. The user-facing service will sit on top of that, and the correct protocol to support client subscriptions is probably XMPP (asynchronous), rather than HTTP (synchronous request/response). However, HTTP is probably simpler to begin with, and is advocated by at least one developer as a starting point (e.g. #61855-32). XMPP is inherently better suited to pub/sub style communication. There may be various workarounds for HTTP, such as polling by the client (not good), a long-running GET (not good) and a callback/WebHook (perhaps), but these will be workarounds, I believe.

I am tentatively marking this ticket as blocked by #40271 (see #61855-27). In principle, this ticket is orthogonal to #40271, but #40271 is necessary for the correct operation of the WRES as a long-running service, and the ultimate aim is to support pub-sub style communication between a client and a long-running WRES instance (via a long-running messaging service instance in a separate process from the long-running WRES instance).


Related issue(s): #140
Redmine related issue(s): 54248, 58715, 61855, 64542, 67088, 73851, 75111, 77758, 80608, 81764, 85506, 85743, 93057, 96337, 99964, 100560, 113677, 119337, 122191, 123635, 129698


epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2019-04-02T12:24:45Z


Some discussion in #61855-27 through #61855-33.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Chris (Chris) Original Date: 2019-04-02T14:11:13Z


I don't know how much this helps, but the WRESRunnable and WRESCallable come equipped with rudimentary event handlers for start up and shut down. That shutdown is currently being used for the goofy "progress monitor".

Regardless of the direction, I'd love to be able to a) tie the "instance identifier" to the job itself (i.e. the job ID through the service used with @curl@) and b) have something that I can poll/query/whatever to show "Hey, the application isn't frozen/deadlocked; progress is being made".

I am completely unfamiliar with XMPP and I don't even know where to start figuring things out by looking through their website. Is there some way I can tie the web application to it to get semi-live updates from the application? While polling via HTTP may not be ideal, it is the super obvious/naive solution for updating the client.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2019-04-02T14:23:01Z


I agree that polling is a solution of sorts, but it's ugly, and I would prefer to get the architecture right sooner, rather than later, because we tend to keep quick fixes indefinitely.

I think the architecture is basically wrong with polling. It's a time/interval-based pull model. What we want is an event-driven push model, i.e. when an event occurs in the application it pushes some information that triggers a sequence of events that end in the subscribing client being notified about a new message on a given channel. Beyond that, I don't actually care about protocols and other implementation details, if that makes sense. HTTP/BOSH is another thing to look at. I mentioned XMPP because I've come across it before, but we will need to work out the implementation details. The architecture is what I am talking about first.
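
To make the push model concrete, here is a minimal in-process sketch in Python using `asyncio`. It is only an analogy for the architecture, not a proposal for the protocol: the function names and the step names are invented for illustration. The subscriber never checks on a timer; it is woken only when the evaluation pushes an event onto the queue.

```python
import asyncio
from typing import List, Optional


async def evaluation(queue: "asyncio.Queue[Optional[str]]") -> None:
    """Hypothetical evaluation that pushes an event as each step finishes."""
    for step in ("ingest", "pairing", "statistics"):
        await asyncio.sleep(0)  # stand-in for real work
        await queue.put(f"finished {step}")
    await queue.put(None)  # sentinel: no more events


async def subscriber(queue: "asyncio.Queue[Optional[str]]", received: List[str]) -> None:
    """Waits for events; there is no polling interval anywhere."""
    while True:
        message = await queue.get()  # suspended until an event is pushed
        if message is None:
            break
        received.append(message)


async def main() -> List[str]:
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue()
    received: List[str] = []
    await asyncio.gather(evaluation(queue), subscriber(queue, received))
    return received
```

Running `asyncio.run(main())` returns the three events in the order they were pushed. A polling client would instead have to ask "anything new?" at some interval and would either waste requests or see messages late.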

Still, I think we have the same desire w/r to the end state, namely a straightforward mechanism for a client to be notified of a new message of a given type that can be displayed to a user, with all aspects of that being as simple as possible, but with a sensible architecture.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Chris (Chris) Original Date: 2019-04-02T14:49:25Z


I'm just thinking on the client/javascript side of things. There's plenty of pub/sub tutorials about handling it within the client itself, but not much for attaching to outside services. I'm sure we can figure it out, though.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2019-04-02T14:52:35Z


Given the difficulty in adoption of HTTP in OWP, I think we have to keep things super simple if people are going to use the software. As a software engineer, I would much rather code against an HTTP service than have to learn all the internal details of the software just to get some information from the software. Even if we go the messaging route, for external (non-WRES-team-coded) clients we still will want to wrap it in HTTP.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2019-04-02T14:53:47Z


AMQP makes sense, I could maybe be convinced to use XMPP as well. But I think AMQP is going to be more familiar to developers.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2019-05-07T13:55:30Z


Regarding the @WARN@ vs @ERROR@ discussion toward the end of #63408, from #63408-24 through #63408-31, with this ticket in mind:

There are places in the code where logging a message can be viewed as a placeholder for more formal mechanisms in the future. I think the validation stuff fits in there.

As for whether to push an event or synchronously return and gather all events, I think both are appropriate but I think the official or canonical answer at the end of the evaluation should be from the synchronous return of the evaluation, with the side-effects of pushed events being extra, kind of like logging is today. A logged message implicitly says "I'm doing a thing, I'm making progress", and an event being pushed in the future could say the same.

I recommend against the use of side effects as a primary mechanism for gathering up the results of an evaluation, but I am not against the use of those side effects in providing live feedback. In other words, we can treat the side-effect event as an early-preview carbon-copy of the official message that will come at the end of the evaluation.
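
A minimal sketch of that split, in Python, might look like the following. The function `run_evaluation` and its `on_event` parameter are hypothetical. The returned list is the canonical record, and the optional listener receives an early copy of each event as it happens.

```python
from typing import Callable, Iterable, List, Optional


def run_evaluation(
    steps: Iterable[str],
    on_event: Optional[Callable[[str], None]] = None,
) -> List[str]:
    """Runs the (hypothetical) steps and returns every event at the end."""
    events: List[str] = []
    for step in steps:
        event = f"completed {step}"
        events.append(event)  # canonical record, returned synchronously
        if on_event is not None:
            try:
                on_event(event)  # side effect: live preview for a listener
            except Exception:
                # A broken listener must not change the official result.
                pass
    return events
```

With this shape, a caller that ignores the live feed still gets the complete set of events, and a listener that fails cannot corrupt the result of the evaluation.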

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2019-06-26T11:54:21Z


Note #64542-20.

The idea of this ticket is not only about the servicing of messages to users, but about a canonical format for relaying runtime messages, regardless of content (warnings, exceptions, assumptions). In other words, I want a canonical format for user-focused runtime messages, as well as data (pairs, statistics) outputs. These two things may or may not use the same technologies. These two things will probably need to be disaggregated into separate tickets in due course, one about the message format, another about the service API aspects.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-05-05T13:03:00Z


This ticket is intended to form part of a broader messaging api, rather than an upfront validation function, but it is worth noting that the @DeclarationValidator@ now provides a top-level validation function that returns all the early evaluation status events (schema and business logic validation) in a canonical format, @EvaluationStatusEvent@ (#113677-186). More generally, the @EvaluationStatusEvent@ is the type of evaluation status message that the messaging api will need to use to communicate with subscribers.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-05-05T16:31:56Z


I think this is the best ticket to post this in...

Chris chatted me this idea:

Hey, I just got an alert about an update on ticket #81764 and that reminded me - you guys should look into websockets and redis for messaging. It's something I got together for DMOD. You either connect to a websocket view or subscribe to the redis channel and you can get messages in real time and not disrupt an eval. Your connection can drop and all sorts of craziness can happen and you can still get up to date info on what's happening in the eval without having to rely on the logs. Would make long evals waaaay more convenient to check in on since you'll be able to see things in a more controlled fashion and you can have stuff like debug messages logged without tripping up or confusing users.

Stuff your messages into JSON and now a user can filter messages by event/cause and such.
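
For illustration only, filtering JSON messages by field could be as simple as the Python sketch below. The field names `level` and `cause` are assumptions made for the example, not the fields of any existing WRES or DMOD message.

```python
import json
from typing import Iterable, Iterator


def filter_messages(lines: Iterable[str], **criteria: str) -> Iterator[dict]:
    """Yields the decoded messages whose fields match every criterion."""
    for line in lines:
        message = json.loads(line)
        if all(message.get(key) == value for key, value in criteria.items()):
            yield message
```

A client interested only in warnings would call `filter_messages(stream, level="WARN")`, while a developer could ask for `level="DEBUG"` on the same stream.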

Redis can be readily used for message publishing, but we would need some way to get the message from the worker and worker-shim to Redis. Presumably that would be via the broker and tasker, much like how current job metadata is handled.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-05-05T16:49:47Z


We already have a low-level messaging api that uses the amqp protocol, so any subscriber that wants to listen to evaluation status messages about an ongoing evaluation can already do that by subscribing to our eventsbroker but, in practice, this broker isn't actually exposed to external subscribers, just to wres client apps. This is partly because we haven't secured the amqp traffic. Regardless, the low-level stuff is all in place, including the message format itself, which allows for similar alert levels to logging levels.

Mechanically speaking, if we wanted to further expose to other middleware/protocols, I think we'd need a shim of some kind, like an additional eventsbroker subscriber that further published the evaluation status messages to a redis channel, flowing onwards to whatever subscribed to that channel. We'd probably want to keep those messages in our canonical/protobuf format. They're very efficient and easy enough to work with across many languages, as well as serialize to json.
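
A rough Python sketch of such a shim follows. Everything in it is hypothetical: `make_forwarder`, the `evaluation_id` field and the channel naming scheme are invented for the example, and the `publish` callable stands in for whatever client the target middleware provides. JSON is used here only to keep the sketch free of dependencies; as noted above, the real messages would probably stay in the protobuf format.

```python
import json
from typing import Callable, Mapping

# (channel name, serialized payload) -> None; supplied by the caller.
Publish = Callable[[str, str], None]


def make_forwarder(
    publish: Publish, prefix: str = "evaluation-status"
) -> Callable[[Mapping[str, str]], None]:
    """Builds a handler that republishes each status event it receives."""

    def on_status_event(event: Mapping[str, str]) -> None:
        # One downstream channel per evaluation, so subscribers can be
        # scoped to a single evaluation rather than a firehose.
        channel = f"{prefix}.{event['evaluation_id']}"
        publish(channel, json.dumps(dict(event), sort_keys=True))

    return on_status_event
```

The handler returned by `make_forwarder` would be registered as one more subscriber on the events broker, so the evaluation itself would not need to know that the downstream channel exists.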

It's "just" a case of finding the time/resources to join some dots.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-02-07T20:05:02Z


Something to clarify here is that this ticket isn't just about status messages, but also statistics messages. Example: gis user sees evaluation statistics appear on their screen, location-by-location, as they are completed in real time.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-02-07T20:26:26Z


I think this looks like the first ticket I would address for this work. Going to look into setting up AMQP channels/broadcast in the evaluator for output to start off, since that will also address the goal of #122191.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-02-07T20:33:03Z


I will say that this is a very ambitious ticket, Evan. There is no estimate but it is probably in the many tens of hours. For example, if we are broadcasting amqp messages directly then the amqp traffic will need to be secured with tls. Currently, only the broker monitor is secured because the amqp traffic is not exposed for public consumption. The other consideration is whether we want to expose the amqp traffic directly as an advanced api or to use some intermediary, such as exposing amqp over a websocket so that it can be consumed by a browser or similar. Dealing directly with amqp places a fairly high bar for service integrators downstream, although it has some advantages too. Another downside is that we should not broadcast all evaluation messages in a firehose. A user should really only be able to access their own messages by evaluation id. Obviously, if we expose the amqp traffic directly, they can sniff all messages across all evaluations.
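
The "own messages only" requirement can be sketched in Python as a filter that sits between the message stream and the user. The names here are hypothetical, and the `owners` mapping stands in for whatever record ties an evaluation id to the user who submitted it. This shows only the scoping rule; it is not a substitute for securing the transport.

```python
from typing import Dict, Iterable, Iterator, Tuple

Message = Tuple[str, str]  # (evaluation id, message text)


def messages_for_user(
    user: str, owners: Dict[str, str], stream: Iterable[Message]
) -> Iterator[Message]:
    """Yields only the messages of evaluations that belong to the user."""
    for evaluation_id, text in stream:
        # Unknown evaluation ids are dropped rather than leaked.
        if owners.get(evaluation_id) == user:
            yield evaluation_id, text
```

An intermediary that applies a rule like this is one reason to prefer a shim over exposing the raw broker traffic.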

Bottom line, I think there is a lot more work here than first appears. Do you want to get buried in this for many weeks? We should discuss as a team on Thurs.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Evan (Evan) Original Date: 2024-02-07T20:34:33Z


Sounds good, I'll find something to keep myself busy today and tomorrow and leave this for discussion tomorrow.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-02-07T20:36:55Z


I had originally envisaged that we would do this work when a downstream user had a pressing use case, like real-time consumption of evaluation status or, indeed, statistics. The closest thing we have to that requirement now is the gui - it would be nice for a user to have something better than logging - but this is probably not a super high priority and it would also require some fundamental work under the hood to ensure that, in (some of) the places where logging is currently used, messages are broadcast too. Currently, the messaging api is all post ingest, so there is a good chunk of an evaluation that isn't using the messaging api at all.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-02-07T20:39:23Z


Evan wrote:

Sounds good, I'll find something to keep myself busy today and tomorrow and leave this for discussion tomorrow.

I should be able to take a look at the amqp client stuff tomorrow and hopefully get that into a place where you can add more clients relatively easily.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2024-03-13T17:22:08Z


Another example of the need to better expose warnings to users in #123635-191.