NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As TD, I want to discuss and agree on a format and schema for the messages passed between the WRES and the event detection service #230

Open epag opened 3 weeks ago

epag commented 3 weeks ago

Author Name: Hank (Hank) Original Redmine Issue: 80679, https://vlab.noaa.gov/redmine/issues/80679 Original Date: 2020-07-09


As stated in the meeting, this ticket is to facilitate/track discussion and agreement on an interchange format. From the meeting notes, edited for clarity:

We want a binary wire format, maybe Protobuf. A human-readable option may be good as well. We did an evaluation of different options for WRES, and Protobuf was selected. WRES has some code now that may be reusable to “some degree”. It’s used in the context of canonical output. The WRES Team is migrating the software toward a different architecture that supports separation of different capabilities, such as pairing, metric computation, etc. WRES using Protobuf could be one factor in the decision of what to do for event detection. There might be value in abstracting the time-series format in Protobuf for the event detection work so that it could be reused by WRES and others.

James: Please relate this to the tickets that explored different options for the canonical format.

I've added watchers to this ticket. If discussions are held outside of this ticket, it would be good to summarize them here. Thanks,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2020-07-10T08:42:46Z


Thanks, Hank.

I also had a brief exchange with Austin out-of-band where I provided the link to #79741.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2020-07-10T08:57:07Z


I think it would be good to support a binary wire format. I think others expressed a desire to also support a text format, such as JSON. I don't necessarily understand that desire, but I think we agreed to additionally support a text format.

In terms of which format, I would advocate for either protocol buffers (v3) or cap'n proto. Within the wres, we have provisionally opted for protobuf, but cap'n proto has many strengths too. I should note upfront that I am a newcomer to binary wire protocols and messaging formats, so you should treat my opinions as guidance only.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2020-07-10T09:08:12Z


Regardless of message format, I do see value in defining a common abstraction of time-series that could be reused across projects and applications. It would be really nice to have a common abstraction in a portable message format. After all, time-series are integral to pretty much everything we do. That said, even if the ambition is for the event detection work only, you need to make a choice.

Two particular advantages of cap'n proto are worth noting.

First, it is a so-called "zero copy" format, unlike protobuf. It uses arena memory allocation, which means that each message composition occupies a contiguous block of memory. That allows it to take the same form in memory and on the wire, so there is no change of state (i.e., copying) as a message crosses the boundary between kernel memory space and the wire, in either direction; the message is simply dumped onto the wire or read off it. This saves memory and CPU cycles. There is similar scope for savings when crossing the boundary between kernel space and application/heap space.

Second, and perhaps less importantly, it has a much richer schema language than protocol buffers.

https://capnproto.org/language.html

For example, it supports interfaces and generics. This offers better scope to define a schema for a time-series that could accommodate any of the following types of time-series with a common interface:

  1. Time-series of real numbers that comprise single values;
  2. Time-series of real numbers that comprise an ensemble of values;
  3. Time-series of categorical outcomes that comprise single values;
  4. Time-series of categorical outcomes and associated probabilities of occurrence; and
  5. Time-series of parametric probability distributions (e.g., a Weibull distribution and associated parameter values for each time-series event).

All of these things are either currently needed or will be needed eventually, but probably only a subset will be needed for event detection (I would guess only 1 and 2). Still, we can also define these things in protocol buffers without a common interface, as sketched below.
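By way of illustration, here is a minimal sketch of how cases 1 and 2 might be covered in proto3 without a common interface, using a oneof; the message and field names are hypothetical, not a proposal:

syntax = "proto3";

package some.package;

import "google/protobuf/timestamp.proto";

message FlexibleEvent
{
    /*Hypothetical sketch: an event that holds either a single value or
    an ensemble of values for one valid time. A repeated field cannot
    appear directly inside a oneof, hence the wrapper message.*/

    google.protobuf.Timestamp valid_time = 1;
    // The valid time of the event.

    message Ensemble
    {
        repeated double members = 1;
        // The ensemble member values.
    }

    oneof value
    {
        double single_value = 2;
        // Case 1: a single-valued event.

        Ensemble ensemble = 3;
        // Case 2: an ensemble of values.
    }
}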

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2020-07-10T09:21:54Z


The main reason we've provisionally opted for protocol buffers within the wres is that they are simpler to use in pretty much every respect. For example, they are simpler to integrate into your code base (e.g., using gradle) and they are simpler to build incrementally. It is also harder to make stupid mistakes that obviate the performance advantages. Both formats have a schema language and a compiler, which provides language-specific bindings by generating source code from the schema. There are bindings for many languages (C++, Java, Python, etc.), but the canonical implementation is C++ in both cases.

With our own microbenchmarks, we found little performance gain with cap'n proto, but that will not necessarily apply to your application. Indeed, it probably won't. You can find some excellent threads by Kenton Varda (author of protobuf 2 and cap'n proto) in various places that strongly warn against microbenchmarking these formats. They really need to be compared at scale, which presents something of a catch-22.

( By the way, I am intentionally not discussing messaging architecture or protocols here, which are separate topics, but important. )

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2020-07-10T09:26:49Z


I will try to moot a schema for a time-series in protobuf by way of example.

Agreement about the abstraction is probably more important than agreement about the messaging format. I note that wrds also has an abstraction of a time-series, along with every other service that serves time-series. It's worth noting that nwis does not serve time-series per se, but rather collections of events (which is fine, because observations can reasonably be abstracted as a continuous time-series anyway, bounded artificially by the request).

( The latter is one reason that it's hard to cache/identify time-series from nwis: the request is essentially arbitrary/application-specific, i.e., the caller decides how to chunk collections of events for use/identification/re-use as "time-series". There is no natural time-series boundary. )

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2020-07-10T10:31:52Z


Here's a potential schema for discussion, formatted in protobuf for illustrative purposes. All fields in protobuf are optional, for forwards/backwards compatibility (no fields are null/empty and defaults are not sent on the wire).

A time-series:

The following abstraction would not accommodate time-series that are composed of events that are something other than (or more than) a single real number.

You may prefer a different nomenclature for "event" given the context, but event is the natural term, I think. When we talk about events in different contexts (hydrologic event detection, probabilistic event as a subset of a sample space, an event as an occurrence, a thing that happens etc.) there is inherently scope for confusion. Event in this context does not mean "hydrologic event" as it relates to hydrologic event detection (there is a many:1 relationship).

The `timeseries.proto` (a `google.protobuf.Timestamp` is an instant on the timeline, independent of timezone or calendar):

syntax = "proto3";

package some.package;

import "google/protobuf/timestamp.proto";
import "reference_time.proto";
import "time_scale.proto";
import "geometry.proto";

message TimeSeries
{
    // A time-series.

    repeated ReferenceTime reference_times = 1;
    // Zero or more reference times.

    message Event
    {
        // An abstraction of a time-series event, which composes a
        // single valid time and a corresponding real number value.

        google.protobuf.Timestamp valid_time = 1;
        // The valid time of the event.

        double event_value = 2;
        // The event value.
    }

    repeated Event event = 2;
    // Zero or more events, an empty time-series being possible.

    string variable_name = 3;
    // Variable name.

    string measurement_unit = 4;
    // Measurement unit associated with the time-series values.

    string series_name = 5;
    // The series name, such as the name attached to an ensemble 
    // trace.

    TimeScale time_scale = 6;
    // The time scale associated with the time-series values.

    Geometry geometry = 7;
    // The geometry from which the time-series originates.
}

The `reference_time.proto`:

syntax = "proto3";

package some.package;

import "google/protobuf/timestamp.proto";

message ReferenceTime
{
    /*A message that encapsulates a reference time and associated type, such
    as an issued time or a model initialization time (T0).*/

    enum ReferenceTimeType
    {
        // Type of forecast reference time

        UNKNOWN = 0;
        // An unknown reference time type.        

        T0 = 1;
        /*The time at which a model begins forward integration into a 
        forecasting horizon, a.k.a. a forecast initialization time.*/

        ANALYSIS_START_TIME = 2;
        /*The start time of an analysis and assimilation period. The model 
        begins forward integration at this time and continues until the forecast 
        initialization time or T0.*/

        ISSUED_TIME = 3;
        // The time at which a time-series was published or "issued".
    }

    google.protobuf.Timestamp reference_time = 1;
    // The reference time.

    ReferenceTimeType reference_time_type = 2;
    // The type of reference time.
}

The `time_scale.proto`:

syntax = "proto3";

package some.package;

import "google/protobuf/duration.proto";

message TimeScale
{
    /*A message that encapsulates the time-scale associated with a time-series 
    value.*/

    enum TimeScaleFunction
    {
        /*An enumeration of functions used to distribute the value over the 
        period.*/

        UNKNOWN = 0;
        MEAN = 1;
        TOTAL = 2;
        MAXIMUM = 3;
        MINIMUM = 4;
    }

    TimeScaleFunction function = 1;
    // The time-scale function.

    google.protobuf.Duration period = 2;
    // Period over which the value is distributed.
}

The `geometry.proto`:

syntax = "proto3";

package some.package;

message Geometry
{
    /*Elementary representation of a geometry with a spatial reference 
     identifier (srid) and well-known text (wkt) string.*/

    string wkt = 1;
    // The geometry string in wkt format.

    int32 srid = 2;
    // The spatial reference identifier.

    string name = 3;
    // User-friendly short name for the location (e.g., DRRC2).

    string description = 4;
    // User-friendly description of the geometry.
}
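
To make the proposed schema concrete, here is what a minimal single-valued time-series instance might look like in protobuf text format; all of the values here are invented for illustration:

# Hypothetical instance of TimeSeries in protobuf text format.
reference_times {
  reference_time { seconds: 1594339200 }  # 2020-07-10T00:00:00Z
  reference_time_type: T0
}
event {
  valid_time { seconds: 1594342800 }  # 2020-07-10T01:00:00Z
  event_value: 12.7
}
event {
  valid_time { seconds: 1594346400 }  # 2020-07-10T02:00:00Z
  event_value: 14.2
}
variable_name: "streamflow"
measurement_unit: "CMS"
time_scale {
  function: MEAN
  period { seconds: 3600 }  # PT1H
}
geometry {
  wkt: "POINT (-108.5 37.5)"  # invented coordinates
  srid: 4326
  name: "DRRC2"
}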

epag commented 3 weeks ago

Original Redmine Comment Author Name: Chris (Chris) Original Date: 2020-07-10T12:53:48Z


James wrote:

I think others expressed a desire to also support a text format, such as JSON. I don't necessarily understand that desire, but I think we agreed to additionally support a text format.

Integration costs and difficulty. If I'm putting together a formal, official OWP product, we want to go with the binary format for the obvious reasons. If I'm trying to write a one-off script or some tool off to the side, or I'm prototyping something to see if it's a reasonable idea, having the binary format be the only point of entry becomes a problem. Sure, the format provides forward and backward compatibility, but every time something changes, a line of communication will need to be opened in order to distribute and understand what's going on. All of a sudden, an effort that was intended to take an hour or two is taking days, since it's no longer "What is there to use?" but instead:

"Hey, can I get either the built messages or the schema to build them myself?" "Why?" "I'm investigating a hunch." "Who is asking you to do this?" "No one; I'm investigating a hunch." "Why are you doing it that way?" "I'm trying to see if this can be done." "We don't support that; don't do that." "I'm not making an official product; you all have information, and I want to see if I can leverage what you have for this short task." "Let's have a meeting to discuss this."

It also makes debugging easier. If you can output something like JSON, you can put together something in your sleep to intercept and read it during the dev process; otherwise, you need to prop up industrial machinery.

If the system is going to be hard coupled and only ever hard coupled (like `wres-io` and `wres-metrics`), going out of your way to provide a method of investigating intermodule communication is... just... why? Why would you do that? If they are truly separate elements, sticking to binary formats at the beginning will work, but you can easily screw yourself in the long run by designing yourself into a hole. If something that can't use the binary message format needs to integrate (like a browser), you now have to do funky stuff: have the browser redirect to the back end, make a request to the server, break apart the response, convert it to a browser-consumable format (JSON), ship it back, then debug any issues along the way, all instead of just sending a request from the browser and getting a response.

TL;DR: providing a human-readable response in addition to the binary (not the other way around) breaks the coupling not only to the COWRES specifically, but to the team and organization as well. This is important if the service is essentially just a serverless function. It sounds like protobuf and the other candidates support conversion into JSON, so supporting both doesn't sound like much of a problem. One thing to keep in mind is that JSON in Java is nine kinds of a pain in the butt. JSON is easier in C++ Boost, for god's sake. JSON in most other languages is a godsend. We're rapidly speeding into the future, so it's only a matter of time before JSON starts going the way of XML, but that's still in the future.
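To illustrate, proto3 defines a canonical JSON mapping, so a time-series message like the one James sketched could be rendered mechanically along these lines (field names become lowerCamelCase, Timestamps become RFC 3339 strings, and Durations become strings like "3600s"; the values here are invented):

{
  "referenceTimes": [
    { "referenceTime": "2020-07-10T00:00:00Z", "referenceTimeType": "T0" }
  ],
  "event": [
    { "validTime": "2020-07-10T01:00:00Z", "eventValue": 12.7 },
    { "validTime": "2020-07-10T02:00:00Z", "eventValue": 14.2 }
  ],
  "variableName": "streamflow",
  "measurementUnit": "CMS",
  "timeScale": { "function": "MEAN", "period": "3600s" },
  "geometry": { "wkt": "POINT (-108.5 37.5)", "srid": 4326, "name": "DRRC2" }
}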

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2020-07-10T12:59:48Z


Yeah, I probably shouldn't have mentioned it. My bad. I wasn't trying to open a can of worms. I just look at the tiny effort required to convert protobuf or cap'n proto into a text format and I wonder what you gain from a text format, especially if you're compressing it. It isn't as though it helps you inspect packets on the wire. But I am not going to argue the point. I accept our collective agreement that text will be an option and that text is de facto human readable, without a translation step.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Chris (Chris) Original Date: 2020-07-10T13:46:49Z


Oh, I wasn't trying to open a can of worms; I was just explaining the outside integrator's perspective on why the text format would be useful. It can be hard to word things so you don't sound like you've got a bone to pick or something; I'm in a pretty decent mood today, so I'm not trying to sound grumpy.

It sounds like we're all mostly on the same page: start with the binary format, while maybe leaving a path (not an implementation) to providing a more friendly format further down the line. I don't think anyone is necessarily clamoring for experimental integration just yet, so a JSON-first approach is somewhat asinine. I think the schema you posted is pretty solid. Personally, I'd start with that and just add to it if I encounter any pain points.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2020-07-10T13:51:31Z


Sounds good, Chris. I don't think it was you opening the can of worms, it was me with my snide remark. Also, I think several people expressed a desire for a text format, so I think that was the majority opinion anyway.

epag commented 3 weeks ago

Original Redmine Comment Author Name: arthur.raney (arthur.raney) Original Date: 2020-07-17T19:02:36Z


Thank you both for facilitating and adding to this conversation. It seems that we are now on the same page moving forward, namely that the non-binary format implementation is not related to the service's integration with WRES. Thanks for providing the schema, James; that's incredibly helpful. The design I've implemented thus far was much simpler, but I think your material is more robust and operational. Thanks again!

I have one point of clarification: what is the utility of the TimeScale message? The enum values MAXIMUM and MINIMUM threw me for a loop. I am certain that I am misunderstanding their usage; can you clarify?

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2020-07-17T19:12:13Z


Hey Austin,

Simple is also good, so it's just a suggestion. We don't necessarily need to start with all of it.

Basically, if you don't know the time scale associated with the events in a time-series, you don't know whether the time-series is an apple or a pear.

An example of a time scale is a "mean" streamflow over a "PT24H" period. The two quoted elements in that sentence define the time scale.

To generalize, a time scale is composed of a period or duration and a mathematical function that distributes the value over the period. Typical functions are: mean, minimum, maximum, median etc. The minimum and maximum are useful for temperature. For example, it is common to record the minimum and maximum temperature within some period (e.g., one day).

This is intended to be sufficiently general to apply to a variety of variables, not just stage or streamflow, and to be of value for things other than event detection. It may be that the time scale information is not used for event detection, at least to begin with, because the detection algorithm is scale-invariant (it just looks for peaks). However, the time scale is important when describing a time-series in general.
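Concretely, in the proposed TimeScale message that example would look something like this (protobuf text format, with the PT24H period expressed in seconds):

function: MEAN
period { seconds: 86400 }  # PT24H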

Does that make sense?

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2020-07-17T19:15:09Z


It may actually have helped if I had included `desiredTimeScale` in my training example walk-through. It is mentioned in the slides, however. Still, an oversight on my part. D'oh!

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2020-07-17T19:16:13Z


And, actually, it probably does have implications for the way you describe an event too.

For example, if the events are detected for streamflow amounts that represent averages over a PT24H period, does the event start when that period starts? I would say yes. For instantaneous data, there is no ambiguity. For data with a temporal scale larger than instantaneous, there is ambiguity. So I think the event descriptions probably need to record the time scale at which events were detected (and a protocol for what the start and end of an event mean w/r/t a non-instantaneous time scale) and, to do that, the time-series also needs to record it.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2020-07-17T19:19:53Z


Oh, one more thing. This is the simplest possible description of a time scale. Whether a time scale "starts at" or "ends at" a valid time or straddles the middle of one or something else is unspecified. I think the default assumption is that the time scale is the period that "ends at" the valid time, but this may not be universal across data sources (in fact, I know it isn't).

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jason (Jason) Original Date: 2020-07-17T19:30:17Z


Time scale is absolutely necessary for event detection. Time scale directly influences the responses to physical processes we can resolve as "events" from some time series. Imagine trying to resolve passenger vehicles in a satellite image with a spatial resolution of 1 km.


epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2020-07-17T19:38:56Z


OK, glad to hear that, Jason - thanks.

I did wonder whether you might start by serving events for instantaneous time-series only, for example. In that case, we could probably defer the description. But I don't think we should.

epag commented 3 weeks ago

Original Redmine Comment Author Name: arthur.raney (arthur.raney) Original Date: 2020-07-17T20:54:04Z


Thanks for the explanation. I completely understand and see the inherent need for a data-resolution-like parameter. I think the name of the message threw me.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2020-07-17T21:22:05Z


I see. I think there are a few different terms that overlap. I think time scale or temporal scale and space scale or spatial scale are probably the most useful/generally applicable terms, but I am open to other opinions on that. The term "support" is probably more common among mathematicians/statisticians and is used widely in geostatistics. I think "resolution" is common in GIScience. I guess my only objection to resolution is that it can be used to refer to the frequency of measurements, i.e., the spacing between times or locations, which has nothing to do with scale, even if the frequency and the period associated with the scale are often the same. For example, daily average values are often reported once per day (not overlapping), but they could be reported once every 5 days. The scale here is a one-day mean; the frequency is 5 days.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2020-07-17T21:29:46Z


Control volume is another one, more from fluid mechanics/physics, I think.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jason (Jason) Original Date: 2020-07-20T12:25:16Z


Since we're working in the overlap between several sciences, I sometimes use resolution, scale, support, frequency, and sample rate interchangeably in conversation. I have no strong opinions about the terminology, so long as it's used consistently.

In the context of fluids and thermodynamics, I think of "control volume" as the spatial analog to a period of observation. You could sample points or vertices in space (i.e. mesh) to produce a field of control volumes (computational elements). These control volumes have a characteristic size that is the spatial resolution or scale (of the model, observation space, process, etc.). "Control volume" also has other physical characteristics beyond what may be implied by the terms mentioned above.

With regard to the last example, I would use a measurement analogy. If a single daily averaged value is reported every 5 days, then the sample time or sample period is 1 day and the sample rate or frequency is 5 days. From a statistical point of view, I might describe the "scale" as 1 day. However, if I were trying to extract features or patterns from a signal with a frequency of 5-days, I might refer to the "scale" of those features as "greater than 5-days".


epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2020-07-20T12:42:25Z


Jason wrote:

I have no strong opinions about the terminology, so long as it's used consistently.

Same here. Unless someone else has a strong objection to "time scale", I propose we stick with that, but I am open to whatever makes it easier to communicate among us, and I think we'll need to accept that in this area, as in many others, nomenclature is not fixed/universal. FWIW, we use "time scale" in the wres.

epag commented 3 weeks ago

Original Redmine Comment Author Name: arthur.raney (arthur.raney) Original Date: 2020-07-20T15:46:33Z


Having heard the above, I am in agreement with you, James -- I will stick with your suggested naming convention. As you pointed out, this will help with consistency across products, which is important. To speak to your comments, I was thinking of "resolution" and "scale" as analogous terms, as they are often used in GIScience. The distinction between frequency and data scale absolutely makes sense -- thanks for offering examples and an explanation. Coming from a GIS/remote sensing background, usage of the term "resolution" is something that we as a community need to work on.

To continue the conversation about schemas, we will soon need to start thinking about a return message schema -- at least that's my thinking. I will peek at the WRES protos for inspiration, but this may be an area where we lean more on Jason for guidance. I say that because I am not sure of future event-detection work and the fields/results that such a product might return.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2020-07-20T16:14:21Z


arthur.raney wrote:

To continue the conversation about schemas, we will soon need to start thinking about a return message schema -- at least that's my thinking. I will peek at the WRES protos for inspiration, but this may be an area where we lean more on Jason for guidance. I say that because I am not sure of future event-detection work and the fields/results that such a product might return.

Agreed. I think there are two parts to this: 1) the event description payload; and 2) the message metadata, which is connected to the chosen protocol (e.g., AMQP over TCP, but I think Nels was keen on websockets over TCP, and I am not familiar with how metadata is layered onto the websockets protocol).

I am assuming that there are N events per one time-series and that each event has a start datetime and an end datetime on the UTC timeline (because that is what the time-series prescribes too). I am further assuming that we have a rule for how each datetime relates to the time scale of the data (e.g., the start or end of the period over which the value is distributed).

Additionally, I think we'll need to pass through all of the metadata that describes the time-series, because that also describes the events.

Arguably, we might want to repeat back the time-series that generated the events too (as part of the metadata), so that there is an unambiguous connection between the two. However, that would increase the size of the payload. Alternatively, we could use a (practically) unique identifier for the content, such as a hash of the content. By definition, the content is what determines the events.

We will also need to think about the message metadata (the stuff outside the payload), but that also depends on the protocol. For example, I think we'll want a job identifier of some kind, in order to tie the request to the response.
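
To seed that discussion, here is a rough, non-authoritative sketch of what a response message might look like in proto3, following the assumptions above; every message and field name here is hypothetical:

syntax = "proto3";

package some.package;

import "google/protobuf/timestamp.proto";
import "time_scale.proto";
import "geometry.proto";

message EventDetectionResponse
{
    /*Hypothetical sketch of a response containing the events detected
    within one time-series, plus enough metadata to tie the response
    back to the request and to the source data.*/

    string job_id = 1;
    // An identifier that ties the response to the request.

    string time_series_hash = 2;
    // A (practically) unique identifier for the source time-series,
    // such as a hash of its content.

    message DetectedEvent
    {
        google.protobuf.Timestamp start_time = 1;
        // The start of the event on the UTC timeline.

        google.protobuf.Timestamp end_time = 2;
        // The end of the event on the UTC timeline.
    }

    repeated DetectedEvent events = 3;
    // Zero or more detected events.

    TimeScale time_scale = 4;
    /*The time scale at which the events were detected, subject to an
    agreed rule for how start_time and end_time relate to the period.*/

    string variable_name = 5;
    string measurement_unit = 6;
    Geometry geometry = 7;
    // Metadata passed through from the source time-series.
}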