FraunhoferIOSB / FROST-Server

A Complete Server implementation of the OGC SensorThings API
https://fraunhoferiosb.github.io/FROST-Server/
GNU Lesser General Public License v3.0

Is it possible to get the average of observation results within a time interval? #26

Open tobi238 opened 6 years ago

tobi238 commented 6 years ago

I'm looking for a query option to aggregate the result values of multiple observations within a time interval. For example, I would like to get the average observation result value of all observations from a specific day.

I would think of a query looking something like this:

http://localhost:8080/SensorThingsServer-1.0/v1.0/Datastreams(123)/Observations
?$select=avg( result )
&$filter=during( phenomenonTime, 2017-10-25T15:00:00.000Z/2017-10-26T15:00:00.000Z )

Is it possible to do something similar to that?

Many thanks in advance, Tobias

hylkevds commented 6 years ago

Unfortunately, aggregation functions are not part of the SensorThings API. Time functions (like during) are under discussion: https://github.com/opengeospatial/sensorthings/issues/2

JoeThunyathep commented 6 years ago

Hey @tobi238, I am doing the same thing by using $filter=phenomenonTime ge .... and phenomenonTime le ...., then from the response you get the "@iot.count". Then you can simply write the code to sum up all your results and divide by "@iot.count". :)
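In Python, that workaround might look like the following minimal sketch (assuming numeric results; the base URL, Datastream ID, and time window are placeholders from this thread, and paging follows the standard @iot.nextLink mechanism):

```python
import requests

BASE = "http://localhost:8080/SensorThingsServer-1.0/v1.0"
url = f"{BASE}/Datastreams(123)/Observations"
params = {
    "$filter": "phenomenonTime ge 2017-10-25T15:00:00.000Z"
               " and phenomenonTime le 2017-10-26T15:00:00.000Z",
    "$select": "result",
    "$count": "true",  # ask for the inline @iot.count explicitly
}

total, count = 0.0, None
while url:
    body = requests.get(url, params=params).json()
    count = body.get("@iot.count", count)          # total number of matches
    total += sum(o["result"] for o in body["value"])
    url, params = body.get("@iot.nextLink"), None  # nextLink carries the query
print("average:", total / count)
```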

tobi238 commented 6 years ago

Hey @Thunyathep, yes, this is working. To save some processing time and bandwidth, it might be a good idea to also add $top=0, so you don't get all observations if you request data for a longer time interval. The full request looks like this:

http://localhost:8080/SensorThingsServer-1.0/v1.0/Observations?$filter=during(%20phenomenonTime,%202017-10-20T15:00:00.000Z/2017-10-26T15:00:00.000Z%20)&$top=0
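As a hedged Python sketch, the count-only variant could look like this ($count=true is added explicitly here, since the inline count is not guaranteed by default):

```python
import requests

resp = requests.get(
    "http://localhost:8080/SensorThingsServer-1.0/v1.0/Observations",
    params={
        "$filter": "during( phenomenonTime,"
                   " 2017-10-20T15:00:00.000Z/2017-10-26T15:00:00.000Z )",
        "$top": "0",       # no observation payload
        "$count": "true",  # only the inline count
    },
)
print(resp.json()["@iot.count"])
```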

riedel commented 5 years ago

As I think aggregation support is crucial in the long run, and it is specified by an OData extension, I opened an issue on it: https://github.com/opengeospatial/sensorthings/issues/71

hylkevds commented 5 years ago

You could make a prototype to see if it works as expected. A demonstration of functionality helps a lot when trying to get things standardised.

riedel commented 5 years ago

We were discussing today if we can make it a thesis at the university. However, before any implementation there should be at least some rough consensus about possible syntax and semantics, IMHO.

After discussion with my colleagues, I would argue, in addition to the normal average, for a non-standard timeseries average with two arguments: value and interval/timestamp. This is particularly necessary to average observations with variable phenomenonTime intervals.

However it is unclear to me how to denote this using the OData Extension for Data Aggregation even with custom functions.

Also, functions applied before grouping seem not to be part of the standard. My intuitive definition of a standard timeseries aggregation would be:

https://example.com/v1.0/Datastreams(1)/Observations
?$filter=during( phenomenonTime, 2017-10-25T15:00:00.000Z/2017-10-26T15:00:00.000Z )
&$apply=groupby(time_bucket(phenomenonTime,'10m'), aggregate((result,phenomenonTime) with st_average as average))

or, in the simple example above:

http://localhost:8080/SensorThingsServer-1.0/v1.0/Datastreams(123)/Observations
?$apply=aggregate((result, phenomenonTime) with st_average as avg)
&$filter=during( phenomenonTime, 2017-10-25T15:00:00.000Z/2017-10-26T15:00:00.000Z )

or, if you consider only point measurements:

http://localhost:8080/SensorThingsServer-1.0/v1.0/Datastreams(123)/Observations
?$apply=aggregate(result with average as avg)
&$filter=during( phenomenonTime, 2017-10-25T15:00:00.000Z/2017-10-26T15:00:00.000Z )

The last one would be the simplest to implement, but it actually only makes sense for periodic or random point sampling, although I believe this covers 80% of all practical cases...
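To illustrate what the proposed time_bucket grouping might compute for point measurements (time_bucket and $apply are hypothetical here, not implemented in SensorThings), a client-side Python sketch of a 10-minute bucketed average:

```python
from collections import defaultdict
from datetime import datetime, timezone

def bucket_start(ts: datetime, minutes: int = 10) -> datetime:
    # Truncate a timestamp to the start of its bucket, like time_bucket(ts, '10m').
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)

def bucketed_average(observations):
    # observations: iterable of (phenomenonTime, result) pairs
    buckets = defaultdict(list)
    for t, x in observations:
        buckets[bucket_start(t)].append(x)
    return {b: sum(xs) / len(xs) for b, xs in sorted(buckets.items())}

utc = timezone.utc
obs = [
    (datetime(2017, 10, 25, 15, 2, tzinfo=utc), 1.0),
    (datetime(2017, 10, 25, 15, 8, tzinfo=utc), 3.0),
    (datetime(2017, 10, 25, 15, 12, tzinfo=utc), 5.0),
]
print(bucketed_average(obs))  # 15:00 bucket -> 2.0, 15:10 bucket -> 5.0
```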

st_average I would define as the linear interpolation between intervals and points. Alternatively, only intervals could be counted; then any aggregation of single values would be ignored. Or only single values get linearly interpolated.

Actually, some further magic is needed for during that cuts interleaving intervals internally; the same should be done for time_bucket, so that a phenomenonTime of 2017-10-24T23:00:00.000Z/2017-10-25T23:00:00.000Z would be cut to 2017-10-25T15:00:00.000Z/2017-10-25T23:00:00.000Z internally before aggregation.
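A small sketch of that cutting step (plain interval intersection; the window below is the filter interval from the example):

```python
from datetime import datetime, timezone

def clip(start, end, win_start, win_end):
    # Intersect [start, end) with [win_start, win_end); None if disjoint.
    s, e = max(start, win_start), min(end, win_end)
    return (s, e) if s < e else None

utc = timezone.utc
print(clip(
    datetime(2017, 10, 24, 23, tzinfo=utc), datetime(2017, 10, 25, 23, tzinfo=utc),
    datetime(2017, 10, 25, 15, tzinfo=utc), datetime(2017, 10, 26, 15, tzinfo=utc),
))  # -> 2017-10-25 15:00 / 2017-10-25 23:00, as described above
```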

This actually seems a bit ugly. BTW: during, as I understand it, would also be wrong in the above example, because it doesn't get the first value correctly if it is an interval; shouldn't it be anyinteracts to calculate the average correctly?

Because this is a little bit difficult to define consistently, I would rather do a definition first. I know it is common these days to do de-facto standardization by taking demos as input to standards, but I do not like this. I would rather have a consistent definition that can be the basis for an implementation.

hylkevds commented 5 years ago

This gets complicated quickly :) Some small questions for understanding:

First, taking a little step further back: In your brainstorm examples, we're doing queries on the Observations collection, but what are we getting back? I guess virtual Observations that do not have an ID, but only result and phenomenonTime, and are formatted the same otherwise? Or would it be a completely differently formatted result set?

If I understand correctly, the first example (with the time_bucket function) would also work without a filter. In that case it would return an entire Datastream, with 10-minute average result values. In contrast, the second and third examples would require the user to do a separate request for each 10-minute interval.

riedel commented 5 years ago

The return format is defined by the OData Data Aggregation specification.

Yes, the first one works without filter. The second and third would be rather meaningless without the filter, but should theoretically also work without.

The complicated part only arises if you consider time intervals (this could be a second step). I like that SensorThings supports intervals, but it makes things rather complicated (e.g. the IMHO wrong use of during in the example above shows that most users don't think about intervals in the first place). I realized that I actually know of no timeseries database that uses them. I now grasp why... (although they are theoretically great).

The aggregation standard without intervals seems to be implemented by some Microsoft and SAP services, e.g. https://docs.microsoft.com/en-us/azure/devops/report/extend-analytics/aggregated-data-analytics?view=azure-devops. The basic functionality is also implemented in Olingo.

hylkevds commented 5 years ago

MultiDatastreams will also be an issue, since there result is an array of values, as will all other cases where result is not a simple numeric value.

Of course, since time series databases are optimised for a very specific function, and OData is an interface for generic relational databases, they are on two opposite ends of a spectrum. Do we need to be able to aggregate Things or ObservedProperties? I'm starting to think it might be better to come up with a completely new definition rather than trying to use the OData aggregation specification. I don't really see many users building those aggregation queries...

An initial list of features to consider:

riedel commented 5 years ago

My gut feeling is that it would be nice to stay/become compliant with OData, because of tool support (Olingo and .NET provide client libraries that will only work in a compliant setting). I also think a lot less documentation is needed if the syntax stays compliant.

In a fully OData-compliant setting you can specify which properties you allow to be aggregated (this requires metadata support, however). I do see the problem, though, that you are right: some time-series stuff is a bit awkward.

riedel commented 5 years ago

IMHO the "best" semantics for an average is actually sum(length(interval_i) * x_i) / sum(length(interval_i)), where length(interval) should default to epsilon for a timestamp and to a unitless interval length for any real interval. (If there are only epsilons, the denominator would be count() * epsilon and the numerator sum(epsilon * x), so the epsilons cancel and you get the standard average.)

Practically, you first look for any intervals in the aggregate; if there are some, you use a weighted sum of those, and if there are only timestamps, you use a standard average.
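A minimal Python sketch of these proposed st_average semantics (st_average is the hypothetical aggregate from this thread, not an existing API; interval lengths are in arbitrary units, and None marks a plain timestamp):

```python
EPSILON = 1e-9  # default weight for timestamps, negligible next to real intervals

def st_average(observations):
    # observations: iterable of (result, interval_length or None) pairs
    num = den = 0.0
    for x, length in observations:
        w = length if length is not None else EPSILON
        num += w * x
        den += w
    return num / den

print(st_average([(2.0, 600.0), (4.0, 1200.0)]))  # interval-weighted -> 3.333...
print(st_average([(2.0, None), (4.0, None)]))     # timestamps only: epsilons cancel -> 3.0
```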

We first thought of using linear interpolations (this could be a second custom function); however, this gets really awkward for overlapping intervals. Actually, to be honest, the result in the overlap case above is well-defined, but it does not make too much sense. For rolling averages, the correct average function would need to weight all overlaps by the number of overlapping measurements. This is normally no problem, because you have an equal number of overlaps in most use cases (and can thus use the function above). But I have to admit that there are pathological cases. Still, I think an API only needs to be well-defined ;) .

Looking further at the OData Aggregation specification: beyond custom aggregation functions (3.1.3.6), one could actually easily define custom aggregates (6.2.3 Custom Aggregates), like a time-weighted average.

You can even force that the average is always used in the context of certain grouping parameters (6.2.4; you could force using the unit to make it even more complicated :) ). It might be difficult, however, to force the use of a window function with variable parameters (length, offset, like in the expression using floor above) using this OData mechanism.

I still believe it would be of great value to get this right....

dirk-ecker commented 2 years ago

> Hey @tobi238, I am doing the same thing by using $filter=phenomenonTime ge .... and phenomenonTime le ...., then from the response you get the "@iot.count". Then you can simply write the code to sum up all your results and divide by "@iot.count". :)

The problem is that you cannot rely on @iot.count, as stated in

> 9.3.3.4 $count
>
> Clients should be aware that the count returned inline may not exactly equal the actual number of items returned, due to latency between calculating the count and enumerating the last value or due to inexact calculations on the service.