sarfata commented 5 years ago

Summary

Propose to create a new SignalK data format that can be used to convey data about multiple SignalK objects (vessels, aton, aircraft, etc) over a period of time.

Motivation

Multiple developers are working on history related features and there is a strong interest in agreeing on one common data format. Discussion was started and continues on Slack #history-api.

In this proposal, I suggest a new format for the following reasons:

The delta format can only carry information about one context;
The delta format is very verbose in this use-case because it needs to provide at least one update object and value object for each timestamp represented. It also repeats the key names for each timestamp;
The client needs to apply a lot of logic on a delta object to figure out:
- Which values (path) are available
- Which sources are available for each value
- Get a list of values for a given path

This format can be used as both a "log" file format and a format served over HTTP from a server.

Detailed design

History data format

The history object provides essential information about the data included:

startDate / endDate
the identifier of self if it is known (it is not required because in some cases we do not know)
an optional generator key to include information on who generated this file (especially useful for logfiles)

  "version": "1.1.0",
  "startDate": "2018-10-06T04:00:00Z",
  "endDate": "2018-10-06T04:00:02Z",

And then a list of objects. Each object is identified by a context like we already do on delta objects but then for each we provide:

A list of timestamp
A list of properties
- Each property has a path and a source (this way we can have data from multiple sources)
- A list of values which must have the same length as the list of timestamps

  "objects": [
    {
      "context": "vessels.urn:mrn:xxx",
      "timestamps": [ "2018-10-06T04:00:00Z", "2018-10-06T04:01:00Z", "2018-10-06T04:00:02Z"],
      "properties": [
        {
          "path": "navigation.position",
          "source": { "label": "NMEA1" },
          "values": [
            { "longitude": -182.2, "latitude": -42.1},
            { "longitude": -182.2, "latitude": -42.1}
          ]
        }
      ]
    }
  ]

A few important notes:

The timestamps may be different for each object. They just need to be within the bounds of the file. That is because in many situation we will not have the same amount of information on all objects (an AIS target may have been visible only for a few minutes) and having only one list of timestamps for the entire file would force us to include a lot of nulls.
The length of the timestamps array must be equal to the length of the properties.*.values arrays.
If a value is not available for a given timestamp, it can be null in the values array.

History REST endpoint

(WIP - Need more work on this)

Drawback

This is one more format.

Alternatives

See the slack #history-api archive or https://docs.google.com/document/d/1s4_lHVVyKJlfacpq5LcUPQEZHvU5nSVtdtE0BxSbbSw/edit# for a summary of other proposals discussed.

Using delta format

Very verbose
Only one context at a time
Harder to use for a client

Using geo-json

Not very appropriate when the navigation.position key is not included

tkurki commented 5 years ago

We should have a way to specify the resampling per period length

what samples the client requests: min/max/average/first/last and a way to have for example min and max for navigation.speedOverGround for each time period
period length / time slice, in query and in response

tkurki commented 5 years ago

See also #89 History/time series api and #363 Track API.

rob42 commented 5 years ago

Are you sure this is actually much less verbose? Looking at it with a human eye is misleading as we pretty print and compare. When you remove whitespace and CR/LF the effect is different.

Waaay back I did a size comparison of delta vs sparse full format, there was very little difference, mostly around how much of the path was common to the values. Compression gives us a huge size reduction because of the repetition.

But we do need a format for history.

BTW we should avoid calling the playback and snapshot functionality 'history'. Its becoming confusing since they are likely to be different api's and formats. eg 1) 'Playback' - extension to /signalk/v1/stream to replay data 2) 'Snapshot' - extension to /signalk/v1/api to retrieve data at a point in time 3) 'History' - new api to get bulk historic data by date-range.

sarfata commented 5 years ago

what samples the client requests: min/max/average/first/last and a way to have for example min and max for navigation.speedOverGround for each time period

I do recognize that this is useful and I have just tried a few formats that feel very contrived. I think the best way might be to use the source for this.

For example:

  {
          "path": "navigation.speedOverGround",
          "source": { "label": "aggregation.max", type: "max", "originalSource": { "label": "NMEA1" } },
          "values": [ 12.2, 12.1, 10.9 ]
  }

I think this fits well into our general model. Thoughts?

Notes:

We still need to discuss how to request it
Samples easily map to timestamps but with aggregation, we will need to clarify whether the samples apply to the interval between two timestamps (in which case there should be one less value than timestamps); or "at" the timestamp. Need to look into how influxdb does this more.

period length / time slice, in query and in response

Yes for the query.

For the response, it would be great to know the time slice so you can directly access time t with values[(t - startTime) / period] but that means we really enforce the fact that all the timestamps are precisely following the period. Are we ready to include that requirement? Maybe it's an optional field but when it's provided timestamp n must be equal to startTime + n * period?

If we decide to do this, I will extend the validation to actually verify this too (also need to check that they are all in chronological order and that they are not repeated).

sarfata commented 5 years ago

Are you sure this is actually much less verbose? Looking at it with a human eye is misleading as we pretty print and compare. When you remove whitespace and CR/LF the effect is different.

Waaay back I did a size comparison of delta vs sparse full format, there was very little difference, mostly around how much of the path was common to the values. Compression gives us a huge size reduction because of the repetition.

So I added support for 'history' format (very WIP) to my conversion tool in strongly-signalk and did some tests using some real log files people have sent me during charted sails development.

File	Original size	Original size + GZ	Size in SK Delta	Size in SK Delta + GZ	Size in SK History	Size in SK History + GZ
Velocitek logfile (.vcc which is GPX like) (9 hours)	1.7 MB	208k	3.4 MB	312kB	1.3MB	236 kB
Log from Cassiopeia (1 hour SignalK log with 176k updates / 217k values)	29 MB	3.2 MB	33 MB	1.5 MB	28 MB	1.1 MB
Log from tkurki in SK (38mn SignalK log with 37k updates / 65k values)	7.1 MB	916 kB	8.1 MB	372kB	5.8 MB	280 kB
Expedition san-francisco.csv example 16k updates/92k values	1.4 MB	328kB	8.1MB	632kB	2.7MB	316kB

Conclusions:

The best gains are obtained through compression. No doubt about this.
History format as proposed here is always smaller than delta, both before and after compression and by a significant amount (30% to 70% reduction).
CSV is not a bad format for what we want to do here... (but it would be hard to include sources, multiple boats, etc)

Would be interesting too to measure the size 'in memory' of the file in different engines. I know how to do this with Chrome but not with NodeJS. If anyone has ideas, let me know!

rob42 commented 5 years ago

In memory size will quickly overrun the little RPi :-(

The history implementation should be using streaming, as we can easily produce arbitrarily large datasets. That doesnt mean ws etc, just streaming internally so memory stays tiny.

If we reply with a gzipped stream of updates (via http) then the format problem is also solved. Size is excellent, and we already have the handling code.

sarfata commented 5 years ago

Unless we agree on the use-cases, I think we can all be right at the same time and still disagree on the best solution. Reading comments here, I think we have very different scenarios in mind. We need to clarify what use-cases we are trying to solve for so we can make a decision: What are the types of apps that are consuming this history format? What are they doing with it?

My main use-case is displaying how data series change over time on a map or a graph. To do this, I need all the data in memory at once and I need to be able to quickly find the min/max of the value (for the scale of a graph for example) and quickly access data in an index format. This is very expensive to do with the delta format and that is why I would be happy to see a different format (such as the one I proposed in this PR) that would be much easier to consume directly.

tkurki commented 5 years ago

I agree with @sarfata, we should start from real world use cases.

We can also first build some of the applications, at the risk of doing things a few times over, and then come back to the spec issue once the most predrominant use cases have been worked out.

tkurki commented 5 years ago

We had a lengthy discussion with Rob over Slack, where the main point (from my point of view at least) was that to enable efficient stream-based handling the format should have self contained units: for example having first a list of timestamps and then the corresponding data forces the consumer to collate data from different parts of the dataset. Instead putting the related items together allows more efficient processing in most circumstances.

Too bad GeoJSON does not allow extending the coordinates...

rob42 commented 5 years ago

My additional use case is to collect track, depth and other info within a bounding box and display on the chart. Same for wind, sog, etc for polar comparisons, engine performance comparisons, etc Also data export to the cloud with intermittent connection. @tkurki since you need the full message to make the array in ram work, you could just ingest the stream and build the array locally?

tkurki commented 5 years ago

To me the whole point of having a history API is to provide fast and convenient access to historical data with different aggregation options.

Digesting the original delta stream does not fit any of those criteria.

rob42 commented 5 years ago

Cross-posting the format being discussed:

[
    {"2015-03-07T12:37:10.523+13:00":[
            {"vessels.urn:mrn:imo:mmsi:234567890":[
                    { "navigation.position": {
                                    "$source": "a.suitable.path",
                                    "average":{
                                        "longitude": 24.9025173,
                                        "latitude": 60.039317
                                    }
                                }
                        },
                        "environment.depth.belowSurface": {
                                    "$source": "a.suitable.path",
                                    "avg":2.5,
                                    "max": 2.8,
                                    "min": 2.5
                                    }
                ]
            },
            ...more vessels
        ]
    },
    ...more timestamps
]

rob42 commented 5 years ago

I'm open to other formats that: 1) are 'packetised' so we can write/read them with low resources and they are resilient when partially sent/recieved over intermittent links. 2) Handle multi-vessels 3) Handle complex combinations of paths, nulls, missing data, etc

rob42 commented 5 years ago

A thought here: a lot of the discussion is about the use case, and what suits it. Obviously thats different for different use cases. I think we should concentrate a the best format to transfer data, not to process data.

This format could be used to dump data for backup, transfer data between signalk instances, consolidate data to a different timeslice, or return data for a history query. It should handle bad connections, potentially huge data sizes, and multi-vessel/complex query responses. In the delta format we only considered now(), so it was keyed on the vessel. In this case the natural key is timestamp, hence the natural self-contained unit or packet is one timeSlice

The production of data, and the clients use of the data is actually implementation detail. There are so many use-cases that we cant optimise a generic format for any specific one. If we really need to do that, then we should have a specific API for that use-case. aka /track/

tkurki commented 5 years ago

It sounds like you are after a transfer format and we should create it separate from a more use case driven format.

sbender9 commented 5 years ago

I agree with @tkurki , seems like we should have a separate API for the transfer use cases.

gdavydov commented 5 years ago

So, how this format will looks like? Like @rob42 mentioned above?

fabdrol commented 5 years ago

@sarfata @rob42 @tkurki this seems stale. In any case there are conflicts. Please update or we should close and revisit at a later time. Thoughts?

rob42 commented 5 years ago

I think this is still useful. While the PR is stale now, the issue is going to come up again as soon as we use history in more complex ways. It also relates to #543 since both need a high volume and very efficient transfer format

tkurki commented 5 months ago

Closing as stale, to be implemented as OpenApi description in the future, see https://github.com/SignalK/signalk-server/pull/1653

SignalK / specification

[WIP] feature: add a new history format and history REST API #513

Summary

Motivation

Detailed design

History data format

History REST endpoint

Drawback

Alternatives

Using delta format

Using geo-json