adewg / ICAR

Standard messages and specifications for animal data exchange in livestock.
https://icar.org/
Apache License 2.0

Synchronising/completeness data - how to know you have all messages #35

Closed cookeac closed 5 years ago

cookeac commented 5 years ago

At our working group meeting on 25 July we discussed how a client GETting data from another device or a server can know if it has consumed all data, especially given that clocks of devices might not be perfectly synchronised.

I am creating this issue to capture the discussion so that it is documented for future contributors. I will post the emails between us as comments on the thread so contributors have the narrative that supports the decision we made at the meeting.

cookeac commented 5 years ago

Question from Arjan Lamers, 22 July 2019: Hi all,

As discussed in the previous working group call, there is a need to define how an application can be sure it has consumed all data. Roughly, we discussed a couple of possibilities. As promised, in this mail I outline scenarios 0 (don't specify anything), 1a and 1b based on the current message format (but with additional constraints), and 2 (introducing new fields). This can serve as input for a discussion in our next call on Thursday.

0) Don't specify anything

If we do not specify anything here, the standard is less practical. Each interaction between application and data source where there needs to be some form of completeness guarantee will have to be negotiated outside the standard.

1a) Use the existing definition

The first option is based on the currently available fields. Most messages look like this:

```json
{
  "id": "string",
  "animal": {},
  "eventDateTime": "string",
  "location": {},
  "meta": {
    "source": "string",
    "modified": "string",
    "created": "string",
    "creator": "string",
    "validFrom": "string",
    "validTo": "string"
  },
  ... <actual payload>
}
```

From the application perspective: the application will need to keep track of the latest 'modified' date X for a given 'source' Y. To query for all new events, the application should query with parameters like 'modified > X' and 'source = Y'.

From the data source perspective: the modified date should be monotonically increasing. Even if there is drift in hardware clocks, special care should be taken to make sure these timestamps never go back in time. Also, the scope within which the modified date works (which events are monotonically increasing) is tied to the 'source'. The 'source' should indicate where the database is located that keeps track of the 'modified' date and enforces the corresponding monotonically increasing constraint. Depending on the complexity of the device, this could be a cloud database, an on-premises database or even a device-local database.

Pros:

Cons:
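As a concrete illustration of the 1a mechanics, here is a minimal client-side sketch; the in-memory event list stands in for a real data source, and only the field names ('meta', 'modified', 'source', 'id') come from the message format above.

```python
# Scenario 1a sketch: the client keeps the latest 'modified' value it has
# seen per 'source' and queries for anything newer. EVENTS is a hypothetical
# in-memory stand-in for the data source.
EVENTS = [
    {"id": "e1", "meta": {"source": "device-A", "modified": "2019-07-22T06:00:00Z"}},
    {"id": "e2", "meta": {"source": "device-A", "modified": "2019-07-22T07:30:00Z"}},
    {"id": "e3", "meta": {"source": "device-B", "modified": "2019-07-22T07:45:00Z"}},
]

def fetch_new_events(events, source, modified_after):
    """Return events from `source` with 'modified' strictly after the cursor."""
    # ISO-8601 UTC strings in the same format sort lexicographically in
    # timestamp order, so plain string comparison is enough here.
    return [e for e in events
            if e["meta"]["source"] == source
            and e["meta"]["modified"] > modified_after]

# The client's cursor: the latest 'modified' consumed, tracked per 'source'.
cursor = {"device-A": "2019-07-22T06:30:00Z"}
new = fetch_new_events(EVENTS, "device-A", cursor["device-A"])
if new:
    cursor["device-A"] = max(e["meta"]["modified"] for e in new)
```

Note the cursor advances only to the largest 'modified' actually received, which is why the source-side monotonicity constraint matters: a timestamp that jumped backwards would make later events invisible to this query.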

1b) Use the existing definition with a defined maximum drift

Alternative: we could allow a 'maximum drift', stating that 'modified' dates can lag by a maximum of (e.g.) 24 hours. So, the application should query 'modified > X - 24hrs' and 'source = Y'. It will then receive at most 24 hours of duplicate events. These should be deduplicated by the application based on the 'id', again within the scope of the 'source'.

Pros:

Cons:
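A hedged sketch of the 1b mechanics, assuming the 24-hour drift allowance mentioned above; the function and variable names are illustrative, not from the spec.

```python
# Scenario 1b sketch: query from 24 hours before the previous cursor, then
# deduplicate the overlap by 'id'.
from datetime import datetime, timedelta, timezone

MAX_DRIFT = timedelta(hours=24)

def query_from(last_cursor):
    """Start of the next query window: the cursor minus the allowed drift."""
    return last_cursor - MAX_DRIFT

def deduplicate(batch, seen_ids):
    """Drop events already consumed; 'id' is unique within a 'source'."""
    fresh = [e for e in batch if e["id"] not in seen_ids]
    seen_ids.update(e["id"] for e in fresh)
    return fresh

cursor = datetime(2019, 7, 25, 8, 0, tzinfo=timezone.utc)
seen = {"e1"}
batch = [{"id": "e1"}, {"id": "e2"}]  # e1 is re-sent inside the overlap window
fresh = deduplicate(batch, seen)
```

The trade-off is visible here: the client no longer depends on perfectly monotonic source timestamps, but it must persist the set of consumed IDs (at least for the overlap window) and re-download up to 24 hours of duplicates on every poll.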

2) Make those fields explicit

```json
{
  "id": "string",
  "animal": {},
  "eventDateTime": "string",
  "location": {},
  "meta": {
    "source": "string",
    "modified": "string",
    "created": "string",
    "creator": "string",
    "validFrom": "string",
    "validTo": "string",
    "offset": "string",
    "offsetScope": "string"
  },
  ... <actual payload>
}
```

From the application perspective: the application will need to keep track of the latest 'offset' X for a given 'offsetScope' Y. To query for all new events, the application should query with parameters like 'offset > X' and 'offsetScope = Y'.

From the data source perspective: the offset should be monotonically increasing. The data source could still use a 'modified' date here with the same restrictions as above, but other options are also valid (such as row IDs or sequences), as long as the field is comparable. The data source decides on the scope, so it can be independent of a legal or technical source.

Pros:

Cons:
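To make option 2 concrete, here is a sketch of a data source that assigns a monotonically increasing 'offset' per 'offsetScope'. An integer sequence is used for simplicity, although the draft message defines 'offset' as a string, so a real source would serialise it; the store and its methods are hypothetical.

```python
# Option 2 sketch: a row-id style sequence provides the 'offset'; clients
# page on it without any dependence on wall clocks.
import itertools

class EventStore:
    def __init__(self, scope):
        self._seq = itertools.count(1)   # monotonically increasing, never reused
        self.scope = scope
        self.events = []

    def append(self, payload):
        event = dict(payload)
        event["meta"] = {"offset": next(self._seq), "offsetScope": self.scope}
        self.events.append(event)

    def after(self, offset):
        """Events newer than the client's cursor: complete and clock-free."""
        return [e for e in self.events if e["meta"]["offset"] > offset]

store = EventStore("farm-42-db")
store.append({"id": "e1"})
store.append({"id": "e2"})
```

A client that remembers the last 'offset' it consumed (here, 1) asks for everything after it and is guaranteed completeness regardless of clock drift.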

cookeac commented 5 years ago

Response from Andrew Cooke 24/25 July 2019: Dear all,

Here are some further comments on “synchronisation” or “knowing that you have got a complete data set”.

  1. We are not trying to address full data synchronization that may include synchronisation of deletions, as this cannot be achieved through a GET that returns a set of objects. This doesn’t rule out systems implementing their own additional synchronisation or deductively identifying records that have been deleted (no longer returned).
  2. “eventDateTime” and “modified” are two different concepts, so it is ok for servers to change the modified stamp if they modify some characteristics of a record, or to correct a device time, without changing the event date/time.
  3. This has not been addressed by ADE in the past because most events are manually initiated by people, or have a granularity where the comparatively low frequency of messages and the time between them has provided reasonable protection against inaccurate clocks at the point of measurement.

For this last reason, for most existing messages, I prefer approach 1a or 1b as explained by Arjan. We have helped a number of organisations implement similar REST-based data sharing, and I usually recommend approach 1b: the consumer requests data from a provider that was created or modified since 24 hours before the last request it made to that provider, and then resolves duplicates as necessary.

However, when dealing with IoT devices, a more rapid flow of data, and/or communication between on-farm systems, this may not be sufficient. We do want the ADE data schema to support multiple uses, not just server-to-server communications with milk recording organisations.

Preliminary work by the Open Geospatial Consortium (OGC) on its standards identified a similar need, and their working group has prototyped using synchronisation headers or data fields, called SYNC.SERVICEID and SYNC.CHECKPOINT, in association with a modified timestamp. This model assumes that each device or service keeps a set of tracked changes in its data model (inserts, updates, deletes), and "checkpoint" is an ID that points to the current (or a previous) end point of that set of tracked changes. In practice, I don't see any IoT devices or many on-farm systems keeping such lists of their own tracked changes (unless they do replication), but the concept is not unlike option 2 below. Up until now, OGC has been using "resultTime" (the equivalent of our "modified") as its method of querying data.

I consider that inaccurate device clocks will not be such a large issue in the future. Almost all IoT communication frameworks (LPWAN, LoRaWAN, 5G) require accurate clock synchronisation to support network communications, so it is built into their protocols, and internet-connected devices mostly use network time services. The main challenge is current in-field devices with manually maintained time settings. So while I believe we should make best efforts to support solving this problem, we should not introduce too much complexity, and we should be careful about creating mandatory fields that cannot readily be filled by existing systems.

If we are to support option 2 below, I favour a separate “Sync” sub-object with SourceID and SyncOffset fields to make it clear. If we are to make this mandatory (and it is probably only useful if it is reliably available), then we should make it clear that existing systems can map “Source” and “modified” fields into SourceID and SyncOffset.

cookeac commented 5 years ago

During the meeting we discussed the potential to use an interval/period query filter rather than an absolute date/time filter when requesting data.

For instance, if the current time by your computer's clock was 08:00 UTC and you had last asked for data 3 hours ago, you could ask for data by absolute time (modified since 05:00 UTC). However, if the clock on the other computer did not match, there could be messages that you do not get.

However, if you ask for data for the duration "current time - 03:00", then the other computer could use its clock and still return the correct set of data.

This does not adjust for manual changes to clock, which the guaranteed sequential offset method in method 2 above does address, but is otherwise very elegant.
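The relative-duration idea can be sketched as follows; the function and parameter names are illustrative, not from the spec.

```python
# Sketch of the interval/period idea: the client sends a relative duration
# and the server resolves it against its *own* clock, so a constant offset
# between the two clocks cannot cause missed records.
from datetime import datetime, timedelta, timezone

def events_in_last(events, duration, server_now):
    """Server side: events modified within `duration`, by the server's clock."""
    threshold = server_now - duration
    return [e for e in events if e["modified"] >= threshold]

# The client's clock reads 08:00, but the server's reads 07:50 (10 min behind).
server_now = datetime(2019, 7, 25, 7, 50, tzinfo=timezone.utc)
events = [
    {"id": "e1", "modified": datetime(2019, 7, 25, 4, 55, tzinfo=timezone.utc)},
    {"id": "e2", "modified": datetime(2019, 7, 25, 5, 10, tzinfo=timezone.utc)},
]
# An absolute "modified since 05:00" from the client would drop e1; asking
# for "the last 3 hours" resolved on the server's clock returns both.
got = events_in_last(events, timedelta(hours=3), server_now)
```

Both timestamps in `events` fall inside the server-side window (07:50 minus 3 hours = 04:50), illustrating why the duration form tolerates a constant clock offset between the two machines.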

cookeac commented 5 years ago

Here is Arjan's very good summary of the meeting outcome:

Summarizing the workgroup meeting on this ’synchronisation’ / ‘completeness’ decision. Let me know if I misinterpreted something or if there are comments!

The WG decided to opt for scenario 1a. The 'modified' and 'source' fields in the metadata will be compulsory. Data sources are required to make sure that the 'modified' datetime is monotonically increasing (it cannot go back in time). This should not be a problem for devices that synchronise with a central cloud, nor for devices with a local clock. The datetime is already in UTC, so we do not expect problems with DST or time zones. The client can thus keep track of the latest 'modified' datetime it has received per 'source', and use that as a starting point to query.
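One way a data source could honour the monotonicity requirement even when its hardware clock steps backwards is to clamp each new stamp to at least the last one issued. This is only a minimal sketch; the class and names are illustrative, not from the spec.

```python
# Scenario 1a data-source obligation: 'modified' must never go back in time.
from datetime import datetime, timezone

class ModifiedStamper:
    def __init__(self):
        self._last = datetime.min.replace(tzinfo=timezone.utc)

    def stamp(self, clock_now):
        """Return a 'modified' datetime that never decreases."""
        if clock_now < self._last:
            clock_now = self._last  # clock stepped back: hold the last stamp
        self._last = clock_now
        return clock_now

s = ModifiedStamper()
t1 = s.stamp(datetime(2019, 7, 25, 8, 0, tzinfo=timezone.utc))
t2 = s.stamp(datetime(2019, 7, 25, 7, 55, tzinfo=timezone.utc))  # clock jumped back
```

During the backward step, events share the same 'modified' value rather than regressing, so a client querying with 'modified >= X' still receives them; this covers small drift corrections, while the hard-reset case below remains out of scope.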

There is a scenario that we do not cover in this specification: in case a device with a local clock has drifted too much, an operator may decide to reset its clock either forward or backward. If it is reset backward in time, clients may miss data recorded in that correction period. In the case of such a hard reset, clients should be prepared to recapture a larger period of time. How to detect this is out of the scope of the spec and assumed to be a manual process, similar to the repairs needed after hardware failure or other kinds of data loss. In this scenario, the client should be able to rely on the 'eventId' being unique for the 'source'.

The URL scheme should thus allow querying for 'modified >= x and source = y'. Alternatively, a client could query with 'modified in last z hours and source = y'. The latter does not require the client to keep the modified date, but it has to take into account possible drift between the clock of the source and the clock of the client.
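For illustration only: the issue does not fix a concrete URL scheme, so the endpoint and parameter names below are assumptions showing how the two query styles from the summary could look.

```python
# Two hypothetical query styles for the same feed; only the semantics
# ('modified >= x and source = y' vs. 'modified in last z hours') come
# from the discussion above.
from urllib.parse import urlencode

BASE = "https://example.org/ade/events"  # hypothetical endpoint

# Style 1: absolute cursor -- 'modified >= x and source = y'
q1 = urlencode({"source": "device-A", "modified-from": "2019-07-25T05:00:00Z"})

# Style 2: relative window -- 'modified in last z hours and source = y'
q2 = urlencode({"source": "device-A", "modified-last": "PT3H"})  # ISO-8601 duration

print(BASE + "?" + q1)
print(BASE + "?" + q2)
```

`urlencode` percent-encodes the colons in the timestamp, which is why style 2's opaque duration token is slightly friendlier to log-reading humans.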

The WG also discussed the possibility of allowing for a tolerance period (scenario 1b). No default could be found that is both reasonably efficient and guaranteeable by all vendors, and the perceived benefit of defining such a period is low.

The WG also discussed future extensions: if messages (or rather, devices) are defined for which scenario 1a cannot be implemented, the standard could define a tolerance period (scenario 1b) for those messages, or the standard could be extended with specific synchronisation fields (a Sync group as Andrew suggested). This will be considered only when such messages present themselves with a use case.

If this summary is approved, we’ll need to update the spec to make the fields compulsory.