UUID rand_b value should not be a fixed random value

gregmedd commented 2 months ago

The Problem

The current UUID spec for uProtocol defines a UUIDv7-like format with one major change: the rand_b value will be generated once at startup for a uEntity and then reused for the lifetime of that uE.

This drastically reduces the uniqueness guarantees of these UUIDs in such a way that it may not be possible to distinguish between true message duplication and UUID reuse caused by time synchronization issues.

Background: Why is the uP UUID like this?

By fixing the rand_b field at a value generated when the uE starts, it is possible to derive uE lifecycle information from the UUIDs generated by that uE. If the rand_b field changes for a given uE, then we can reasonably assume it has restarted.

Background: Clock Sources and Time Synchronization

In general, we have two clocks we can choose from on Linux when we want to know the time: the system clock with UTC time (also called the "wall clock"), and the monotonic clock with a system-local monotonically counting representation of time. The system clock is what is used when agreement about time between multiple systems is needed. It is typically kept synchronized using network protocols, such as NTP or PTP, or through some timecode receiving hardware, such as a GNSS receiver or an IRIG capture device.

When connectivity to an upstream time source is lost (e.g. network connection lost or GNSS satellites not visible due to obstruction), the local system clock will free-run and drift over time due to instability in local frequency sources. In this free-running holdover mode, a system's clock could drift by several seconds over the course of a week.

When connectivity to an upstream time source is restored, the drift in the local system clock needs to be corrected. If the difference is small, this is usually done by subtly adjusting the clock frequency to slew the clock's time back into alignment. However, if the difference is large (on the order of seconds), this strategy will not work and the clock must be "jumped" to the correct time.

UUIDs and Uniqueness

As uProtocol UUIDs are defined today, reuse of UUIDs is practically guaranteed if the clock has run fast and must be jumped backwards to an earlier time. This is because the only fields that change for each newly generated UUID are the unix_ts_ms (time in milliseconds) and the counter (monotonically increases per message within that millisecond, resetting to zero at the top of each millisecond).

This poses a problem for several usage models of uProtocol. For the RPC model, these UUIDs are used to match replies to requests. For logging (and possibly playback) of messages, it's often necessary to be able to identify if a message is unique or if it is repeated. In the event of a backward time jump, it will no longer be possible to guarantee unique UUIDs and these uses can break down.

UUIDs and Clock Sources

In uProtocol, the time field on a UUID is used, in part, to enforce a message's TTL (time to live). When a message is received, the TTL is added to the UUID's time stamp then compared to the current system clock time; if the system time is earlier, then the message is still valid. As such, agreement about the current time is needed between uE.

This means it's not possible to switch to a monotonic clock for generating UUID values, even though this would guarantee time-ordering of messages regardless of what happens with the system clock. The monotonic clock's very nature is that it can always count upwards because it does not have to be the same value as any other clock. It cannot be synchronized with another system without introducing the same time jump issue as the system clock, violating its monotonicity in the process.

UUIDs and Ordering

While uProtocol itself makes no guarantee on the order of message delivery, there are scenarios where determining the transmission order of messages is necessary. One such example is capture and playback of uProtocol messages. Another is event logging.

In these scenarios, a clock-synchronizing jump to an earlier time will result in messages being sorted differently than their true transmission order. This could be solved for a single uE source by using a monotonic clock when generating UUIDs, but then the messages could not be sorted to other uE running on different systems (and TTL could not be enforced).

For applications on top of uProtocol that need ordering, a monotonic value could be simply embedded in the message payload. However, this does not cover the needs of logging, capture, and playback which need to operate at layer 1 or layer 2.

Proposals

The proposals below seek to address these issues:

Uniqueness of UUIDs
Globally correct ordering of messages / handling of time jumps
Tracking of uE lifecycle

Switch to UUIDv7

This proposal addresses the three issues with:

Guaranteeing uniqueness of UUIDs by switching to UUIDv7
Applications requiring ordering embed their own monotonic time values, while logging applications use their monotonic clock to record received times for use in detecting jumps
uE lifecycles are tracked as an application-layer resource

UUIDv7 is almost identical to the uProtocol UUIDv8 except that rand_b is required to be randomly generated each time a UUID is created. The rand_b field is sufficiently large such that a uE using a single pseudorandom number generator is unlikely to repeat a random value within the time scales were a time jump would be a concern. Even if time jumps backward, new messages can always be distinguished from repeated messages.

Applications are free to embed any data into their message payload to meet their needs. An application needing to know true transmission ordering can embed a monotonic timestamp into the message before it is sent. The receiving end of the application can use these values to detect time jumps, verify ordering of messages, and even to correlate monotonic time offsets between systems. Details of this would be left to application designers.

Logging and capture/playback uses could instead record the receiving system's monotonic time when each message arrives and store the two together. In post processing, this monotonic time can be used to detect time jumps in the UUIDs, assuming a) the synchronization jump threshold is known and b) the maximum network latency jitter is significantly less than the jump threshold. If the changes in the offset between the receive timestamp and the UUID timestamp are within the jump threshold, then we can reason that no jump has occurred. If the delta between the received timestamp and UUID timestamp changes by more than the jump threshold, we can assume a time jump has occurred.

uE lifecycles can be handled at the application layer. A standard uE health module could be made that periodically publishes uE uptime and other metrics such that restarts can be detected. Any uE that cares about the lifecycle of another uE could subscribe to these messages.

Steal some bits from `rand_b` for a monotonic timestamp

A smaller modification would be to steal some bits from the rand_b field to embed part of the sender's monotonic time in. It wouldn't need a huge number of bits, maybe enough for about 1 hour of range. This would a) result in unique UUIDs even if time jumps occur (so long as they are smaller than the range of the monotonic time field), b) assist in preserving absolute ordering for a given uE's messages, and c) preserve the existing uE lifecycle tracking by using a fixed random value for each start.

Since the rand_b field is unchanging through the lifecycle of a uE, and it does not significantly contribute to the uniqueness of the UUID, fewer bits can be allocated to this purpose. It only needs to be enough to provide

a reasonable degree of certainty that the value will be different from one launch of a uE to the next
enough range to avoid value collisions between uE on the same system.

stevenhartley commented 2 months ago

Since the rand_b field is unchanging through the lifecycle of a uE, and it does not significantly contribute to the uniqueness of the UUID, fewer bits can be allocated to this purpose. It only needs to be enough to provide

This is not true, the rand_b generation guarantees that al of the UUIDs from one uE is unique to another uE but I agree it doesn't' solve the time issue. Reusing portions of rand_b for monotonic clock would loose the entropy for the random numbers and we are back to potential collisions if uEs are gPTP clock synchronized and happen to have the (much smaller) random number generation that collides.

gregmedd commented 2 months ago

guarantees that al of the UUIDs from one uE is unique to another uE

This is true, but it's not a particularly valuable feature since messages contain other attributes identifying the source uE.

Reusing portions of rand_b for monotonic clock would loose the entropy for the random numbers

Again, this is true, but currently rand_b has 62 bits. That is a lot just to avoid collisions between uE. If we gave 20 bits to a monotonic timer, we would still have a rand_b space of around 4.4 trillion. And since the uE source of a message can easily be identified through other message attributes, a collision in a rand_b that only serves to identify a uE is probably not a major concern.

entropy for the random numbers

I specifically want to touch on this again: if the random numbers only change when the uE starts up, 62 bits of entropy isn't necessary. In UUIDv7. the 62 bits of entropy is intended to ensure global uniqueness across massively distributed databases with new values being generated potentially billions of times per second.

Personally, I lean towards the first of the two proposals (switching to UUIDv7). It makes the best use of the entropy in the rand_b field and avoids trying to encode more meaning into a UUID. All of the ordering, time jumps, and uE metrics concerns can be addressed at higher layers.

sophokles73 commented 2 months ago

but it's not a particularly valuable feature since messages contain other attributes identifying the source uE

In fact, each message has a source URI which identifies the uEntity that the message has been sent by, if I am not mistaken. So there should not be any need to derive the message origin from the UUID at all ...

gregmedd commented 2 months ago

It sounds to me like the best course of action is to switch to UUIDv7 and use the UUID solely¹ as a unique message identifier.

This does raise the question of how to enforce TTL. Wall clock time, as noted, is sensitive to time jumps and monotonic time cannot easily be shared across systems without implementing a new way of doing so.

¹ Capture and playback can also use the UUID timestamp for determining message ordering when combined with the receiver's monotonic time by monitoring the relative offset between the two timestamps for a given uE to detect time steps.

sophokles73 commented 2 months ago

At least within the vehicle the TTL issue could probably be addressed by requiring that all uEs sync with the vehicle's Time Master so that they all perceive the same time shift if the external source of current time is not available.

sophokles73 commented 2 months ago

It looks like gPTP has been designed to solve the issue of syncing the time within the vehicle. Assuming that off-vehicle communication will require an established network connection, it might be fair to say that in that case, the time in the vehicle will be synced using e.g. NTP?

gregmedd commented 2 months ago

...by requiring that all uEs sync with the vehicle's Time Master so that they all perceive the same time shift...

While that does help, neither gPTP¹ or NTP would guarantee that the correction step would occur simultaneously across all devices. And even if that could be done, any message in flight when the step occurs would still end up with the wrong time and incorrect TTL enforcement.

I think this comes down to what guarantees we can actually make in uProtocol. Do we guarantee delivery, or can messages be lost? Do we guarantee in-order delivery / time sortable messages, or is order not guaranteed? Do we guarantee (or require) perfect time synchronization between uE, or do we accept that might not be possible?

Many of these issues can be addressed at higher layers with a little work, so it might be fine to say uProtocol is best effort and let individual applications add the features they need on top of it.

¹ I'll admit the majority of my PTP experience predates gPTP, so maybe there is a way for it to synchronize the time steps across clients that I'm not aware of.

int0x27 commented 2 months ago

I was also concerned about UUID as timestamp for SOME/IP ttl.. My suggestion there was for local requests and ttl < 100ms to assume timestamp of receiving the request.

I suppose there is a way to detect if a request is local or remote (cloud event). It makes sense to check timestamp for remote requests, provided that ttl is big enough (some seconds at least) Btw some/ip request -> response roundtrip is ~17ms (both vsomeip on different hosts)

Regarding aggressive time synchronization, I have reservations about safety of setting time backwards. Some POSIX code using TIMER_ABSTIME may not handle it well, but probably it's not relevant for automotive grade Linux.

stevenhartley commented 2 months ago

completed and merged

eclipse-uprotocol / up-spec