Time sync issue with uxr_sync_session()

erikboto commented 6 months ago

I'm using an ESP32 with the micro-ROS component using a custom transport (serial port), and in my application I do a periodical rmw_uros_sync_session() call with a rather short time-out. The idea is that I want to make sure the clock doesn't drift during longer recordings, but I also keep the timeout short so it doesn't interfere with all my sensor sampling, basically it's not a big thing if the sync fails every once in a while since the periodic sync is mainly there to correct any drift.

But, every now and then I notice that we suddenly got a jump backwards in time. Looking into this in more detail I found that it happens some times after a successful sync following an unsuccessful one (unsuccessful due to timeout), but far from every time. Another thing I noticed is that the jump backwards in time was always half the time I set between my sync attempts.

After having a quick look at the code I would guess that this is caused by having an unprocessed timestamp_reply due to timing out before it's processed, since the calculations done in process_timestamp_reply() would cause this "sync period / 2" delay. Then on the next sync, the old reply is then processed and the new reply is left for the next sync() causing the time-shift to be persistent even after more syncs.

Not sure how this can be fixed best. Perhaps there's some way to invalidate previous timestamp replies when issue a new sync request, or add a id field to the request so that we can verify that we process the correct reply when calculating the session->time_offset.

pablogs9 commented 6 months ago

Could you provide a quick replicator?

erikboto commented 6 months ago

Sure, you can use this repo https://github.com/erikboto/microros-sync-bug-trigger.

The example uses a really low timeout (3ms) since it triggers the bug on the first run, so it's easy to reproduce. In our actual application it's 5 ms, where it works most of the times but just fails sometimes. Using 5 ms here will always work so the bug is never triggered, since there's not a lot of interrupts or other threads running.

I'm using a https://www.olimex.com/Products/IoT/ESP32/ESP32-DevKit-LiPo/open-source-hardware board, but I guess any HW will result in the same thing. On this board I'm using the "USB-uart" for microROS, and then I set up the log output to a UART on one of the GPIO pins.

The microROS agent I run using the docker images: docker run -it --rm -v /dev:/dev -v /dev/shm:/dev/shm --privileged --net=host microros/micro-ros-agent:humble serial --dev /dev/ttyUSB1 -b 921600 -v3

pablogs9 commented 6 months ago

Why do you need to sync time so quickly? This approach was designed to calculate drift between clocks and then assume a certain stability for the drift variance.

Which is your use case?

erikboto commented 6 months ago

Well this minimal replicator isn't really representative of how we are using it, here it is taken to an extreme just to replicate the issue. The reason I want to use a short timeout is that the system also samples sensors at high sample rates so I want to make sure the sync affects this as little as possible. So the real use-case is just to be able to call sync() with a short timeout, but with a long time between sync() calls, without risking that future sync() calls trigger this bug where an old reply is parsed causing the round-trip time calculations to be way off.

My understanding was that the sync() call measures the round-trip time, which it uses to set the offset from the received timestamp. But is there some periodical timestamp message coming from the microROS agent even if I don't call sync() again, that it can use to avoid long term drift? Or how does it calculate the drift? (I'm by no means a time-sync expert so I might be missing something that is obvious to you).

pablogs9 commented 6 months ago

This approach uses an NTP-like algorithm for calculating the time offset between the MCU and the Agent's clocks. This is not a high-precision approach that targets to provide a valid (POSIX-like) time reference to an MCU that probably boots with a POSIX timestamp set to zero.

Ideally, the sync procedure shall be run once a startup and it will be used internally in Micro XRCE-DDS Client to offset the timestamp as you can see here: https://github.com/eProsima/Micro-XRCE-DDS-Client/blob/0c6743ffa358f26ca9433e951c534ec2f96be37a/src/c/core/session/session.c#L1194.

One approach that you can use is configuring the on_time callback in the session using this uxr_set_time_callback and control when you want to use or reject the wire protocol timing info. You have an example here: https://github.com/eProsima/Micro-XRCE-DDS-Client/blob/master/examples/TimeSyncWithCb/main.c

In any case, as you are right and time sync responses shall not be processed out of sync call, I have prepared this PR: https://github.com/eProsima/Micro-XRCE-DDS-Client/pull/382 Could you take a look and validate that it fixes your use case?

Finally you have some documentation about this feature here: https://micro-xrce-dds.docs.eprosima.com/en/latest/time_sync.html

erikboto commented 6 months ago

I had a quick look at the commit and I don't think it will solve the issue I see, which is that the old reply is read the next time I call sync(). So in my case session->waiting_for_time_sync will be true but it will read the old timestamp reply instead of the new one. But I will test it anyway to confirm.

Yeah using a custom on_time() callback could definitely work to make sure I don't set the time wrong, but it would still leave me in state where I always parse the reply from the previous sync() call which means that the new reply gets left for the next time. So I can't ever "catch up" and process the extra timestamp reply that is left in some buffer (at least that's what I think is happening).

pablogs9 commented 6 months ago

You are right, replies from different requests still can cause problems.

It seems that the wire protocol is not designed to handle this because the definition of the sync request and reply does not provide any kind of request_id so we cannot differentiate between responses to different requests.

eProsima / Micro-XRCE-DDS-Client

Time sync issue with uxr_sync_session() #381