IntelRealSense / librealsense

Intel® RealSense™ SDK
https://www.intelrealsense.com/
Apache License 2.0
7.55k stars 4.81k forks source link

Potential issue with time synchronization ("global time") when the camera is busy #9131

Open puzzlepaint opened 3 years ago

puzzlepaint commented 3 years ago
Required Info
Camera Model D415
Firmware Version 05.12.13.50
Operating System & Version Linux (Manjaro)
Kernel Version (Linux Only) 5.10.36
Platform PC
SDK Version 2.45.0
Language C++

Issue Description

The time synchronization mechanism ("global time") between camera and host time in librealsense works by regularly querying the camera clock and comparing it with the host clock. The host clock is queried before and after sending the command that queries the camera clock and receiving the response to that, such that a host time can be computed for roughly the same time at which the camera clock was queried.

This works most of the time, but there is a condition under which this fails (assuming the v4l2 backend; I haven't tested the other backends): Sometimes, in v4l_uvc_device::get_xu() in backend-v4l2.cpp which gets the device's response to the time-query command, the call xioctl(_fd, UVCIOC_CTRL_QUERY, &q) fails with errno set to EBUSY. In this case, the class retry_controls_work_around in backend.h sleeps for a quite long time before retrying: https://github.com/IntelRealSense/librealsense/blob/d5e775b75f7adbf75c911b7c85a4afd2f16defcb/src/backend.h#L446

where DELAY_FOR_RETRIES is set to 50: https://github.com/IntelRealSense/librealsense/blob/d5e775b75f7adbf75c911b7c85a4afd2f16defcb/src/backend.h#L32

The problem is that the response to the time-query command may actually be ready much faster than after these 50 milliseconds. If that happens, the host time that is queried afterwards will be wrong by up to 50 milliseconds in the worst case (if the device's reply was ready almost immediately, but the code waited for 50 milliseconds before retrieving it). This causes the "global time" time synchronization code to use a wrong sample for its computations.

Based on limited testing, I have the impression that with current camera firmware, xioctl(_fd, UVCIOC_CTRL_QUERY, &q) failing with EBUSY happens only when camera streaming just started, and seems to always work later on (when only streaming camera images and sending no other commands such as for example exposure setting changes). Still, it seems that it would be good to address this nevertheless to improve the global time reported in the beginning, and to make it more robust overall.

One way to address this would be to remove the 50-millisecond sleep on EBUSY and keep polling the result of the command immediately without sleeping. This sleep time generally seems rather large to me and I guess that it might also cause other commands to take longer than expected. However, removing it could lead to higher CPU utilization.

MartyG-RealSense commented 3 years ago

Hi @puzzlepaint Thanks very much for your highly detailed suggestions about global time!

Global Time was not originally part of the RealSense SDK and was introduced in May 2019.

https://github.com/IntelRealSense/librealsense/pull/3909

If problems with timestamps are encountered as a consequence of Global Time then it can be disabled by setting RS2_OPTION_GLOBAL_TIME_ENABLED to '0'. The link below provides an example of doing so with C++.

https://github.com/IntelRealSense/librealsense/issues/5067#issuecomment-544548652

In regard to EBUSY, a RealSense user once shared a C++ workaround for catching the error.

https://github.com/IntelRealSense/librealsense/issues/2850#issuecomment-445460696

puzzlepaint commented 3 years ago

Thanks @MartyG-RealSense for your response. The issue from #2850 has been fixed by #4264 as far as I see. I actually disabled global time as you mentioned, since I wrote my own time synchronization that syncs against std::chrono::steady_clock instead of std::chrono::system_clock, and only calibrates an offset value instead of an offset and factor value in the hope that it might be more robust (under the assumption that the clock drift is small). That seems to work fine so far.

The purpose of this issue report is just that I wanted to make the RealSense engineers aware of the technical issue I described in the original post, as I think that they might want to address it. I'll ping @doronhi, @dorodnic, @ev-mp here as authors and reviewers of the PR #3909 that introduced global time.

For reference, this is how I patched my fork of librealsense to avoid delayed USB responses and the resulting impact on time synchronization:

diff --git a/src/backend.h b/src/backend.h
index a402e3dfb..1f9db5c6f 100644
--- a/src/backend.h
+++ b/src/backend.h
@@ -431,7 +431,7 @@ namespace librealsense
                     if (_dev->set_xu(xu, ctrl, data, len))
                         return true;

-                    std::this_thread::sleep_for(std::chrono::milliseconds(DELAY_FOR_RETRIES));
+//                     std::this_thread::sleep_for(std::chrono::milliseconds(DELAY_FOR_RETRIES));
                 }
                 return false;
             }
@@ -443,7 +443,7 @@ namespace librealsense
                     if (_dev->get_xu(xu, ctrl, data, len))
                         return true;

-                    std::this_thread::sleep_for(std::chrono::milliseconds(DELAY_FOR_RETRIES));
+//                     std::this_thread::sleep_for(std::chrono::milliseconds(DELAY_FOR_RETRIES));
                 }
                 return false;
             }
diff --git a/src/linux/backend-v4l2.cpp b/src/linux/backend-v4l2.cpp
index ccf09879e..41e74b8b1 100644
--- a/src/linux/backend-v4l2.cpp
+++ b/src/linux/backend-v4l2.cpp
@@ -1167,10 +1167,11 @@ namespace librealsense
         {
             uvc_xu_control_query q = {static_cast<uint8_t>(xu.unit), control, UVC_GET_CUR,
                                       static_cast<uint16_t>(size), const_cast<uint8_t *>(data)};
-            if(xioctl(_fd, UVCIOC_CTRL_QUERY, &q) < 0)
+            while (xioctl(_fd, UVCIOC_CTRL_QUERY, &q) < 0)
             {
+                // std::cerr << "- get_xu() ioctl FAILED with errno " << (int)errno << " at " << (std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - startTime).count()) << "\n";
                 if (errno == EIO || errno == EAGAIN || errno == EBUSY) // TODO: Log?
-                    return false;
+                    continue;

                 throw linux_backend_exception("get_xu(...). xioctl(UVCIOC_CTRL_QUERY) failed");
             }

However, this addresses the V4L2 backend only, and there may be different opinions on the tradeoff between faster response handling and CPU utilization. This is why I posted it as an issue rather than as a PR.

MartyG-RealSense commented 3 years ago

Thanks for the further detail @puzzlepaint - I do not think that there is anything that I myself can add to the discussion, though one of the RealSense developers that you tagged may wish to provide input.

MartyG-RealSense commented 3 years ago

Hi @puzzlepaint I have marked this case with an Enhancement tag as your kindly submitted Pull Request is associated with it. This case should therefore be kept open whilst the Pull Request is active. Thanks!