IntelRealSense / librealsense

Intel® RealSense™ SDK
https://www.intelrealsense.com/
Apache License 2.0

D435 UVC Issue(s) #11947

Closed Simon-McD closed 1 year ago

Simon-McD commented 1 year ago

Required Info
Camera Model D435
Firmware Version 5.15.0.2
Operating System & Version Ubuntu 20.04.2 LTS
Kernel Version (Linux Only) 5.4.0-152-generic
Platform PC
SDK Version 2.54.1
Language C/C#
Segment Agritech

I was unsure whether to post this here since I use precompiled libs installed through apt. I submitted an enquiry through Intel Zendesk, but they told me to upgrade to the 5.8 kernel because Ubuntu no longer supports 5.4. Upgrading the kernel is a worst-case scenario for us because our systems are on site across Europe and running headless. The risk of losing remote access and requiring a site visit is a worry. Also, the 2.54 release page here on GitHub suggests that 2.54 is compatible with a 5.4 kernel.

RealSense libs are installed pre-built through apt:

librealsense2-dbg/focal,now 2.54.1-0~realsense.9590 amd64 [installed]
librealsense2-dkms/focal,now 1.3.19-0ubuntu1 all [installed]
librealsense2-gl/focal,now 2.54.1-0~realsense.9590 amd64 [installed]
librealsense2-udev-rules/focal,now 2.54.1-0~realsense.9590 amd64 [installed,automatic]
librealsense2-utils/focal,now 2.54.1-0~realsense.9590 amd64 [installed]
librealsense2/focal,now 2.54.1-0~realsense.9590 amd64 [installed]

Something about the USB subsystem/D435 interaction seems to be crashing our software with a seg fault. We use try_wait_for_frames with a one second timeout. dmesg is full of UVC errors:

uvcvideo: Failed to query (GET_CUR) UVC control 1 on unit 3: -32 (exp. 1024).
uvcvideo: Failed to query (GET_CUR) UVC control 4 on unit 3: -32 (exp. 2).
uvcvideo: Failed to query (GET_CUR) UVC control 2 on unit 3: -32 (exp. 1).
uvcvideo: Non-zero status (-71) in video completion handler.
uvcvideo: Failed to query (GET_CUR) UVC control 1 on unit 3: -32 (exp. 1024).
uvcvideo: Non-zero status (-71) in video completion handler.
uvcvideo: Failed to query (GET_CUR) UVC control 1 on unit 3: -32 (exp. 1024).
uvcvideo: Failed to query (GET_CUR) UVC control 1 on unit 3: -32 (exp. 1024).
uvcvideo: Failed to query (GET_CUR) UVC control 1 on unit 3: -32 (exp. 1024).
uvcvideo: Failed to query (GET_CUR) UVC control 1 on unit 3: -32 (exp. 1024).
uvcvideo: Failed to query (GET_CUR) UVC control 1 on unit 3: -32 (exp. 1024).
uvcvideo: Failed to query (GET_CUR) UVC control 1 on unit 3: -32 (exp. 1024).
uvcvideo: Failed to query (GET_CUR) UVC control 1 on unit 3: -32 (exp. 1024).
MyBinaryName[40937]: segfault at 18 ip 00007f8da2375fc4 sp 00007f8d8d7284e8 error 4 in libpthread.so.0[7f8da2371000+11000]
Code: 7e 8f 45 31 d2 ba 01 00 00 00 be 01 00 00 00 48 89 ef b8 ca 00 00 00 0f 05 e9 73 ff ff ff e8 13 b7 ff ff 0f 1f 00 f3 0f 1e fa <8b> 47 10 89 c2 81 e2 7f 01 00 00 90 83 e0 7c 75 7b 53 48 83 ec 10
uvcvideo: Failed to query (GET_CUR) UVC control 1 on unit 3: -32 (exp. 1024).
uvcvideo: Failed to query (GET_CUR) UVC control 4 on unit 3: -32 (exp. 2).
uvcvideo: Failed to query (GET_CUR) UVC control 2 on unit 3: -32 (exp. 1).
uvcvideo: Non-zero status (-71) in video completion handler.
uvcvideo: Failed to query (GET_CUR) UVC control 1 on unit 3: -32 (exp. 1024).
uvcvideo: Non-zero status (-71) in video completion handler.
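
Editor's note: the negative numbers in the uvcvideo lines above are kernel errno values with the sign flipped. On Linux, 32 is EPIPE (which the USB stack generally uses to report a stalled endpoint) and 71 is EPROTO (a low-level protocol error on the bus). The following is a minimal sketch for decoding them, assuming a Linux build; the function names are made up for illustration and are not part of librealsense or the kernel.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Maps a negative kernel status code, as printed by uvcvideo in dmesg,
// to its symbolic errno name. Only the two codes seen in the log above
// are named; any other value falls back to its number.
std::string kernel_status_name(int status) {
    switch (-status) {
        case EPIPE:  return "EPIPE";   // 32 on Linux
        case EPROTO: return "EPROTO";  // 71 on Linux
        default:     return "errno " + std::to_string(-status);
    }
}

// Returns the C library's human-readable description of the same code.
std::string kernel_status_text(int status) {
    return std::strerror(-status);
}
```

Calling kernel_status_name(-32) and kernel_status_name(-71) gives the two names that appear in the log; kernel_status_text() returns whatever wording the local C library uses for them.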

When I run a gdb backtrace of the core dump against a debug version of our binary I see:

#0  0x00007f8da2375fc4 in pthread_mutex_lock () from /home/myusername/lib/libpthread.so.0
#1  0x00007f8da854c81d in __gthread_mutex_lock (__mutex=0x8)
    at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749
#2  __gthread_recursive_mutex_lock (__mutex=0x8) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:811
#3  std::recursive_mutex::lock (this=0x8) at /usr/include/c++/9/mutex:106
#4  std::lock_guard<std::recursive_mutex>::lock_guard (__m=..., this=<synthetic pointer>)
    at /usr/include/c++/9/bits/std_mutex.h:159
#5  el::base::RegisteredLoggers::get (this=0x0, id=..., forceCreation=false)
    at ./third-party/easyloggingpp/src/easylogging++.cc:1894
#6  0x00007f8da854ccbd in el::base::Writer::initializeLogger (this=0x7f8d8d728940, loggerId=..., 
    lookup=<optimized out>, needLock=<optimized out>) at ./third-party/easyloggingpp/src/easylogging++.h:2608
#7  0x00007f8da854d085 in el::base::Writer::construct (this=this@entry=0x7f8d8d728940, count=count@entry=1, 
    loggerIds=loggerIds@entry=0x7f8da858c4c6 "librealsense") at /usr/include/c++/9/ext/new_allocator.h:80
#8  0x00007f8da84eca02 in librealsense::uvc_sensor::<lambda(librealsense::platform::stream_profile, librealsense::platform::frame_object, std::function<void()>)>::operator()(librealsense::platform::frame_object, std::function<void()>) (
    __closure=0x564ea6b64360, f=..., continuation=..., p=...) at /usr/include/c++/9/bits/stl_vector.h:94
#9  0x00007f8da84eda9e in std::_Function_handler<void(librealsense::platform::stream_profile, librealsense::platform::frame_object, std::function<void()>), librealsense::uvc_sensor::open(const stream_profiles&)::<lambda(librealsense::platform::stream_profile, librealsense::platform::frame_object, std::function<void()>)> >::_M_invoke(const std::_Any_data &, librealsense::platform::stream_profile &&, librealsense::platform::frame_object &&, std::function<void()> &&) (
    __functor=..., __args#0=..., __args#1=..., __args#2=...) at /usr/include/c++/9/bits/move.h:182
#10 0x00007f8da843a082 in std::function<void (librealsense::platform::stream_profile, librealsense::platform::frame_object, std::function<void ()>)>::operator()(librealsense::platform::stream_profile, librealsense::platform::frame_object, std::function<void ()>) const (__args#2=..., __args#1=..., __args#0=..., this=0x564ea6671390)
    at /usr/include/c++/9/bits/std_function.h:683
#11 librealsense::platform::v4l_uvc_device::upload_video_and_metadata_from_syncer (this=0x564ea6671230, buf_mgr=...)
    at ./src/linux/backend-v4l2.cpp:1579
#12 0x00007f8da843bf8f in librealsense::platform::v4l_uvc_device::poll (this=0x564ea6671230)
    at ./src/linux/backend-v4l2.cpp:1505
#13 0x00007f8da843cfc0 in librealsense::platform::v4l_uvc_device::capture_loop (this=0x564ea6671230) at ./src/linux/backend-v4l2.cpp:1937
#14 0x00007f8da225fde4 in ?? () from /home/myusername/lib/libstdc++.so.6
#15 0x00007f8da2373609 in start_thread () from /home/myusername/lib/libpthread.so.0
#16 0x00007f8da1f4a163 in clone () from /home/myusername/lib/libc.so.6

I haven't applied any kernel patches. Hopefully it is as simple as doing that. All sage advice warmly welcomed!

Simon-McD commented 1 year ago

Correction: C/C++

MartyG-RealSense commented 1 year ago

Hi @Simon-McD I have not heard of kernel 5.4 becoming unsupported by Ubuntu, though the link below states that 5.4 has an end of life date of December 2025.

https://www.kernel.org/releases.html#longterm

Applying a kernel patch when building librealsense from source code is not compulsory, but skipping it can cause problems, such as hardware metadata timestamps not being supported.

If you are reliant on a particular kernel and updating it would be the worst case scenario, you can avoid this by building librealsense from source code with CMake with the build flag -DFORCE_RSUSB_BACKEND=true to bypass the kernel so that librealsense can operate without kernel patching and without dependence on a particular Linux version or kernel version. RSUSB is best suited to single-camera projects though, with a kernel-patched build being ideal for applications that use multiple cameras.

Simon-McD commented 1 year ago

MartyG, thanks for your prompt reply. These are single, standalone camera setups. I guess I will have to try and build myself with the kernel bypass option enabled. This may take a while.

MartyG-RealSense commented 1 year ago

I note that you are currently installing packages with apt instead of building from source code. With package installation it is not necessary to apply a kernel patch if you are using one of the listed supported kernels because the patch is bundled in the packages.

Simon-McD commented 1 year ago

I have built the sources successfully on my dev machine having followed installation.md. What is the best way to install this remotely on a test machine that has pre-built libs installed via apt (as described and listed in my initial post)?

MartyG-RealSense commented 1 year ago

If you can access a test machine on a remote connection then you could uninstall the existing package-based installation of librealsense using the Ubuntu terminal command below.

dpkg -l | grep "realsense" | cut -d " " -f 3 | xargs sudo dpkg --purge

You could then repeat your source-code installation procedure on the test machine over the remote connection.


An alternative approach may be to create an ISO disk image of your entire dev machine and then write that ISO file to the test computer to create a copy of your dev machine. If you need to install to multiple computers then using an ISO image should be faster and easier than individually building librealsense on every machine.

Simon-McD commented 1 year ago

Ok, I have it on a test machine. I will let it run for a day or two and check dmesg for both UVC errors and seg faults in our binary. Looks fine so far, not a single UVC error. Before the new lib, the errors started as soon as our app was launched. I will go quiet for a couple of days and return with updates later. Please nudge me if you feel I have left the issue dangling too long.

Thank you so much for your help!

MartyG-RealSense commented 1 year ago

You are very welcome. If I do not hear from you then I will check with you for an update in a week from now. Good luck!

Simon-McD commented 1 year ago

MartyG, we have good and bad news. The good news is that the RSUSB_BACKEND build has completely removed all the UVC errors that we were seeing in dmesg. The bad news is that the seg faults still occur and take our software down. It seems the two may not be related. Unfortunately, it may be necessary to focus on the backtrace of the core dump that I left in my initial post and understand why it is crashing. Sorry.

Simon-McD commented 1 year ago

In the 18 hours or so since I left the test running we had five crashes. Is there any mileage in leaving RS Viewer running with the same streams we would normally have in our software and with logging switched on? I also plan to roll back to 2.51 on another test system to check if it was our switch from 2.51 to 2.54 that introduced the issue. Are these reasonable steps? Are there other steps I could consider?

Simon-McD commented 1 year ago

I am now unable to revert my dev machine to 2.51. After purging all librealsense libs, removing the downloaded repo, rebooting, and then cleaning and rebuilding my binary project, I get a SIGABRT with the following RS error: "terminate called after throwing an instance of 'rs2::invalid_value_error' what(): API version mismatch: librealsense.so was compiled with API version 2.51.1 but the application was compiled with 2.54.1! Make sure correct version of the library is installed (make install)"

However, these are the libs apt knows to be installed:

librealsense2-dbg/focal,now 2.51.1-0~realsense0.7526 amd64 [installed,upgradable to: 2.54.1-0~realsense.9590]
librealsense2-dev/focal,now 2.51.1-0~realsense0.7526 amd64 [installed,upgradable to: 2.54.1-0~realsense.9590]
librealsense2-dkms/focal,now 1.3.19-0ubuntu1 all [installed]
librealsense2-gl/focal,now 2.51.1-0~realsense0.7526 amd64 [installed,upgradable to: 2.54.1-0~realsense.9590]
librealsense2-net/focal,now 2.51.1-0~realsense0.7526 amd64 [installed,upgradable to: 2.53.1-0~realsense0.8250]
librealsense2-udev-rules/focal,now 2.51.1-0~realsense0.7526 amd64 [installed,upgradable to: 2.54.1-0~realsense.9590]
librealsense2-utils/focal,now 2.51.1-0~realsense0.7526 amd64 [installed,upgradable to: 2.54.1-0~realsense.9590]
librealsense2/focal,now 2.51.1-0~realsense0.7526 amd64 [installed,upgradable to: 2.54.1-0~realsense.9590]

and these are the libs that "locate librealsense2.so" can find:

/home/myusername/lib/librealsense2.so.2.51
/home/myusername/libQt6_3/librealsense2.so.2.50
/home/myusername/pretendHome/home/myusername/lib/librealsense2.so.2.50
/home/myusername/pretendHome/home/myusername/lib/librealsense2.so.2.51
/home/myusername/temp/hv/lib/librealsense2.so.2.50
/home/myusername/temp/socv/home/myusername/lib/librealsense2.so.2.50
/usr/lib/x86_64-linux-gnu/librealsense2.so
/usr/lib/x86_64-linux-gnu/librealsense2.so.2.51
/usr/lib/x86_64-linux-gnu/librealsense2.so.2.51.1

Using ldd on my binary with grep librealsense gives

librealsense2.so.2.51 => /lib/x86_64-linux-gnu/librealsense2.so.2.51 (0x00007fee6c707000)

I have no idea what is persisting a dependence on 2.54, despite the library not being on the machine. I can't build a usable binary, please help.
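
Editor's note: the wording of the error ("the application was compiled with 2.54.1") suggests that the 2.54 figure comes from the headers found at compile time, not from any .so file that ldd or locate can see. A leftover header from an earlier install would therefore be enough to trigger it even with only 2.51 libraries on disk. The sketch below is a simplified model of such a check, not the librealsense source: every function name is hypothetical, and the major*10000 + minor*100 + patch encoding and the major.minor comparison are assumptions made for illustration.

```cpp
#include <stdexcept>
#include <string>

// Packs a version triple into one integer (assumed encoding, see note above).
constexpr int encode_version(int major, int minor, int patch) {
    return major * 10000 + minor * 100 + patch;
}

// Turns the packed integer back into "major.minor.patch" for messages.
std::string version_string(int v) {
    return std::to_string(v / 10000) + "." +
           std::to_string((v / 100) % 100) + "." +
           std::to_string(v % 100);
}

// app_compiled_with is frozen into the application binary from whichever
// header the compiler found; library_built_as comes from the shared
// library that is loaded at run time. Only major.minor is compared here.
void check_api_version(int app_compiled_with, int library_built_as) {
    if (app_compiled_with / 100 != library_built_as / 100) {
        throw std::runtime_error(
            "API version mismatch: library was compiled with " +
            version_string(library_built_as) +
            " but the application was compiled with " +
            version_string(app_compiled_with));
    }
}
```

With this model, check_api_version(encode_version(2, 54, 1), encode_version(2, 51, 1)) throws, which mirrors the situation described above: a 2.51 library at run time and a 2.54 header at build time.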

MartyG-RealSense commented 1 year ago

Instead of using the Viewer log, it may be more relevant to your project to enable verbose (more detailed) UVC driver logging in dmesg using the instructions at the link below.

https://dev.intelrealsense.com/docs/troubleshooting#uvc-video-module-traces

Looking at your uvcvideo log, the error segfault at 18 ip is extremely rare and I could find only one other case in which it has occurred. A RealSense team member at https://github.com/IntelRealSense/librealsense/issues/2029#issuecomment-414712195 suggested adding a hardware_reset() mechanism to reset the camera when the script is run in order to deal with this error.

MartyG-RealSense commented 1 year ago

If you installed librealsense from packages then use the purge command mentioned earlier.

If librealsense was built from source code then go to the 'build' directory of the 2.51.1 source code folder (if it still exists) and input the command below to uninstall librealsense and clean up CMake:

sudo make uninstall && make clean

If neither of those approaches work then a clean wipe and reinstall of the test computer is likely the only remaining option to clear the problem.

Simon-McD commented 1 year ago

Ok, thanks.

Simon-McD commented 1 year ago

Praise be! sudo make uninstall && make clean for the win! I now have a test machine running 2.51 and I can flip between self-build and pre-built librealsense2 on the dev machine, sincere thanks, MartyG!

Right, back to hunting the crash bug. If I understand you correctly, you want me to reset the device and relaunch our app after a segfault. We already check for our process in the process list and relaunch it if it isn't there (e.g. after the seg fault), which seems to work most of the time, but we are missing things under the cam that we should be analysing, and the times when the relaunch doesn't work are very problematic.

MartyG-RealSense commented 1 year ago

The hardware_reset() instruction only resets (disconnects and reconnects) the camera hardware, not the application. So it would not be helpful for recovering from a segfault. Performing a hardware_reset() when the application is first run, before the error occurs, may improve the camera's stability. It sounds as though you already have a recovery mechanism implemented though.

Looking at your backtrace log, I observe that it begins on the first line with a pthread_mutex_lock message. As a general rule for debugging, I start with the first line of the log in case an error at the start causes a cascade failure that registers errors with components further down the log that do not actually have a problem. This helps to avoid investigating false leads.

In a segmentation fault case at https://github.com/IntelRealSense/librealsense/issues/10112 with the pthread_mutex_lock message, I recommended checking for the possibility of a memory leak in the application.

Simon-McD commented 1 year ago

To my understanding, frame #0, the first line in the backtrace, is the last code executed before the seg fault ( https://sourceware.org/gdb/onlinedocs/gdb/Backtrace.html ). So the first call in the backtrace that is attributable is:

in librealsense::platform::v4l_uvc_device::capture_loop (this=0x564ea6671230) at ./src/linux/backend-v4l2.cpp:1937

Which calls the line above itself in the trace, and so on. This is a single-threaded app so the origin of the pthread_mutex_lock issue is ultimately librealsense::platform::v4l_uvc_device::capture_loop. I will look inside backend-v4l2.cpp tomorrow.
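
Editor's note: the numbers in the original post line up. Frame #5 shows el::base::RegisteredLoggers::get called with this=0x0, frames #1 to #3 show the mutex inside that object at 0x8, and the marked instruction bytes in dmesg (<8b> 47 10) decode to a 32-bit load from 0x10 past the mutex pointer, giving the reported fault address of 18 (hex). This points to a null logger-registry pointer being dereferenced rather than to anything in the capture loop itself. Frame #15 (start_thread) also indicates that the capture loop runs on a thread created by the library. The sketch below only reproduces the address arithmetic; both structs are hypothetical stand-ins on a 64-bit build, not the real easylogging++ or glibc definitions.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for a mutex whose fifth 32-bit field sits at offset 16 (0x10),
// the offset read by the faulting instruction.
struct FakeMutex {
    std::int32_t  lock;
    std::uint32_t count;
    std::int32_t  owner;
    std::uint32_t nusers;
    std::int32_t  kind;
};

// Stand-in for an object that holds one pointer-sized member and then the
// mutex, which puts the mutex at offset 8 (matching __mutex=0x8).
struct FakeRegistry {
    void*     first_member;
    FakeMutex mutex;
};

// Address touched when code reads mutex.kind through an object located at
// object_base. A null object therefore faults at a small non-zero address.
constexpr std::uintptr_t fault_address(std::uintptr_t object_base) {
    return object_base + offsetof(FakeRegistry, mutex) +
           offsetof(FakeMutex, kind);
}
```

For a null object, fault_address(0) is 0x8 + 0x10, the same value that dmesg printed as "segfault at 18".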

MartyG-RealSense commented 1 year ago

Thanks very much. I look forward to your next report. Good luck!

Simon-McD commented 1 year ago

I have found the reason for our app crash and, you will be pleased to hear, it has nothing to do with librealsense2 nor UVC errors. It seems that the process that launches and monitors our app switched to a different Linux user at the same time we migrated to 2.54 🤦 This different user is the basis of the crash. I believe the reason it always happened while librealsense2 is polling the camera/waiting for a frame is because that is where it spends most of its time in our framewise handling tight loop.

Sorry for wasting your time, but we can console ourselves that I have learned quite a lot during this conversation. I can ignore the dmesg UVC errors (as I always used to) and I can now easily switch between pre-built libs and self-built (using the FORCE_RSUSB_BACKEND switch if I want to get rid of the UVC errors). Many thanks for your patient and helpful support, MartyG, you may now close this issue 👍
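
Editor's note: since the root cause was the launcher starting the app under a different Linux user, recording the effective user at startup would make that kind of change visible in the logs. A minimal sketch using POSIX calls follows; the function names and the expected_user parameter are hypothetical, and how the expected name is configured is left to the deployment.

```cpp
#include <pwd.h>
#include <sys/types.h>
#include <unistd.h>
#include <string>

// Returns the login name for the effective user ID, or the numeric ID as
// text when the account has no passwd entry (for example in a container).
std::string effective_user_name() {
    const uid_t uid = geteuid();
    if (const passwd* entry = getpwuid(uid)) {
        return entry->pw_name;
    }
    return std::to_string(uid);
}

// Builds a one-line startup record that flags an unexpected account.
std::string startup_identity_line(const std::string& expected_user) {
    const std::string actual = effective_user_name();
    const std::string verdict = (actual == expected_user) ? "ok" : "UNEXPECTED";
    return "running as user '" + actual + "' (" + verdict + ")";
}
```

Writing the result of startup_identity_line() to the application log on every launch gives a quick way to rule the launching account in or out when a crash appears after a deployment change.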

MartyG-RealSense commented 1 year ago

It's no trouble at all. I'm pleased that you found the cause and are satisfied with the outcome of the issue. As you suggest, I will close it. Thanks very much for the update! :)