IntelRealSense / librealsense

Intel® RealSense™ SDK
https://www.intelrealsense.com/
Apache License 2.0

SIGSEGV from rs2_config_resolve() after hardware reset #11304

Closed jomiham closed 1 year ago

jomiham commented 1 year ago
| Required Info | |
|---|---|
| Camera Model | D435 |
| Firmware Version | 05.14.00.00 |
| Operating System & Version | Linux (Ubuntu 22.04.1) |
| Kernel Version (Linux Only) | 5.15.0 |
| Platform | PC |
| SDK Version | 2.53.1 |
| Language | kotlin / java / jna |
| Segment | others |

Issue Description

I am migrating a system from an older setup with Ubuntu 18.04 to a new one with 22.04 and am having intermittent trouble with the librealsense integration.

The system uses two D435s to get camera and depth feeds, and the setup process is basically this:

  1. Setup shared context
  2. Setup pipelines/configs per camera and enable streams
  3. Resolve depth and color devices
  4. Create/get depth and color sensors
  5. Restart each camera using rs2_hardware_reset() + rs2_delete_device() (to get to a known state)
  6. Wait a bit (500ms)
  7. Resolve devices and sensors again (retry until successful)
  8. Set options and start everything up...

What I'm seeing is that, in step 7, rs2_config_resolve() often fails with an error code. I assume this is some timing issue after the reset, and usually everything starts working after one or two automatic retries.

However, sometimes one of the retries will throw a SIGSEGV and core dump. (Note: For the sample program below, I only have one camera so it is not due to multiple cameras)

From what I can see, the problem seems to be on this line: https://github.com/IntelRealSense/librealsense/blob/v2.53.1/src/ds5/ds5-device.cpp#L733

since all_device_infos does not contain any device with mi==0 (see the stack trace below).
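The gdb output in the stack trace section is consistent with this reading: depth_devices is empty and the only enumerated interface has mi = 3. As an illustration only (this is not librealsense code, and `UvcInfo` and `depthPortOrNull` are hypothetical names), the Kotlin sketch below shows the kind of guard that avoids reading the first element of an empty filtered list. It assumes the crashing line takes the first mi == 0 entry, which this thread does not confirm.

```kotlin
// Hypothetical stand-in for one enumerated UVC interface (not a librealsense type).
data class UvcInfo(val id: String, val mi: Int, val devicePath: String)

// Returns the device path of the first depth interface (mi == 0),
// or null when enumeration has not produced one yet.
fun depthPortOrNull(infos: List<UvcInfo>): String? =
    infos.firstOrNull { it.mi == 0 }?.devicePath

// Mirrors the state seen in the core dump: a single entry with mi == 3.
val partialEnumeration = listOf(UvcInfo("/dev/video4", 3, "/dev/video4"))
```

With `partialEnumeration`, `depthPortOrNull` returns null, so a caller could report "device not ready" instead of dereferencing a missing entry.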

Changing how long I wait after reset seems to have an effect and 500ms is arbitrary based on how quickly the camera usually pops back up on the USB bus, but given that reset time could vary with temperatures etc. I need to find a "safe" solution.
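One way to avoid tuning a single fixed delay is to poll against a deadline. The sketch below is a generic helper under that assumption; `pollUntil` and its parameters are illustrative names, and `attempt` stands in for one resolve try. It only addresses the error-code case and would not protect against the native SIGSEGV described above.

```kotlin
// Calls `attempt` repeatedly until it returns a non-null value or `timeoutMs` has elapsed.
// Returns the value, or null on timeout. `attempt` is a placeholder for one resolve try.
fun <T : Any> pollUntil(timeoutMs: Long, intervalMs: Long, attempt: () -> T?): T? {
    val start = System.nanoTime()
    val timeoutNanos = timeoutMs * 1_000_000
    while (true) {
        // Return as soon as one attempt succeeds.
        attempt()?.let { return it }
        // Give up once the deadline has passed; otherwise wait and try again.
        if (System.nanoTime() - start >= timeoutNanos) return null
        Thread.sleep(intervalMs)
    }
}
```

The deadline can be generous (several seconds) without slowing the common case, because the helper returns as soon as one attempt succeeds.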

What am I missing here? It feels like the system has hit a timing edge case that triggers a bug. But apart from the SIGSEGV bug, I would really appreciate any hint on whether there is a better way of doing the setup, one that ensures the system always starts in the same state without risking timing issues. On Ubuntu 18, we used SDK v2.30, and I am starting to suspect that we had the same timing issue there but with different symptoms, since the code that triggers the SIGSEGV seems to have been added after v2.30...

Core dump stack trace

(gdb) where
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140709449373248) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140709449373248) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140709449373248, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ff979d81476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ff979d677f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ff9791194b7 in ?? () from /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#6  0x00007ff979a5d211 in ?? () from /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#7  0x00007ff9798aa5f4 in JVM_handle_linux_signal () from /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#8  0x00007ff97989d6ac in ?? () from /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#9  <signal handler called>
#10 0x00007ff978e44e1c in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007ff956fb99d6 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::assign (__str=..., this=<optimized out>, this=<optimized out>, __str=...)
    at /usr/include/c++/11/bits/basic_string.h:1387
#12 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::operator= (__str=..., this=<optimized out>, this=<optimized out>, __str=...)
    at /usr/include/c++/11/bits/basic_string.h:681
#13 librealsense::info_container::register_info (this=0x7ff974670cd0, info=<optimized out>, val=<error reading variable: Cannot access memory at address 0x50>) at ./src/sensor.cpp:766
#14 0x00007ff956cce27a in librealsense::ds5_device::create_depth_device (this=0x7ff97465ae18, ctx=..., all_device_infos=std::vector of length 1, capacity 1 = {...})
    at ./src/ds5/ds5-device.cpp:733
#15 0x00007ff956cd46dc in librealsense::ds5_device::ds5_device (this=<optimized out>, __vtt_parm=<optimized out>, ctx=..., group=..., this=<optimized out>, __vtt_parm=<optimized out>,
    ctx=..., group=...) at ./src/ds5/ds5-device.cpp:756
#16 0x00007ff956cfd30b in librealsense::rs435_device::rs435_device (this=<optimized out>, ctx=..., group=..., register_device_notifications=<optimized out>, this=<optimized out>,
    ctx=..., group=..., register_device_notifications=<optimized out>) at ./src/ds5/ds5-factory.cpp:547
#17 0x00007ff956d04bac in __gnu_cxx::new_allocator<librealsense::rs435_device>::construct<librealsense::rs435_device, std::shared_ptr<librealsense::context>&, librealsense::platform::backend_device_group&, bool&> (this=<optimized out>, __p=0x7ff97465aab0) at /usr/include/c++/11/ext/new_allocator.h:162
#18 std::allocator_traits<std::allocator<librealsense::rs435_device> >::construct<librealsense::rs435_device, std::shared_ptr<librealsense::context>&, librealsense::platform::backend_device_group&, bool&> (__p=0x7ff97465aab0, __a=...) at /usr/include/c++/11/bits/alloc_traits.h:516
#19 std::_Sp_counted_ptr_inplace<librealsense::rs435_device, std::allocator<librealsense::rs435_device>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<std::shared_ptr<librealsense::context>&, librealsense::platform::backend_device_group&, bool&> (__a=..., this=0x7ff97465aaa0) at /usr/include/c++/11/bits/shared_ptr_base.h:519
#20 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<librealsense::rs435_device, std::allocator<librealsense::rs435_device>, std::shared_ptr<librealsense::context>&, librealsense::platform::backend_device_group&, bool&> (__a=..., __p=@0x7ff978beef00: 0x0, this=0x7ff978beef08) at /usr/include/c++/11/bits/shared_ptr_base.h:650
#21 std::__shared_ptr<librealsense::rs435_device, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<librealsense::rs435_device>, std::shared_ptr<librealsense::context>&, librealsense::platform::backend_device_group&, bool&> (__tag=..., this=0x7ff978beef00) at /usr/include/c++/11/bits/shared_ptr_base.h:1342
#22 std::shared_ptr<librealsense::rs435_device>::shared_ptr<std::allocator<librealsense::rs435_device>, std::shared_ptr<librealsense::context>&, librealsense::platform::backend_device_group&, bool&> (__tag=..., this=0x7ff978beef00) at /usr/include/c++/11/bits/shared_ptr.h:409
#23 std::allocate_shared<librealsense::rs435_device, std::allocator<librealsense::rs435_device>, std::shared_ptr<librealsense::context>&, librealsense::platform::backend_device_group&, bool&> (__a=...) at /usr/include/c++/11/bits/shared_ptr.h:863
#24 std::make_shared<librealsense::rs435_device, std::shared_ptr<librealsense::context>&, librealsense::platform::backend_device_group&, bool&> ()
    at /usr/include/c++/11/bits/shared_ptr.h:879
#25 librealsense::ds5_info::create (this=<optimized out>, ctx=..., register_device_notifications=<optimized out>) at ./src/ds5/ds5-factory.cpp:1097
#26 0x00007ff956cef6d2 in librealsense::device_info::create_device (this=<optimized out>) at ./src/context.h:48
#27 0x00007ff956f7b79b in librealsense::device_hub::create_device (this=0x7ff97430e020, serial="834412071623", cycle_devices=false) at ./src/device_hub.cpp:83
#28 0x00007ff956f7cabc in librealsense::device_hub::wait_for_device (this=0x7ff97430e020, timeout=..., loop_through_devices=false, serial="834412071623") at ./src/device_hub.cpp:128
#29 0x00007ff956ee97eb in librealsense::pipeline::pipeline::wait_for_device (this=<optimized out>, timeout=..., serial="834412071623") at ./src/pipeline/pipeline.cpp:162
#30 0x00007ff956eed603 in librealsense::pipeline::config::resolve_device_requests (this=0x7ff9745a90a0, pipe=..., timeout=...) at ./src/pipeline/config.cpp:303
#31 0x00007ff956eed7bc in librealsense::pipeline::config::resolve (this=0x7ff9745a90a0, pipe=std::shared_ptr<librealsense::pipeline::pipeline> (use count 3, weak count 1) = {...},
    timeout=...) at ./src/pipeline/config.cpp:189
#32 0x00007ff956f9d8b5 in rs2_config_resolve (config=<optimized out>, pipe=<optimized out>, error=0x7ff974645db0) at ./src/rs.cpp:2058
#33 0x00007ff957c16052 in ?? ()
#34 0x0000000000000007 in ?? ()
#35 0x00007ff978bf06d0 in ?? ()
#36 0x00007ff978befbc0 in ?? ()
#37 0x00007ff957c14f4c in ?? ()
#38 0x00007ff978bf09e0 in ?? ()
#39 0x00007ff97400a000 in ?? ()
#40 0x00007ff900000000 in ?? ()
#41 0x00007ff978bef9f0 in ?? ()
#42 0x00007ff978bf0640 in ?? ()
#43 0x0000000000000000 in ?? ()

(gdb) frame 14
#14 0x00007ff956cce27a in librealsense::ds5_device::create_depth_device (this=0x7ff97465ae18, ctx=..., all_device_infos=std::vector of length 1, capacity 1 = {...})
    at ./src/ds5/ds5-device.cpp:733
733 ./src/ds5/ds5-device.cpp: No such file or directory.

(gdb) info locals
backend = <optimized out>
depth_devices = std::vector of length 0, capacity 0
timestamp_reader_backup = std::unique_ptr<librealsense::frame_timestamp_reader> = {get() = 0x0}
timestamp_reader_from_metadata = <optimized out>
timestamp_reader_metadata = std::unique_ptr<librealsense::frame_timestamp_reader> = {get() = 0x0}
enable_global_time_option = std::shared_ptr<librealsense::global_time_option> (use count 2, weak count 0) = {get() = 0x7ff9746463f0}
raw_depth_ep = std::shared_ptr<librealsense::uvc_sensor> (use count 2, weak count 1) = {get() = 0x7ff974659700}
depth_ep = std::shared_ptr<librealsense::ds5_depth_sensor> (use count 1, weak count 1) = {get() = 0x7ff974670820}

(gdb) p all_device_infos
$1 = std::vector of length 1, capacity 1 = {{id = "/dev/video4", vid = 32902, pid = 2823, mi = 3, unique_id = "2-1-30",
    device_path = "/sys/devices/pci0000:00/0000:00:14.0/usb2/2-1/2-1:1.3/video4linux/video4", serial = "", conn_spec = librealsense::platform::usb3_2_type, uvc_capabilities = 69206017,
    has_metadata_node = false, metadata_node_id = ""}}

Sample

I don't know if it helps, but this is the basic sample program that sometimes recreates the core dump (maybe one in 10/20 runs):

fun ce(error: LongByReference) {
    val value = error.value
    // Resetting error.value is important since we use a global error variable (maybe we should change that).
    // Consecutive calls to rs2_XXX DO NOT set this value to 0L on success; they only set a value on error. Therefore we must set it to 0L
    // somewhere (here) if we want to be able to continue to execute after we get an error on an rs2_XXX call.
    error.value = 0L
    if (value != 0L) {
        throw RealsenseException("Realsense error: " + librealsense2.rs2_get_failed_function(value) + "(" + librealsense2.rs2_get_failed_args(value) + "): " + librealsense2.rs2_get_error_message(value))
    }
}

private fun findSensors(dev: Pointer): Pair<Pointer, Pointer> {
    val error = LongByReference()
    val list = rs2_query_sensors(dev, error)
    ce(error)
    val n = rs2_get_sensors_count(list, error)
    ce(error)
    var color = Pointer(0)
    var depth = Pointer(0)
    for (i in 0 until n) {
        val sensor = rs2_create_sensor(list, i, error)
        ce(error)

        if (rs2_is_sensor_extendable_to(sensor, ExtensionType.RS2_EXTENSION_DEPTH_SENSOR, error)) {
            depth = sensor
        } else {
            color = sensor
        }
        ce(error)
    }
    rs2_delete_sensor_list(list)

    return Pair(color, depth)
}

fun getDevice(config: Pointer, pipeline: Pointer): Pointer? {
    val error = LongByReference()

    println("Resolving profile for config...")
    var profile = librealsense2.rs2_config_resolve(config, pipeline, error)
    var cnt = 0
    while (error.value != 0L && cnt < 20) {
        println("!!! Resolving profile for config => ERROR=${error.value}")
        Thread.sleep(20)
        error.value = 0L
        println("Resolving profile for config...")
        profile = librealsense2.rs2_config_resolve(config, pipeline, error)
        cnt++
    }

    println("Resolving device for profile...")
    val device = librealsense2.rs2_pipeline_profile_get_device(profile, error)
    ce(error)

    if (device != null) {
        println("Found device @$device")

        val name = librealsense2.rs2_get_device_info(device, librealsense2.CameraInfo.RS2_CAMERA_INFO_NAME, error)
        ce(error)

        println(" name: $name")

        val usbPortId = librealsense2.rs2_get_device_info(device, librealsense2.CameraInfo.RS2_CAMERA_INFO_USB_TYPE_DESCRIPTOR, error)
        ce(error)

        println(" port: $usbPortId")
    }

    return device
}

fun resetCamera(deviceD: Pointer, deviceC: Pointer) {
    val error = LongByReference()

    println("Performing hardware reset...")
    librealsense2.rs2_hardware_reset(deviceD, error)
    ce(error)

    println("Deleting devices...")
    librealsense2.rs2_delete_device(deviceD)
    librealsense2.rs2_delete_device(deviceC)
}

fun main() {
    val error = LongByReference()

    println("Creating context...")
    val context = librealsense2.rs2_create_context(API_VERSION, error)
    ce(error)

    println("Creating pipelineD...")
    val pipelineDepth = librealsense2.rs2_create_pipeline(context, error)
    ce(error)

    println("Creating configD...")
    val configDepth = librealsense2.rs2_create_config(error)
    ce(error)

    println("Creating pipelineC...")
    val pipelineColor = librealsense2.rs2_create_pipeline(context, error)
    ce(error)

    println("Creating configC...")
    val configColor = librealsense2.rs2_create_config(error)
    ce(error)

    println("Enabling streams...")
    librealsense2.rs2_config_enable_stream(configDepth, librealsense2.Stream.RS2_STREAM_DEPTH, -1, 848, 480, librealsense2.Format.RS2_FORMAT_Z16, 60, error)
    ce(error)

    librealsense2.rs2_config_enable_stream(configDepth, librealsense2.Stream.RS2_STREAM_INFRARED, -1,848, 480, librealsense2.Format.RS2_FORMAT_Y8, 60, error)
    ce(error)

    librealsense2.rs2_config_enable_stream(configColor, librealsense2.Stream.RS2_STREAM_COLOR, -1, 640, 480, librealsense2.Format.RS2_FORMAT_RGB8, 30, error)
    ce(error)

    println("Enabling devices for serial...")
    librealsense2.rs2_config_enable_device(configDepth, "834412071623", error)
    ce(error)
    librealsense2.rs2_config_enable_device(configColor, "834412071623", error)
    ce(error)

    var deviceDepth = getDevice(configDepth, pipelineDepth)
    var sensorDepth = findSensors(deviceDepth!!).second
    var deviceColor = getDevice(configColor, pipelineColor)
    var sensorColor = findSensors(deviceColor!!).first

    resetCamera(deviceDepth!!, deviceColor!!)

    Thread.sleep(500)

    // SIGSEGV from first rs2_config_resolve() call in getDevice() below 
    deviceDepth = getDevice(configDepth, pipelineDepth)
    sensorDepth = findSensors(deviceDepth!!).second
    deviceColor = getDevice(configColor, pipelineColor)
    sensorColor = findSensors(deviceColor!!).first

    if (deviceDepth != null && sensorDepth != null && deviceColor != null && sensorColor != null)
        println("SUCCESS!!!")
}
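
The comment in ce() wonders whether the shared error variable should be changed. A minimal sketch of one alternative is shown below: each native call gets a fresh error holder, so nothing has to be reset by hand. `ErrorRef` is a stand-in for JNA's LongByReference so that the snippet is self-contained, and `rsCall`, `NativeCallException` and `describe` are hypothetical names, not part of librealsense or JNA.

```kotlin
// Stand-in for JNA's LongByReference so the sketch is self-contained.
class ErrorRef(var value: Long = 0L)

class NativeCallException(message: String) : RuntimeException(message)

// Runs one native call with its own error holder and throws if the call set it.
// `describe` is a hypothetical hook that turns the error handle into text.
fun <T> rsCall(describe: (Long) -> String = { "error handle $it" }, call: (ErrorRef) -> T): T {
    val error = ErrorRef()
    val result = call(error)
    if (error.value != 0L) throw NativeCallException(describe(error.value))
    return result
}
```

In the sample above, a call such as rs2_create_pipeline(context, error) followed by ce(error) would become a single `rsCall { e -> ... }` expression, assuming the binding accepts the holder type.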

A successful run can look like this:

Creating context...
Creating pipelineD...
Creating configD...
Creating pipelineC...
Creating configC...
Enabling streams...
Enabling devices for serial...
Resolving profile for config...
Resolving device for profile...
Found device @native@0x7fe4745c32e0
 name: Intel RealSense D435
 port: 3.2
Resolving profile for config...
Resolving device for profile...
Found device @native@0x7fe47463daa0
 name: Intel RealSense D435
 port: 3.2
Performing hardware reset...
Deleting devices...
Resolving profile for config...
!!! Resolving profile for config => ERROR=140619181701472
Resolving profile for config...
!!! Resolving profile for config => ERROR=140619182202288
Resolving profile for config...
Resolving device for profile...
Found device @native@0x7fe47466a6b0
 name: Intel RealSense D435
 port: 3.2
Resolving profile for config...
Resolving device for profile...
Found device @native@0x7fe4746cf890
 name: Intel RealSense D435
 port: 3.2
SUCCESS!!!

And a SIGSEGV like this:

Creating context...
Creating pipelineD...
Creating configD...
Creating pipelineC...
Creating configC...
Enabling streams...
Enabling devices for serial...
Resolving profile for config...
Resolving device for profile...
Found device @native@0x7f4c385c3040
 name: Intel RealSense D435
 port: 3.2
Resolving profile for config...
Resolving device for profile...
Found device @native@0x7f4c3863d820
 name: Intel RealSense D435
 port: 3.2
Performing hardware reset...
Deleting devices...
Resolving profile for config...
!!! Resolving profile for config => ERROR=139965340038848
Resolving profile for config...
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f4c3f547e1c, pid=1174624, tid=0x00007f4c3f2f4640
#
# JRE version: OpenJDK Runtime Environment (8.0_352-b08) (build 1.8.0_352-8u352-ga-1~22.04-b08)
# Java VM: OpenJDK 64-Bit Server VM (25.352-b08 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libstdc++.so.6+0x14be1c]  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x2c
#
# Core dump written. Default location: /home/system/Dash/core or core.1174624
#
# An error report file with more information is saved as:
# /home/system/Dash/hs_err_pid1174624.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted (core dumped)

/var/log/syslog

If everything runs without any errors, I get a clean reset in /var/log/syslog:

Jan  9 13:23:20 elvis kernel: [13439.940961] usb 2-1: USB disconnect, device number 28
Jan  9 13:23:20 elvis kernel: [13440.304988] usb 2-1: new SuperSpeed USB device number 29 using xhci_hcd
Jan  9 13:23:20 elvis kernel: [13440.325381] usb 2-1: New USB device found, idVendor=8086, idProduct=0b07, bcdDevice=50.e0
Jan  9 13:23:20 elvis kernel: [13440.325386] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Jan  9 13:23:20 elvis kernel: [13440.325387] usb 2-1: Product: Intel(R) RealSense(TM) Depth Camera 435
Jan  9 13:23:20 elvis kernel: [13440.325388] usb 2-1: Manufacturer: Intel(R) RealSense(TM) Depth Camera 435
Jan  9 13:23:20 elvis kernel: [13440.325390] usb 2-1: SerialNumber: 836313020547
Jan  9 13:23:20 elvis kernel: [13440.327924] usb 2-1: Found UVC 1.50 device Intel(R) RealSense(TM) Depth Camera 435  (8086:0b07)
Jan  9 13:23:20 elvis kernel: [13440.329892] input: Intel(R) RealSense(TM) Depth Ca as /devices/pci0000:00/0000:00:14.0/usb2/2-1/2-1:1.0/input/input170
Jan  9 13:23:20 elvis kernel: [13440.330357] usb 2-1: Found UVC 1.50 device Intel(R) RealSense(TM) Depth Camera 435  (8086:0b07)

But when I get an error code back from rs2_config_resolve() (regardless of whether it goes on to SIGSEGV or not), I see some extra warnings:

Jan  9 13:23:37 elvis kernel: [13457.616417] usb 2-1: USB disconnect, device number 29
Jan  9 13:23:37 elvis kernel: [13457.616526] xhci_hcd 0000:00:14.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Jan  9 13:23:37 elvis kernel: [13457.666291] xhci_hcd 0000:00:14.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Jan  9 13:23:37 elvis kernel: [13457.666396] usb 2-1: Failed to query (SET_CUR) UVC control 1 on unit 3: -108 (exp. 1024).
Jan  9 13:23:38 elvis kernel: [13457.940439] usb 2-1: new SuperSpeed USB device number 30 using xhci_hcd
Jan  9 13:23:38 elvis kernel: [13457.960850] usb 2-1: New USB device found, idVendor=8086, idProduct=0b07, bcdDevice=50.e0
Jan  9 13:23:38 elvis kernel: [13457.960854] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Jan  9 13:23:38 elvis kernel: [13457.960856] usb 2-1: Product: Intel(R) RealSense(TM) Depth Camera 435
Jan  9 13:23:38 elvis kernel: [13457.960857] usb 2-1: Manufacturer: Intel(R) RealSense(TM) Depth Camera 435
Jan  9 13:23:38 elvis kernel: [13457.960859] usb 2-1: SerialNumber: 836313020547
Jan  9 13:23:38 elvis kernel: [13457.963460] usb 2-1: Found UVC 1.50 device Intel(R) RealSense(TM) Depth Camera 435  (8086:0b07)
Jan  9 13:23:38 elvis kernel: [13457.965699] input: Intel(R) RealSense(TM) Depth Ca as /devices/pci0000:00/0000:00:14.0/usb2/2-1/2-1:1.0/input/input171
Jan  9 13:23:38 elvis kernel: [13457.966216] usb 2-1: Found UVC 1.50 device Intel(R) RealSense(TM) Depth Camera 435  (8086:0b07)

MartyG-RealSense commented 1 year ago

Hi @jomiham. Which method did you use to install the librealsense SDK, please? If you built it from source code with CMake, have you patched the kernel? And if you have patched it, did you use the patch script called patch-realsense-ubuntu-lts-hwe.sh? This is the script used for patching kernel 5.15 on Ubuntu 22.04 Jammy.

jomiham commented 1 year ago

Hi @MartyG-RealSense, the SDK etc. is installed with apt-get install (but I see now that librealsense2-udev-rules is behind for some reason):

librealsense2-dbg/jammy,now 2.53.1-0~realsense0.8251 amd64 [installed]
librealsense2-dev/jammy,now 2.53.1-0~realsense0.8251 amd64 [installed]
librealsense2-dkms/jammy,now 1.3.19-0ubuntu1 all [installed]
librealsense2-gl/jammy,now 2.53.1-0~realsense0.8251 amd64 [installed]
librealsense2-net/jammy,now 2.53.1-0~realsense0.8251 amd64 [installed,automatic]
librealsense2-udev-rules/now 2.51.1-0~realsense0.7526 amd64 [installed,upgradable to: 2.53.1-0~realsense0.8251]
librealsense2-utils/jammy,now 2.53.1-0~realsense0.8251 amd64 [installed]
librealsense2/jammy,now 2.53.1-0~realsense0.8251 amd64 [installed,automatic]

MartyG-RealSense commented 1 year ago

Okay, so the kernel patch should be already bundled within the packages of 2.53.1. Thanks very much for the information.

The udev rules handle devices in Ubuntu.

If the problem occurs after a hardware reset, as the issue title indicates, it could be because the camera is sometimes not re-detected in the second half of the reset, after it has been disconnected in the first half. That could in turn be caused by a USB-related problem.

jomiham commented 1 year ago

> If the problem occurs after hardware reset as the issue title indicates, this could be because a camera is sometimes not re-detected after it has been disconnected in the first half of a reset.

Yes, my guess is also that it is somehow related to the reconnect sequence/timing after the reset. From the stack trace, it looks like the rs2_config_resolve() call internally uses device_hub::wait_for_device(), so I would expect it to return once the device has reconnected, but maybe there is a better way of "polling" for when the device is available again?

MartyG-RealSense commented 1 year ago

It is possible to check if a camera is currently active ("busy"), as discussed at https://github.com/IntelRealSense/librealsense/issues/2240

Alternatively, you can listen for disconnection and reconnection events. So if a reconnection event is detected then that would indicate that the camera is likely available.

C++ https://github.com/IntelRealSense/librealsense/issues/931

Python https://github.com/IntelRealSense/librealsense/issues/4212
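
For a JNA-based setup like the one in this issue, the event approach could be sketched as a small latch that a devices-changed callback releases. The C API has rs2_set_devices_changed_callback(), but binding it through JNA is an assumption that is not shown here, and `ReconnectWaiter` and `onDevicesAdded` are hypothetical names.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit

// Blocks a caller until a device with the expected serial number is reported as added.
class ReconnectWaiter(private val serial: String) {
    private val latch = CountDownLatch(1)

    // Call this from the devices-changed callback with the serials of newly added devices.
    fun onDevicesAdded(addedSerials: Collection<String>) {
        if (serial in addedSerials) latch.countDown()
    }

    // Returns true if the device reappeared within the timeout, false otherwise.
    fun await(timeoutMs: Long): Boolean = latch.await(timeoutMs, TimeUnit.MILLISECONDS)
}
```

The waiter would need to be created and registered before the reset is issued, so that a fast reconnect is not missed.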

It is good practice to sleep for a while after a hardware reset before issuing the next instruction, to ensure that the camera has completed the reset and is ready. I see that you are already doing this. A hardware reset should take 2 to 3 seconds to complete from start to finish. With a sleep specified in milliseconds, as in your Thread.sleep(500) call, 2 seconds is 2000 and 3 seconds is 3000.

jomiham commented 1 year ago

> This could be because of a USB related problem.

Possibly. We sometimes get warnings in syslog (more details at the bottom of the main post), but I have also seen several runs with SIGSEGV core dumps when there are no such USB-related warnings.

And interestingly, I basically only see the warnings (and the error codes from rs2_config_resolve()) when we wait close to 500ms (which is slightly longer than the camera usually takes to re-appear on the USB bus).

MartyG-RealSense commented 1 year ago

Typically, rs2_delete_device() is not used with hardware reset. When a hardware reset is initiated and the camera disconnects, it will be removed from the list of attached devices upon disconnection anyway.

jomiham commented 1 year ago

> Typically, rs2_delete_device() is not used with hardware reset. When a hardware reset is initiated and the camera disconnects, it will be removed from the list of attached devices upon disconnection anyway.

OK. I have removed the rs2_delete_device() calls after reset but still got intermittent SIGSEGVs.

I assume it doesn't matter which device we reset (depth/color in this case) since it will reset the entire camera?

MartyG-RealSense commented 1 year ago

hardware_reset resets the entire camera and not individual sensors, yes.

MartyG-RealSense commented 1 year ago

Unless there is a problem with camera activation that occurs consistently then performing a hardware reset after initial enabling of the pipeline is likely to be unnecessary. Plus, if the sensors have been successfully defined and gotten then the pipeline is already proven to be accessible.

jomiham commented 1 year ago

> Unless there is a problem with camera activation that occurs consistently then performing a hardware reset after initial enabling of the pipeline is likely to be unnecessary. Plus, if the sensors have been successfully defined and gotten then the pipeline is already proven to be accessible.

I believe the initial restart is more to ensure that we have full control of what settings etc are used. But, unfortunately, we sometimes have intermittent hanging/freezing that requires us to restart the cameras on-demand as well so we need to be able to do this regardless.

MartyG-RealSense commented 1 year ago

If the entire port is reset with a general non-RealSense Ubuntu bash script instead of resetting the camera device specifically then it is not necessary for the camera to be detectable in order to complete the reset. An example of such a script is at https://github.com/IntelRealSense/librealsense/issues/8393

jomiham commented 1 year ago

> If the entire port is reset with a general non-RealSense Ubuntu bash script instead of resetting the camera device specifically then it is not necessary for the camera to be detectable in order to complete the reset. An example of such a script is at #8393

I'm not sure I follow what you mean by "complete the reset"? Wouldn't there still be a sensitive period while starting up regardless of how the reset is done (i.e. we still need to know when it is safe to start polling)?

jomiham commented 1 year ago

> Alternatively, you can listen for disconnection and reconnection events. So if a reconnection event is detected then that would indicate that the camera is likely available.
>
> C++ #931
>
> Python #4212

Yes, a callback event when the device is available would be great, but is there an actively maintained Java wrapper? I found a reference to an Android wrapper, but I'm not sure if that would work for a generic Java environment?

jomiham commented 1 year ago

So far, the only (temporary) workaround I have is to wait a "really long time" after resetting to ensure that we don't try to access it during this sensitive restart time. I have seen the odd SIGSEGV even when waiting for 1-2 seconds and I'm guessing it could be even longer in cold temperatures etc.

The drawback is that the system is quite time sensitive so we don't want to wait any longer than we have to (especially when restarting while running). But since waiting too short can crash the whole system, I need to have quite a bit of margin.

I assume the expected behaviour for rs2_config_resolve() / rs2_config_can_resolve() in the cases I get SIGSEGVs is to simply return an error code? I have seen both No device connected and Couldn't resolve requests errors so far (when it doesn't crash of course). In that case, we can go back to waiting for e.g. 1s and then try/retry resolving until it succeeds.
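
The wait-then-retry idea in the previous paragraph can be sketched so that it retries only on the two messages mentioned there and stops immediately on anything else. This is a sketch under assumptions: `retryTransient` is a hypothetical helper, `attempt` stands in for one resolve try that returns null on success or the error text on failure, and matching on message text is only as reliable as those messages are stable.

```kotlin
// Error texts reported in this thread while the camera is coming back after a reset.
val transientErrors = listOf("No device connected", "Couldn't resolve requests")

// Returns true once `attempt` succeeds, false if every attempt failed with a transient error.
// Any other error message aborts the retry loop with an IllegalStateException.
fun retryTransient(maxAttempts: Int, delayMs: Long, attempt: () -> String?): Boolean {
    repeat(maxAttempts) {
        val message = attempt() ?: return true
        if (transientErrors.none { message.contains(it) }) error("Unexpected failure: $message")
        Thread.sleep(delayMs)
    }
    return false
}
```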

Would it be optimistic to hope for such a bugfix in the near future?

MartyG-RealSense commented 1 year ago

What I mean by "complete the reset" is that the camera hardware reset process depends on the camera being able to be detected again once it has been disconnected, otherwise the reset process will fail at the halfway point.

When an Ubuntu bash script is used to reset the port in general instead of a specific device attached to the port, it can be assumed that if the camera is able to be detected after the port reset then it should be able to be detected within a couple of seconds. To demonstrate this, you could launch the RealSense Viewer tool and then plug the camera in and observe how long it takes for the camera to appear in the Viewer's options side-panel.

So a safe period to wait after running the Ubuntu port reset script before trying to access the camera may be 3 seconds.

Regarding a Java wrapper, there is a non-Android one that a RealSense user contributed at the link below but it is now several years old. Also, this wrapper is not part of the official RealSense SDK and is only available at the below link's 'forked' version of the SDK which will be hugely out of date compared to the current official SDK version.

https://github.com/edwinRNDR/librealsense/tree/master/wrappers/java

This wrapper was hardly ever used by RealSense users, who developed in Java via the Android wrapper instead.

You may be able to put an exception catch mechanism in your code so that if an error occurs, the program catches it but keeps running instead of crashing. An example of such a mechanism in Java code is at https://github.com/IntelRealSense/librealsense/issues/3295

// Note: `streaming` and `frame_thread` are globals defined elsewhere in the original app.
extern "C"
JNIEXPORT void JNICALL
Java_com_example_realsense_1app_MainActivity_stopStreaming(JNIEnv *env, jobject instance) {
    try {
        if (!streaming)
            return;

        streaming = false;
        if (frame_thread.joinable())
            frame_thread.join();
    }
    catch (const std::exception& ex)
    {
        // Rethrow the C++ exception as a Java exception instead of crashing.
        jclass jcls = env->FindClass("java/lang/Exception");
        env->ThrowNew(jcls, ex.what());
    }
}

jomiham commented 1 year ago

> What I mean by "complete the reset" is that the camera hardware reset process depends on the camera being able to be detected again once it has been disconnected, otherwise the reset process will fail at the halfway point.
>
> When an Ubuntu bash script is used to reset the port in general instead of a specific device attached to the port, it can be assumed that if the camera is able to be detected after the port reset then it should be able to be detected within a couple of seconds. To demonstrate this, you could launch the RealSense Viewer tool and then plug the camera in and observe how long it takes for the camera to appear in the Viewer's options side-panel.
>
> So a safe period to wait after running the Ubuntu port reset script before trying to access the camera may be 3 seconds.

Sorry if I am being slow here, but looking through that discussion, it seems the author used the USB reset script to simulate a USB error and then hardware_reset() + wait to recover from it?

> You may be able to put an exception catch mechanism in your code so that if an error occurs, the program catches it but keeps running instead of crashing. An example of such a mechanism in Java code is at https://github.com/IntelRealSense/librealsense/issues/3295
>
>     extern "C"
>     JNIEXPORT void JNICALL
>     Java_com_example_realsense_1app_MainActivity_stopStreaming(JNIEnv *env, jobject instance) {
>         try {
>             if (!streaming)
>                 return;
>
>             streaming = false;
>             if (frame_thread.joinable())
>                 frame_thread.join();
>         }
>         catch (const std::exception& ex)
>         {
>             jclass jcls = env->FindClass("java/lang/Exception");
>             env->ThrowNew(jcls, ex.what());
>         }
>     }

This might work if we had a JNI C layer, but the native calls are handled by JNA and from what I can find it does not allow catching SEGV signals.

MartyG-RealSense commented 1 year ago

In general, Ubuntu port reset scripts have been used by RealSense users to reset the camera in situations where hardware_reset was not a suitable solution in their particular project. Only a very small number have used it though, with the majority opting for hardware_reset instead.

Using a USB reset script is also discussed at https://github.com/IntelRealSense/librealsense/issues/8274#issuecomment-770196280

An alternative to using rs2_config_resolve may be to instead use rs2_pipeline_get_active_profile to determine whether the pipeline is active. It returns a valid result only when the pipeline is active - between calls to start() and stop(). If the camera is reset with hardware_reset after the pipeline has been started then the pipeline should continue to be open so long as the camera reconnects within 5 seconds of disconnection.

https://intelrealsense.github.io/librealsense/doxygen/rs__pipeline_8h.html#a62dbdede39a1a1b1ce057022af33b7be


As you installed librealsense from packages, there should not be the timing issues that can occur if the SDK is built from source code with the RSUSB Backend.


As far as I am aware, temperature does not have a bearing on the time taken to complete a reset.

The camera should operate normally so long as its internal temperature in degrees C is above 0 and less than a maximum of 42. Whilst the camera can operate with an internal temperature greater than 42 degrees C, problems may begin to manifest and above 60 degrees C, the firmware driver's laser safety mechanism will shut the laser off. The laser will also be shut off if the internal temperature falls below zero.
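As a rough illustration of the temperature ranges described above, the thresholds could be encoded as below. This is only a sketch: `ThermalState` and `classify_temperature` are hypothetical names, not SDK types, and the treatment of the exact boundary values (0, 42 and 60) is an assumption because the description does not specify them.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical categories for illustration only; not part of librealsense.
enum class ThermalState { LaserOffCold, Normal, Degraded, LaserOffHot };

// Maps an internal temperature in degrees C to the ranges described above.
// Assumption: exactly 0 counts as Normal, exactly 42 and exactly 60 as Degraded.
ThermalState classify_temperature(double celsius) {
    assert(!std::isnan(celsius));
    if (celsius < 0.0)   return ThermalState::LaserOffCold;  // laser shut off below zero
    if (celsius < 42.0)  return ThermalState::Normal;        // normal operating range
    if (celsius <= 60.0) return ThermalState::Degraded;      // problems may begin to manifest
    return ThermalState::LaserOffHot;                        // laser safety shut-off above 60
}
```

The value passed in would have to come from whatever temperature reading the application already has available.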

jomiham commented 1 year ago

An alternative to using rs2_config_resolve may be to instead use rs2_pipeline_get_active_profile to determine whether the pipeline is active. It returns a valid result only when the pipeline is active - between calls to start() and stop().

In my case, I cannot make any assumptions on the camera's state when starting up (which is why we reset it).

If the camera is reset with hardware_reset after the pipeline has been started then the pipeline should continue to be open so long as the camera reconnects within 5 seconds of disconnection.

Where is the pipeline state preserved if the camera is reset? If the program (that uses the SDK to communicate with the camera) is restarted and then resets the camera, is there some other state in kernel drivers or similar that I should be aware of?

MartyG-RealSense commented 1 year ago

A pipeline has frames delivered to it by the source that is providing them (such as a live camera or a pre-recorded bag file). If no new frames are received for 5 seconds then the pipeline times out, typically crashing the program with the error message RuntimeError: Frame didn't arrive within 5000.

Once a pipeline has been started, it stays open while it waits for the next frame, even if the camera temporarily stops delivering frames, but it may close once 5 seconds have passed without a frame arriving. If new frames arrive within 5 seconds of the last received frame (for example, because the camera has reconnected), the pipeline continues from the point at which the last frame was received.
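The timeout rule described above can be modelled with a small self-contained sketch. `pipeline_survives` is a hypothetical helper, not an SDK function; it only checks a list of frame arrival times against the 5000 ms figure quoted in the error message, and it assumes that a gap of exactly 5000 ms counts as a timeout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: given frame arrival times in milliseconds (ascending),
// report whether every gap between consecutive frames stays under timeout_ms.
bool pipeline_survives(const std::vector<long long>& arrival_ms,
                       long long timeout_ms = 5000) {
    assert(timeout_ms > 0);
    for (std::size_t i = 1; i < arrival_ms.size(); ++i) {
        // Assumption: a gap exactly equal to the timeout is treated as a timeout.
        if (arrival_ms[i] - arrival_ms[i - 1] >= timeout_ms)
            return false;
    }
    return true;
}
```

For example, a camera that drops off the bus for 3 seconds and then resumes would pass this check, while one that takes 6 seconds to come back would not.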

MartyG-RealSense commented 1 year ago

Hi @jomiham Do you have an update about this case that you can provide, please? Thanks!

jomiham commented 1 year ago

Hi @MartyG-RealSense No real update unfortunately. We are trying a longer wait time after the reset as a workaround but still hoping for a fix to the underlying SIGSEGV issue.
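One way to structure a longer wait is to poll against a deadline instead of sleeping for a single fixed period. The sketch below is generic C++ and makes no librealsense calls: `wait_until_ready` and its parameters are hypothetical, and what `probe` actually checks is left open. It would not prevent a crash that happens inside a native call made by the probe itself.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: waits `settle` first, then calls `probe` every
// `interval` until it returns true or `timeout` (measured from entry) elapses.
bool wait_until_ready(const std::function<bool()>& probe,
                      std::chrono::milliseconds settle,
                      std::chrono::milliseconds interval,
                      std::chrono::milliseconds timeout) {
    assert(interval.count() > 0);
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + timeout;

    // Give the device a minimum time to re-enumerate before the first probe.
    std::this_thread::sleep_for(settle);

    while (true) {
        if (probe())
            return true;
        // Stop if another sleep would take us past the deadline.
        if (clock::now() + interval > deadline)
            return false;
        std::this_thread::sleep_for(interval);
    }
}
```

The settle period plays the role of the fixed wait after the reset, and the deadline bounds how long startup can block if the camera never comes back.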

MartyG-RealSense commented 1 year ago

In regard to the problem seeming to manifest after upgrading from 2.30.0, there was a case at https://github.com/IntelRealSense/librealsense/issues/8154#issuecomment-764980206 where a RealSense user upgraded to SDK version 2.41.0 and was getting segfaults whenever they stopped the stream, whilst the problem did not occur in version 2.40.0.

MartyG-RealSense commented 1 year ago

Hi @jomiham Do you require further assistance with this case, please? Thanks!

jomiham commented 1 year ago

No, I think we have to settle for the "long wait" workaround until the bug is fixed. Thank you @MartyG-RealSense!

Will this issue be updated when/if the underlying bug (which causes a SIGSEGV instead of returning an error) is fixed?

MartyG-RealSense commented 1 year ago

It appears that you are the only RealSense user who has reported using the rs2_config_resolve() instruction during the history of this support forum. A bug-fix would therefore likely be considered to be a low development priority unfortunately, especially as you are also using Java (a language that the RealSense SDK does not officially support with a wrapper).

MartyG-RealSense commented 1 year ago

Case closed due to no further assistance required at this time.