chirpstack / chirpstack-concentratord

Concentrator HAL daemon for LoRa gateways.
https://www.chirpstack.io/
MIT License
70 stars 50 forks source link

Chripstack-concentratord-sx1302 v4.3.0 crashed with Segmentation fault #107

Closed LouneCode closed 9 months ago

LouneCode commented 9 months ago

Chripstack-concentratord-sx1302 v4.3.0 crashed with segmentation fault when using fine timestamping.

Test environment

The Chripstack-concentratord-sx1302 module is compiled according to the instructions in this repository.

Concentratord crashed randomly ...

Consentratord crashed randomly with SIGSEGV signal Segmentation fault. It seems that segmentation fault occurs (only?) if the Fine Stamping is enabled.

Fine timestamping is enabled in a concentratord configuration file as follows In the test case:

sx1303.toml
...

  # Fine_timestamp
  # enable
  #   true
  #   false
  # mode
  #   "ALL_SF"
  #   "HIGH_CAPACITY"
[gateway.fine_timestamp]
  enable=true
  mode="ALL_SF"
...

Concentratord start command: ./chirpstack-concentratord-sx1302 -c sx1303.toml

Concentratord runs for a random amount of the time and crashes with a segmentation fault error. The log does not show the reason or any information about the crash. Only hint is that segmentation fault has been raised. Not good.

I've investigated and debugged chirpstack-consentratord-sx1302 code and found that the problem lies behind the Rust FFI in the customized c code of sx1302_hal layer.

After a long run I got a full back trace log with the gdb debugger.

#0  0x00007ffff772224a in compare_pkt_tmst ()
No symbol table info available.
#1  0x00007ffff7c01d0f in sift () at ../src_musl/src/stdlib/qsort.c:103
No locals.
#2  0x00007ffff7c01eb1 in trinkle () at ../src_musl/src/stdlib/qsort.c:154
No locals.
#3  0x00007ffff7c02082 in __qsort_r () at ../src_musl/src/stdlib/qsort.c:200
No locals.
#4  0x00007ffff7723d2a in lgw_receive ()
No symbol table info available.
#5  0x00007ffff771c5e7 in libloragw_sx1302::hal::receive () at libloragw-sx1302/src/hal.rs:932
        _guard = std::sync::mutex::MutexGuard<()> {lock: 0x7ffff7ff8c88 <<libloragw_sx1302::mutex::CONCENTATOR as core::ops::deref::Deref>::deref::__stability::LAZY+4>, poison: std::sync::poison::Guard {panicking: false}}
        packets = [
          libloragw_sx1302::wrapper::lgw_pkt_rx_s {freq_hz: 868500000, freq_offset: 854, if_chain: 2, status: 16, count_us: 254499580, rf_chain: 1, modem_id: 9, modulation: 16, bandwidth: 4, datarate: 7, coderate: 1, rssic: -40.3999939, rssis: -41.3999939, snr: 9.5, snr_min: 0, snr_max: 0, crc: 57943, size: 23, payload: [64, 38, 51, 248, 0, 128, 148, 143, 3, 200, 183, 189, 141, 158, 16, 176, 97, 176, 164, 99, 92, 183, 23, 0 <repeats 233 times>], ftime_received: true, ftime: 877986163},
          libloragw_sx1302::wrapper::lgw_pkt_rx_s {freq_hz: 868100000, freq_offset: -24138, if_chain: 0, status: 17, count_us: 254499587, rf_chain: 1, modem_id: 11, modulation: 16, bandwidth: 4, datarate: 7, coderate: 1, rssic: -96.3999939, rssis: -100.399994, snr: -7.25, snr_min: 0, snr_max: 0, crc: 57951, size: 23, payload: [112, 38, 55, 232, 2, 12, 148, 143, 195, 240, 183, 190, 193, 158, 17, 118, 97, 128, 164, 163, 124, 183, 177, 0 <repeats 233 times>], ftime_received: false, ftime: 0},
          libloragw_sx1302::wrapper::lgw_pkt_rx_s {freq_hz: 868100000, freq_offset: -24138, if_chain: 0, status: 17, count_us: 254499588, rf_chain: 1, modem_id: 10, modulation: 16, bandwidth: 4, datarate: 7, coderate: 1, rssic: -96.3999939, rssis: -100.399994, snr: -7.5, snr_min: 0, snr_max: 0, crc: 57942, size: 23, payload: [112, 68, 183, 248, 3, 8, 144, 143, 195, 112, 183, 238, 4, 158, 19, 118, 97, 128, 164, 163, 92, 183, 129, 0 <repeats 233 times>], ftime_received: false, ftime: 0},
          libloragw_sx1302::wrapper::lgw_pkt_rx_s {freq_hz: 868500000, freq_offset: 854, if_chain: 2, status: 16, count_us: 254499580, rf_chain: 1, modem_id: 8, modulation: 16, bandwidth: 4, datarate: 7, coderate: 1, rssic: -40.3999939, rssis: -41.3999939, snr: 11.5, snr_min: 0, snr_max: 0, crc: 57943, size: 23, payload: [64, 38, 51, 248, 0, 128, 148, 143, 3, 200, 183, 189, 141, 158, 16, 176, 97, 176, 164, 99, 92, 183, 23, 0 <repeats 233 times>], ftime_received: false, ftime: 0},
          libloragw_sx1302::wrapper::lgw_pkt_rx_s {freq_hz: 0, freq_offset: 0, if_chain: 0, status: 0, count_us: 0, rf_chain: 0, modem_id: 0, modulation: 0, bandwidth: 0, datarate: 0, coderate: 0, rssic: 0, rssis: 0, snr: 0, snr_min: 0, snr_max: 0, crc: 0, size: 0, payload: [
              0 <repeats 256 times>], ftime_received: false, ftime: 0},
          libloragw_sx1302::wrapper::lgw_pkt_rx_s {freq_hz: 0, freq_offset: 0, if_chain: 0, status: 0, count_us: 0, rf_chain: 0, modem_id: 0, modulation: 0, bandwidth: 0, datarate: 0, coderate: 0, rssic: 0, rssis: 0, snr: 0, snr_min: 0, snr_max: 0, crc: 0, size: 0, payload: [
              0 <repeats 256 times>], ftime_received: false, ftime: 0},
          libloragw_sx1302::wrapper::lgw_pkt_rx_s {freq_hz: 0, freq_offset: 0, if_chain: 0, status: 0, count_us: 0, rf_chain: 0, modem_id: 0, modulation: 0, bandwidth: 0, datarate: 0, coderate: 0, rssic: 0, rssis: 0, snr: 0, snr_min: 0, snr_max: 0, crc: 0, size: 0, payload: [
              0 <repeats 256 times>], ftime_received: false, ftime: 0},
          libloragw_sx1302::wrapper::lgw_pkt_rx_s {freq_hz: 0, freq_offset: 0, if_chain: 0, status: 0, count_us: 0, rf_chain: 0, modem_id: 0, modulation: 0, bandwidth: 0, datarate: 0, coderate: 0, rssic: 0, rssis: 0, snr: 0, snr_min: 0, snr_max: 0, crc: 0, size: 0, payload: [
              0 <repeats 256 times>], ftime_received: false, ftime: 0}]
#6  0x00007ffff75c3e45 in chirpstack_concentratord_sx1302::handler::uplink::handle_loop (gateway_id=..., stop_receive=..., disable_crc_filter=false, time_fallback=false) at chirpstack-concentratord-sx1302/src/handler/uplink.rs:25
No locals.
#7  0x00007ffff760e091 in chirpstack_concentratord_sx1302::cmd::root::run::{closure#0} () at chirpstack-concentratord-sx1302/src/cmd/root.rs:76
        gateway_id = <optimized out>
        stop_receive = <optimized out>
        disable_crc_filter = <optimized out>
        time_fallback = <optimized out>
#8  0x00007ffff75bb6c9 in std::sys_common::backtrace::__rust_begin_short_backtrace<chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0}, ()> (f=...) at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/sys_common/backtrace.rs:135
No locals.
#9  0x00007ffff75ac151 in std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure#0}<chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0}, ()> () at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/thread/mod.rs:529
        f = <optimized out>
#10 0x00007ffff75a3251 in core::panic::unwind_safe::{impl#23}::call_once<(), std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0}, ()>> (self=...)
    at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panic/unwind_safe.rs:271
        _args = ()
#11 0x00007ffff760bdfa in std::panicking::try::do_call<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0}, ()>>, ()> (data=0x7ffff72a17c0)
    at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:500
        f = core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0}, ()>> (std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0}, ()> {f: chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0} {gateway_id: [0, 22, 192, 1, 255, 31, 15,
                27], stop_receive: std::sync::mpsc::Receiver<libconcentratord::signals::Signal> {inner: std::sync::mpmc::Receiver<libconcentratord::signals::Signal> {flavor: std::sync::mpmc::ReceiverFlavor<libconcentratord::signals::Signal>::List(std::sync::mpmc::counter::Receiver<std::sync::mpmc::list::Channel<libconcentratord::signals::Signal>> {counter: 0x7ffff7e61080})}}, disable_crc_filter: false, time_fallback: false}})
        data = 0x7ffff72a17c0
        data = 0x7ffff72a17c0
#12 0x00007ffff760c43b in __rust_try ()
No symbol table info available.
#13 0x00007ffff760bc41 in std::panicking::try<(), core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0}, ()>>> (f=...)
    at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:464
        data_ptr = 0x7ffff72a17c0
        data = std::panicking::try::Data<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0}, ()>>, ()> {f: core::mem::manually_drop::ManuallyDrop<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0}, ()>>> {value: core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0}, ()>> (std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0}, ()> {f: chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0} {gateway_id: [0, 22, 192, 1, 255, 31, 15,
                    27], stop_receive: std::sync::mpsc::Receiver<libconcentratord::signals::Signal> {inner: std::sync::mpmc::Receiver<libconcentratord::signals::Signal> {flavor: std::sync::mpmc::ReceiverFlavor<libconcentratord::signals::Signal>::List(std::sync::mpmc::counter::Receiver<std::sync::mpmc::list::Channel<libconcentratord::signals::Signal>> {counter: 0x7ffff7e61080})}}, disable_crc_filter: false, time_fallback: false}})}, r: core::mem::manually_drop::ManuallyDrop<()> {value: ()}, p: core::mem::manually_drop::ManuallyDrop<alloc::boxed::Box<(dyn core::any::Any + core::marker::Send), alloc::alloc::Global>> {value: alloc::boxed::Box<(dyn core::any::Any + core::marker::Send), alloc::alloc::Global> {pointer: 0x1, vtable: 0x7ffff7e61080}}}
#14 0x00007ffff75aa78a in std::panic::catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0}, ()>>, ()> (f=...)
    at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panic.rs:142
No locals.
--Type <RET> for more, q to quit, c to continue without paging--
#15 std::thread::{impl#0}::spawn_unchecked_::{closure#1}<chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0}, ()> () at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/thread/mod.rs:528
        f = chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0} {gateway_id: [0, 22, 192, 1, 255, 31, 15,
            27], stop_receive: std::sync::mpsc::Receiver<libconcentratord::signals::Signal> {inner: std::sync::mpmc::Receiver<libconcentratord::signals::Signal> {flavor: std::sync::mpmc::ReceiverFlavor<libconcentratord::signals::Signal>::List(std::sync::mpmc::counter::Receiver<std::sync::mpmc::list::Channel<libconcentratord::signals::Signal>> {counter: 0x7ffff7e61080})}}, disable_crc_filter: false, time_fallback: false}
        their_packet = alloc::sync::Arc<std::thread::Packet<()>> {ptr: core::ptr::non_null::NonNull<alloc::sync::ArcInner<std::thread::Packet<()>>> {pointer: 0x7ffff7e62a20}, phantom: core::marker::PhantomData<alloc::sync::ArcInner<std::thread::Packet<()>>>}
        f = std::thread::{impl#0}::spawn_unchecked_::MaybeDangling<chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0}> (core::mem::maybe_uninit::MaybeUninit<chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0}> {uninit: (), value: core::mem::manually_drop::ManuallyDrop<chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0}> {value: chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0} {gateway_id: [0, 22, 192, 1, 255, 31, 15,
                  27], stop_receive: std::sync::mpsc::Receiver<libconcentratord::signals::Signal> {inner: std::sync::mpmc::Receiver<libconcentratord::signals::Signal> {flavor: std::sync::mpmc::ReceiverFlavor<libconcentratord::signals::Signal>::List(std::sync::mpmc::counter::Receiver<std::sync::mpmc::list::Channel<libconcentratord::signals::Signal>> {counter: 0x7ffff7e61080})}}, disable_crc_filter: false, time_fallback: false}}})
        output_capture = core::option::Option<alloc::sync::Arc<std::sync::mutex::Mutex<alloc::vec::Vec<u8, alloc::alloc::Global>>>>::None
        their_thread = std::thread::Thread {inner: core::pin::Pin<alloc::sync::Arc<std::thread::Inner>> {pointer: alloc::sync::Arc<std::thread::Inner> {ptr: core::ptr::non_null::NonNull<alloc::sync::ArcInner<std::thread::Inner>> {pointer: 0x7ffff7ec13b0}, phantom: core::marker::PhantomData<alloc::sync::ArcInner<std::thread::Inner>>}}}
#16 0x00007ffff75bd54e in core::ops::function::FnOnce::call_once<std::thread::{impl#0}::spawn_unchecked_::{closure_env#1}<chirpstack_concentratord_sx1302::cmd::root::run::{closure_env#0}, ()>, ()> ()
    at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/ops/function.rs:250
No locals.
#17 0x00007ffff7bc3bb5 in alloc::boxed::{impl#47}::call_once<(), dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global> () at library/alloc/src/boxed.rs:1993
No locals.
#18 alloc::boxed::{impl#47}::call_once<(), alloc::boxed::Box<dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global>, alloc::alloc::Global> () at library/alloc/src/boxed.rs:1993
No locals.
#19 std::sys::unix::thread::{impl#2}::new::thread_start () at library/std/src/sys/unix/thread.rs:108
No locals.
#20 0x00007ffff7bf99bd in start () at ../src_musl/src/thread/pthread_create.c:203
No locals.
#21 0x00007ffff7bfb127 in __clone () at ../src_musl/src/thread/x86_64/clone.s:22
No locals.
Backtrace stopped: frame did not save the PC

Here is a simplified call stack and now we can see where the problem lies.

libloragw-sx1302/src/hal.rs:932
  -> libloragw/src/loragw_hal.c lgw_receive()      
    -> libloragw/src/loragw_hal.c merge_packets()
      -> libloragw/src/loragw_hal.c qsort()
        -> libloragw/src/loragw_hal.c compare_pkt_tmst() -> Thread 9 "chirpstack-conc" received signal SIGSEGV, Segmentation fault.

Concentratord Segmentation fault caused by compare_pkt_tmst() c code function in customized sx1302_hal module. This module was changed last in 9aef4ac commit which updates qsort() functionality in merge_packets() function (libloragw/src/loragw_hal.c).

I think that after 9aef4ac commit the comparator function compare_pkt_tmst() of qsort not work as it should be and cause the segmentation fault (at least in this case). Error can due to wrong argument list in comparator function compare_pkt_tmst().

Segmentation fault happens when LoRa radio receives many packets in a same time. In code, the information of received packets are sorted and duplicate packets are removed.

Issue can be fixed changing code of loragw_hal.c:

Before fix:


int compare_pkt_tmst(const void *a, const void *b, void *arg)
{
    struct lgw_pkt_rx_s *p = (struct lgw_pkt_rx_s *)a;
    struct lgw_pkt_rx_s *q = (struct lgw_pkt_rx_s *)b;
    int *counter = (int *)arg;
    int p_count, q_count;

    p_count = p->count_us;
    q_count = q->count_us;

    if (p_count > q_count) {
        *counter = *counter + 1;
    }

    return (p_count - q_count);
}

/* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ */

After fix:

/* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ */

int compare_pkt_tmst(const void *a, const void *b)
{
    int p_count, q_count;

    p_count = ((struct lgw_pkt_rx_s *)a)->count_us;
    q_count = ((struct lgw_pkt_rx_s *)b)->count_us;

    return (p_count - q_count);
}

/* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ */

     

Let's do IoT better

LouneCode commented 9 months ago

Another hint

Also sx1302_hal/tree/master/packet_forwarder/src /jitqueue.c has a qsort_r() functionality, similar to loragw_hal.c. If there was a special reason to optimize the old sorting method with qsort() . . .

brocaar commented 9 months ago

Thank you so much @LouneCode for debugging this issue in so much detail! The reason why this was changed is because qsort_r is not available in musl. I took these changes from an other patch, but it seems I took only part of the changes.

I will fix this with the proposed changes. If you would like, I'm also open for a pull-request.

LouneCode commented 9 months ago

Good that things are progressing....;)
Please, open the pull-request.
FYI now after that fix above, the chirpstack-consentratord-sx1302 process rock like a charm. No more segmentation fault errors.
You have done a good job. Thanks.

     

Let's do IoT better

brocaar commented 9 months ago

@LouneCode this commit contains the proposed changes: https://github.com/brocaar/sx1302_hal/commit/c3d99009556fdfe273c3a53306082ef181333c7a. Could you do a final check to make sure this is what you were proposing?

LouneCode commented 9 months ago

Thanks for the quick response. Sure, I check it out...
Ok, I have walked thru commits on both repositories. Changes are correct.

I also compiled the latest version of the code of this repository (make build) and did a test run on the chirpstack-concentratord-sx1302 process: Everything seems fine.       Let's do IoT better

brocaar commented 9 months ago

Thanks again for your help @LouneCode :+1: