[linux] occasionally trap in deadloop on weak SOCs

Ncerzzk commented 2 years ago

Describe the bug HI, I run PX4 in a arm-cortex A7 core in linux. The chip is Allwinner V3s(single core, 64MB RAM). Kernel version: 4.10/5.2 PX4 version:

HW arch: PIPX4_V3S
FW git-hash: 7ab4cbf5e48d48bc3ec2de3179e6fb50ffc3505a
FW version: 1.13.0 0 (17629184)
FW git-branch: lichee
OS: Linux
OS version: Release 5.2.0 (84017407)
Build datetime: Mar 16 2022 21:31:56
Build uri: localhost
Build variant: default
Toolchain: GNU GCC, 6.3.1 20170404
PX4GUID: 100449495052303030443030303030303030

While sometimes when I start the px4, some threads would occupy almost 100% CPU ， and then other threads would be blocked or only have a little time to execute.

To Reproduce Steps to reproduce the behavior:

start PX4:./px4 -d -s v3s.config &
sometimes the theads would be blocked. The probability is almost 1/10.

Expected behavior A clear and concise description of what you expected to happen.

Log Files and Screenshots For example:

(The thread occupy 100% cpu is not only wkr_hrt, but also wq:SPI0 or logger, but the times of wkr_hrt is the most)

and work_queues: tt2

my config file:

. px4-alias.sh

if [ -f eeprom/parameters ]
then
        param load
fi
param save eeprom/parameters
param set CBRK_SUPPLY_CHK 894281
param set SYS_AUTOSTART 4001
param set MAV_TYPE 2

# Multi-EKF
param set EKF2_MULTI_IMU 1
param set SENS_IMU_MODE 0

dataman start
load_mon start

spl06 -X -a 0x77 start
qmc5883l -X start
icm20602 -s start

rc_update start
sensors start
commander start
navigator start
ekf2 start
land_detector start multicopter
mc_hover_thrust_estimator start
flight_mode_manager start
mc_pos_control start
mc_att_control start
mc_rate_control start

#mavlink start -x -d /dev/ttyS2

linux_pwm_out start

mavlink boot_complete

Add screenshots to help explain your problem.

Drone (please complete the following information):

Describe the type of drone.
Photo of the IMU / autopilot setup if possible.

Additional context Add any other context about the problem here.

Ncerzzk commented 2 years ago

tested in another board(orangepi lite, 4 cortex-A7), using the same sensors: 20220320114954

Ncerzzk commented 2 years ago

using gdb to attach the block thread, it seems the thread is traped in the function, especially in while loop of 'else' branch.

static void
hrt_call_enter(struct hrt_call *entry)
{
    struct hrt_call *call, *next;

    call = (struct hrt_call *)sq_peek(&callout_queue);

    if ((call == nullptr) || (entry->deadline < call->deadline)) {
        sq_addfirst(&entry->link, &callout_queue);
        //if (call != nullptr) PX4_INFO("call enter at head, reschedule (%lu %lu)", entry->deadline, call->deadline);
        /* we changed the next deadline, reschedule the timer event */
        hrt_call_reschedule();

    } else {
        do {
            next = (struct hrt_call *)sq_next(&call->link);

            if ((next == nullptr) || (entry->deadline < next->deadline)) {
                //lldbg("call enter after head\n");
                sq_addafter(&call->link, &entry->link, &callout_queue);
                break;
            }
        } while ((call = next) != nullptr);
    }
}

some variables and call stack info:

(gdb) p *next
$17 = {
  link = {
    flink = 0xb52006a8
  },
  deadline = 11051608128,
  period = 2500,
  callout = 0x18a2a9 <px4::ScheduledWorkItem::schedule_trampoline(void*)>,
  arg = 0xb5200680
}

(gdb) p callout_queue
$18 = {
  head = 0xb52006a8,
  tail = 0xc00540
}

(gdb) p *call
$21 = {
  link = {
    flink = 0xb52006a8
  },
  deadline = 11051608128,
  period = 2500,
  callout = 0x18a2a9 <px4::ScheduledWorkItem::schedule_trampoline(void*)>,
  arg = 0xb5200680
}

(gdb) p *entry
$22 = {
  link = {
    flink = 0xb52006a8
  },
  deadline = 11051608128,
  period = 2500,
  callout = 0x18a2a9 <px4::ScheduledWorkItem::schedule_trampoline(void*)>,
  arg = 0xb5200680
}

(gdb) bt
#0  hrt_call_enter (entry=0xb52006a8) at /home/ncer/PX4-Autopilot/platforms/posix/src/px4/common/drv_hrt.cpp:268
#1  0x0014ae94 in hrt_call_invoke () at /home/ncer/PX4-Autopilot/platforms/posix/src/px4/common/drv_hrt.cpp:480
#2  hrt_tim_isr (p=<optimized out>) at /home/ncer/PX4-Autopilot/platforms/posix/src/px4/common/drv_hrt.cpp:292
#3  0x0014cc68 in hrt_work_process () at /home/ncer/PX4-Autopilot/platforms/common/work_queue/hrt_thread.c:181
#4  work_hrtthread (argc=<optimized out>, argv=<optimized out>) at /home/ncer/PX4-Autopilot/platforms/common/work_queue/hrt_thread.c:264
#5  0x0014a612 in entry_adapter (ptr=0x2b2848 <_hrt_work>) at /home/ncer/PX4-Autopilot/platforms/posix/src/px4/common/tasks.cpp:98
#6  0xb6eeb9d2 in start_thread (arg=0x567efa1b) at pthread_create.c:477
#7  0xb6c99d4c in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6

It seems that due to some reasons there is a cycle in the list(node->flink==node) , so that it could not execute out of the loop. I'm trying to find how the cycle appeared.

Ncerzzk commented 2 years ago

I add some debug info in this func: hrt_call_enter

static void
hrt_call_enter(struct hrt_call *entry)
{
...
            if ((next == nullptr) || (entry->deadline < next->deadline)) {
                //lldbg("call enter after head\n");
                if(&call->link == &entry->link){
                    PX4_ERR("Warning: cycle apear!");
                }
                sq_addafter(&call->link, &entry->link, &callout_queue);

                break;
            }
        } while ((call = next) != nullptr);
    }
}

and try some ways to reproduce the problem more quickly: execute: sudo i2cdetect -y -a 0 ( to cause the jitter of schedule) then It happened: 20220320172013

Ncerzzk commented 2 years ago

So now I can confirm that there is a cycle in the hrt schedule list. But how the cycle apear?

static void
hrt_call_invoke()
{
    struct hrt_call *call;
    hrt_abstime deadline;

    hrt_lock();

    while (true) {
        /* get the current time */
        hrt_abstime now = hrt_absolute_time();

        call = (struct hrt_call *)sq_peek(&callout_queue);

        if (call == nullptr) {
            break;
        }

        if (call->deadline > now) {
            break;
        }

        sq_rem(&call->link, &callout_queue);

        deadline = call->deadline;
        call->deadline = 0;

        if (call->callout) {

            hrt_unlock();                                               
                          /* HERE, if the callout take a bit longer time
                             then there may be  chances that other threads add the same node at the head of list by calling  
                          hrt_call_internal() */ 

            call->callout(call->arg);

            hrt_lock();
        }

        if (call->period != 0) {
            if (call->deadline <= now) {
                call->deadline = deadline + call->period;
            }

            hrt_call_enter(call);
        }
    }

    hrt_unlock();
}

To resolve this issue, the workaround may be adding a checker in hrt_call_enter() before sq_addXXXX. But I'm not familiar with the schedule module of PX4, maybe there are some other problemes. Hi, @davids5 @dagar . Do you have any advice?

dagar commented 2 years ago

Thanks for the detailed information, I'm reviewing this now.

dagar commented 2 years ago

@Ncerzzk So far I haven't been able to reproduce this. I don't see your commit (7ab4cbf5e48d48) on github, what's the closest PX4 version?

When you reproduce in the debugger could you track down the caller in PX4 that made the schedule?

dagar commented 2 years ago

Could you try https://github.com/PX4/PX4-Autopilot/pull/19464?

Ncerzzk commented 2 years ago

sorry that the commit https://github.com/PX4/PX4-Autopilot/commit/7ab4cbf5e48d48bc3ec2de3179e6fb50ffc3505a is used for debuger so I did'nt push to github. The closest PX4 commit is 63189067f64426797b when I find the bug.

I would take your patch and try it.

Ncerzzk commented 2 years ago

Could you try #19464?

Hi, @dagar , I have tried this patch on my V3s board(with sensor icm20602,spl06,qmc5883l) and Orangepi Lite PC board(with sensor icm20602, spl06) , the problem seems to still exist.

20220410111757

( the orangepi board is a bit difficult to reproduce, so I use a simple script:

i=1
while [ $i -le 100 ]
do
i2cdump -y -a 1 0x76 w
  let i++
done

The debug branch is here: https://github.com/Ncerzzk/PX4-Autopilot/tree/posix_hrt_sem_wait_loop_with_my_own_debug_patch

I would run gdb on orangepi and add the debug info later.

Ncerzzk commented 2 years ago

other threads are all trap in syscall, may be waiting for the lock. 20220410114526

call statck:

(gdb) bt
#0  0x0018fa02 in hrt_call_enter (entry=0xb5300848) at /home/ncer/sata/px4-master/platforms/posix/src/px4/common/drv_hrt.cpp:269
#1  0x0018fbec in hrt_call_invoke () at /home/ncer/sata/px4-master/platforms/posix/src/px4/common/drv_hrt.cpp:484
#2  hrt_tim_isr (p=<optimized out>) at /home/ncer/sata/px4-master/platforms/posix/src/px4/common/drv_hrt.cpp:296
#3  0x00192cc8 in hrt_work_process () at /home/ncer/sata/px4-master/platforms/common/work_queue/hrt_thread.c:182
#4  work_hrtthread (argc=<optimized out>, argv=<optimized out>) at /home/ncer/sata/px4-master/platforms/common/work_queue/hrt_thread.c:266
#5  0x0018f35e in entry_adapter (ptr=0x2b7f50 <_hrt_work>) at /home/ncer/sata/px4-master/platforms/posix/src/px4/common/tasks.cpp:98
#6  0xb6f319d2 in start_thread (arg=0x465b6588) at pthread_create.c:477
#7  0xb6cdfd4c in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6

and the content of drv_hrt.cpp:269 is:

                do {
                        next = (struct hrt_call *)sq_next(&call->link);

                        if ((next == nullptr) || (entry->deadline < next->deadline)) {
                                if (&call->link == &entry->link) {
                                        PX4_ERR("Warning: cycle apear!");
                                }

                                //lldbg("call enter after head\n");
                                sq_addafter(&call->link, &entry->link, &callout_queue);
                                break;
                        }
                } while ((call = next) != nullptr);

dagar commented 2 years ago

I'll try to reproduce on something similar like a Raspberry Pi with Navio2.

Ncerzzk commented 2 years ago

HI, @dagar I found that there are 2 risk points in drv_hrt.cpp which would cause that one hrt_call entry would apear twice in the callout queue.

first is the queue remove checker in hrt_call_internal, there are chances that the entry didn't be removed from queue because of its deadline may be set to 0 in hrt_call_invoke: call->deadline = 0;

static void
hrt_call_internal(struct hrt_call *entry, hrt_abstime deadline, hrt_abstime interval, hrt_callout callout, void *arg)
{
    PX4_DEBUG("hrt_call_internal deadline=%lu interval = %lu", deadline, interval);
    hrt_lock();

    //PX4_INFO("hrt_call_internal after lock");
    /* if the entry is currently queued, remove it */
    /* note that we are using a potentially uninitialised
       entry->link here, but it is safe as sq_rem() doesn't
       dereference the passed node unless it is found in the
       list. So we potentially waste a bit of time searching the
       queue for the uninitialised entry->link but we don't do
       anything actually unsafe.
    */

        // debug info
    if (entry->period == 2500) {
        if (entry->deadline == 0) {
            PX4_ERR("error! deadline==0!  2500_cnt:%d", get_2500_cnt());
            //while(1);
        }
    }

    if (entry->deadline != 0) {
        sq_rem(&entry->link, &callout_queue);
    }

        // debug info
    if (entry->period == 2500) {
        if (entry->deadline == 0) {
            PX4_ERR("error! deadline==0!  2500_cnt:%d", get_2500_cnt());
            //while(1);
        }
    }

#if 1

    // Use this to debug busy CPU that keeps rescheduling with 0 period time
    /*if (interval < HRT_INTERVAL_MIN) {*/
    /*PX4_ERR("hrt_call_internal interval too short: %" PRIu64, interval);*/
    /*abort();*/
    /*}*/

#endif
    entry->deadline = deadline;
    entry->period = interval;
    entry->callout = callout;
    entry->arg = arg;

    hrt_call_enter(entry);
    hrt_unlock();
}

second is in hrt_call_invoke(), we removed the entry before we called its callout, while because we unlock during the callout executing, other threads may insert this entry again. This case is easy to explain in multi cores scenes，but in single core, I don't know how to explain it.

static void
hrt_call_invoke()
{
    struct hrt_call *call;
    hrt_abstime deadline;

    hrt_lock();

    while (true) {
        /* get the current time */
        hrt_abstime now = hrt_absolute_time();

        call = (struct hrt_call *)sq_peek(&callout_queue);

        if (call == nullptr) {
            break;
        }

        if (call->deadline > now) {
            break;
        }

        sq_rem(&call->link, &callout_queue);

        //PX4_INFO("call pop");

        /* save the intended deadline for periodic calls */
        deadline = call->deadline;

        /* zero the deadline, as the call has occurred */
        call->deadline = 0;

        /* invoke the callout (if there is one) */
        if (call->callout) {
            // Unlock so we don't deadlock in callback
            hrt_unlock();

            //PX4_INFO("call %p: %p(%p)", call, call->callout, call->arg);
            call->callout(call->arg);

            hrt_lock(); 

                        //debug info
            if (call->period == 2500 && get_2500_cnt() != 0) {
                PX4_ERR("count!=0!");
            }

        }

        /* if the callout has a non-zero period, it has to be re-entered */
        if (call->period != 0) {
            // re-check call->deadline to allow for
            // callouts to re-schedule themselves
            // using hrt_call_delay()
            if (call->deadline <= now) {
                call->deadline = deadline + call->period;
                //PX4_INFO("call deadline set to %lu now=%lu", call->deadline,  now);
            }

            hrt_call_enter(call);
        }
    }

    hrt_unlock();
}

Ncerzzk commented 2 years ago

I add a lot of debug info in drv_hrt.cpp, because of which it's very easy to reproduce in my OrangePi. complete file is here: https://github.com/Ncerzzk/PX4-Autopilot/blob/ce02e33b965c9a76de5c7005133576fea6259c8b/platforms/posix/src/px4/common/drv_hrt.cpp

I would continue trying to reproduce it more simply

PX4 / PX4-Autopilot

[linux] occasionally trap in deadloop on weak SOCs #19358