Open Ncerzzk opened 2 years ago
tested in another board(orangepi lite, 4 cortex-A7), using the same sensors:
using gdb to attach the block thread, it seems the thread is traped in the function, especially in while loop of 'else' branch.
static void
hrt_call_enter(struct hrt_call *entry)
{
struct hrt_call *call, *next;
call = (struct hrt_call *)sq_peek(&callout_queue);
if ((call == nullptr) || (entry->deadline < call->deadline)) {
sq_addfirst(&entry->link, &callout_queue);
//if (call != nullptr) PX4_INFO("call enter at head, reschedule (%lu %lu)", entry->deadline, call->deadline);
/* we changed the next deadline, reschedule the timer event */
hrt_call_reschedule();
} else {
do {
next = (struct hrt_call *)sq_next(&call->link);
if ((next == nullptr) || (entry->deadline < next->deadline)) {
//lldbg("call enter after head\n");
sq_addafter(&call->link, &entry->link, &callout_queue);
break;
}
} while ((call = next) != nullptr);
}
}
some variables and call stack info:
(gdb) p *next
$17 = {
link = {
flink = 0xb52006a8
},
deadline = 11051608128,
period = 2500,
callout = 0x18a2a9 <px4::ScheduledWorkItem::schedule_trampoline(void*)>,
arg = 0xb5200680
}
(gdb) p callout_queue
$18 = {
head = 0xb52006a8,
tail = 0xc00540
}
(gdb) p *call
$21 = {
link = {
flink = 0xb52006a8
},
deadline = 11051608128,
period = 2500,
callout = 0x18a2a9 <px4::ScheduledWorkItem::schedule_trampoline(void*)>,
arg = 0xb5200680
}
(gdb) p *entry
$22 = {
link = {
flink = 0xb52006a8
},
deadline = 11051608128,
period = 2500,
callout = 0x18a2a9 <px4::ScheduledWorkItem::schedule_trampoline(void*)>,
arg = 0xb5200680
}
(gdb) bt
#0 hrt_call_enter (entry=0xb52006a8) at /home/ncer/PX4-Autopilot/platforms/posix/src/px4/common/drv_hrt.cpp:268
#1 0x0014ae94 in hrt_call_invoke () at /home/ncer/PX4-Autopilot/platforms/posix/src/px4/common/drv_hrt.cpp:480
#2 hrt_tim_isr (p=<optimized out>) at /home/ncer/PX4-Autopilot/platforms/posix/src/px4/common/drv_hrt.cpp:292
#3 0x0014cc68 in hrt_work_process () at /home/ncer/PX4-Autopilot/platforms/common/work_queue/hrt_thread.c:181
#4 work_hrtthread (argc=<optimized out>, argv=<optimized out>) at /home/ncer/PX4-Autopilot/platforms/common/work_queue/hrt_thread.c:264
#5 0x0014a612 in entry_adapter (ptr=0x2b2848 <_hrt_work>) at /home/ncer/PX4-Autopilot/platforms/posix/src/px4/common/tasks.cpp:98
#6 0xb6eeb9d2 in start_thread (arg=0x567efa1b) at pthread_create.c:477
#7 0xb6c99d4c in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6
It seems that due to some reasons there is a cycle in the list(node->flink==node
) , so that it could not execute out of the loop.
I'm trying to find how the cycle appeared.
I add some debug info in this func: hrt_call_enter
static void
hrt_call_enter(struct hrt_call *entry)
{
...
if ((next == nullptr) || (entry->deadline < next->deadline)) {
//lldbg("call enter after head\n");
if(&call->link == &entry->link){
PX4_ERR("Warning: cycle apear!");
}
sq_addafter(&call->link, &entry->link, &callout_queue);
break;
}
} while ((call = next) != nullptr);
}
}
and try some ways to reproduce the problem more quickly:
execute: sudo i2cdetect -y -a 0
( to cause the jitter of schedule)
then It happened:
So now I can confirm that there is a cycle in the hrt schedule list. But how the cycle apear?
static void
hrt_call_invoke()
{
struct hrt_call *call;
hrt_abstime deadline;
hrt_lock();
while (true) {
/* get the current time */
hrt_abstime now = hrt_absolute_time();
call = (struct hrt_call *)sq_peek(&callout_queue);
if (call == nullptr) {
break;
}
if (call->deadline > now) {
break;
}
sq_rem(&call->link, &callout_queue);
deadline = call->deadline;
call->deadline = 0;
if (call->callout) {
hrt_unlock();
/* HERE, if the callout take a bit longer time
then there may be chances that other threads add the same node at the head of list by calling
hrt_call_internal() */
call->callout(call->arg);
hrt_lock();
}
if (call->period != 0) {
if (call->deadline <= now) {
call->deadline = deadline + call->period;
}
hrt_call_enter(call);
}
}
hrt_unlock();
}
To resolve this issue, the workaround may be adding a checker in hrt_call_enter() before sq_addXXXX. But I'm not familiar with the schedule module of PX4, maybe there are some other problemes. Hi, @davids5 @dagar . Do you have any advice?
Thanks for the detailed information, I'm reviewing this now.
@Ncerzzk So far I haven't been able to reproduce this. I don't see your commit (7ab4cbf5e48d48) on github, what's the closest PX4 version?
When you reproduce in the debugger could you track down the caller in PX4 that made the schedule?
Could you try https://github.com/PX4/PX4-Autopilot/pull/19464?
sorry that the commit https://github.com/PX4/PX4-Autopilot/commit/7ab4cbf5e48d48bc3ec2de3179e6fb50ffc3505a is used for debuger so I did'nt push to github. The closest PX4 commit is 63189067f64426797b when I find the bug.
I would take your patch and try it.
Could you try #19464?
Hi, @dagar , I have tried this patch on my V3s board(with sensor icm20602,spl06,qmc5883l) and Orangepi Lite PC board(with sensor icm20602, spl06) , the problem seems to still exist.
( the orangepi board is a bit difficult to reproduce, so I use a simple script:
i=1
while [ $i -le 100 ]
do
i2cdump -y -a 1 0x76 w
let i++
done
The debug branch is here: https://github.com/Ncerzzk/PX4-Autopilot/tree/posix_hrt_sem_wait_loop_with_my_own_debug_patch
I would run gdb on orangepi and add the debug info later.
other threads are all trap in syscall, may be waiting for the lock.
call statck:
(gdb) bt
#0 0x0018fa02 in hrt_call_enter (entry=0xb5300848) at /home/ncer/sata/px4-master/platforms/posix/src/px4/common/drv_hrt.cpp:269
#1 0x0018fbec in hrt_call_invoke () at /home/ncer/sata/px4-master/platforms/posix/src/px4/common/drv_hrt.cpp:484
#2 hrt_tim_isr (p=<optimized out>) at /home/ncer/sata/px4-master/platforms/posix/src/px4/common/drv_hrt.cpp:296
#3 0x00192cc8 in hrt_work_process () at /home/ncer/sata/px4-master/platforms/common/work_queue/hrt_thread.c:182
#4 work_hrtthread (argc=<optimized out>, argv=<optimized out>) at /home/ncer/sata/px4-master/platforms/common/work_queue/hrt_thread.c:266
#5 0x0018f35e in entry_adapter (ptr=0x2b7f50 <_hrt_work>) at /home/ncer/sata/px4-master/platforms/posix/src/px4/common/tasks.cpp:98
#6 0xb6f319d2 in start_thread (arg=0x465b6588) at pthread_create.c:477
#7 0xb6cdfd4c in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6
and the content of drv_hrt.cpp:269 is:
do {
next = (struct hrt_call *)sq_next(&call->link);
if ((next == nullptr) || (entry->deadline < next->deadline)) {
if (&call->link == &entry->link) {
PX4_ERR("Warning: cycle apear!");
}
//lldbg("call enter after head\n");
sq_addafter(&call->link, &entry->link, &callout_queue);
break;
}
} while ((call = next) != nullptr);
I'll try to reproduce on something similar like a Raspberry Pi with Navio2.
HI, @dagar
I found that there are 2 risk points in drv_hrt.cpp
which would cause that one hrt_call entry
would apear twice in the callout
queue.
first is the queue remove checker in hrt_call_internal
, there are chances that the entry didn't be removed from queue because of its deadline
may be set to 0 in hrt_call_invoke
: call->deadline = 0;
static void
hrt_call_internal(struct hrt_call *entry, hrt_abstime deadline, hrt_abstime interval, hrt_callout callout, void *arg)
{
PX4_DEBUG("hrt_call_internal deadline=%lu interval = %lu", deadline, interval);
hrt_lock();
//PX4_INFO("hrt_call_internal after lock");
/* if the entry is currently queued, remove it */
/* note that we are using a potentially uninitialised
entry->link here, but it is safe as sq_rem() doesn't
dereference the passed node unless it is found in the
list. So we potentially waste a bit of time searching the
queue for the uninitialised entry->link but we don't do
anything actually unsafe.
*/
// debug info
if (entry->period == 2500) {
if (entry->deadline == 0) {
PX4_ERR("error! deadline==0! 2500_cnt:%d", get_2500_cnt());
//while(1);
}
}
if (entry->deadline != 0) {
sq_rem(&entry->link, &callout_queue);
}
// debug info
if (entry->period == 2500) {
if (entry->deadline == 0) {
PX4_ERR("error! deadline==0! 2500_cnt:%d", get_2500_cnt());
//while(1);
}
}
#if 1
// Use this to debug busy CPU that keeps rescheduling with 0 period time
/*if (interval < HRT_INTERVAL_MIN) {*/
/*PX4_ERR("hrt_call_internal interval too short: %" PRIu64, interval);*/
/*abort();*/
/*}*/
#endif
entry->deadline = deadline;
entry->period = interval;
entry->callout = callout;
entry->arg = arg;
hrt_call_enter(entry);
hrt_unlock();
}
second is in hrt_call_invoke()
, we removed the entry before we called its callout
, while because we unlock during the callout
executing, other threads may insert this entry again. This case is easy to explain in multi cores scenes,but in single core, I don't know how to explain it.
static void
hrt_call_invoke()
{
struct hrt_call *call;
hrt_abstime deadline;
hrt_lock();
while (true) {
/* get the current time */
hrt_abstime now = hrt_absolute_time();
call = (struct hrt_call *)sq_peek(&callout_queue);
if (call == nullptr) {
break;
}
if (call->deadline > now) {
break;
}
sq_rem(&call->link, &callout_queue);
//PX4_INFO("call pop");
/* save the intended deadline for periodic calls */
deadline = call->deadline;
/* zero the deadline, as the call has occurred */
call->deadline = 0;
/* invoke the callout (if there is one) */
if (call->callout) {
// Unlock so we don't deadlock in callback
hrt_unlock();
//PX4_INFO("call %p: %p(%p)", call, call->callout, call->arg);
call->callout(call->arg);
hrt_lock();
//debug info
if (call->period == 2500 && get_2500_cnt() != 0) {
PX4_ERR("count!=0!");
}
}
/* if the callout has a non-zero period, it has to be re-entered */
if (call->period != 0) {
// re-check call->deadline to allow for
// callouts to re-schedule themselves
// using hrt_call_delay()
if (call->deadline <= now) {
call->deadline = deadline + call->period;
//PX4_INFO("call deadline set to %lu now=%lu", call->deadline, now);
}
hrt_call_enter(call);
}
}
hrt_unlock();
}
I add a lot of debug info in drv_hrt.cpp
, because of which it's very easy to reproduce in my OrangePi.
complete file is here: https://github.com/Ncerzzk/PX4-Autopilot/blob/ce02e33b965c9a76de5c7005133576fea6259c8b/platforms/posix/src/px4/common/drv_hrt.cpp
I would continue trying to reproduce it more simply
Describe the bug HI, I run PX4 in a arm-cortex A7 core in linux. The chip is Allwinner V3s(single core, 64MB RAM). Kernel version: 4.10/5.2 PX4 version:
While sometimes when I start the px4, some threads would occupy almost 100% CPU , and then other threads would be blocked or only have a little time to execute.
To Reproduce Steps to reproduce the behavior:
./px4 -d -s v3s.config &
Expected behavior A clear and concise description of what you expected to happen.
Log Files and Screenshots For example:
(The thread occupy 100% cpu is not only wkr_hrt, but also wq:SPI0 or logger, but the times of wkr_hrt is the most)
and work_queues:
my config file:
Add screenshots to help explain your problem.
Drone (please complete the following information):
Additional context Add any other context about the problem here.