apache / nuttx

Apache NuttX is a mature, real-time embedded operating system (RTOS)
https://nuttx.apache.org/
Apache License 2.0
2.62k stars 1.11k forks source link

Low priority tasks interrupt high priority threads #6012

Closed zouboan closed 10 months ago

zouboan commented 2 years ago

By SystemView, I found some puzzling result, low priority FOC main task interrupt high priority FOC control pthread as shown in the figure around 23.920536845s, the problem led to the failure to complete the control of this cycle. My configuration: STM32F405RG CPU freq 168MHz FOC_PWM_FREQ=10000 FOC_NOTIFIER_FREQ=10000

Context:
ISR34 - FOC ADC handler ISR76 - Aux ADC DMA handler (VBUS) ISR15 - SysTick ISR3 - Hard Fault pt-0x801ae29 - FOC control pthread (priority 255) foc task - FOC main task, low priority work (priority 150)

image

patacongo commented 2 years ago

It is difficult to have an opinion if you are unfamiliar with the details of the implementation of the code. But what this looks like to me is that:

There is possibly some setup between the task and the interrupt handler. What triggers signally the task from the interrupt handler? Is if a boolean variable? Is it a counting semaphore? I have seen cases where the interrupt occurs before the boolean request is set so the task is not signaled. Counting semaphores may also require some setup as well. The count must be decremented before it is incremented the next time of the sem_post() has no effect.

Again, I don't know this code, but it appears to me that the handshake between the task and the interrupt handler is not working properly.

raiden00pl commented 2 years ago

I think I know what's going on.

I set breakpoint in arm_svcall, which is triggered when pthread is interrupted. What I found is that pthread is waiting for semaphore in fs/inode/fs_files.c and this switch execution to low priority main.

When I removed nxsem_wait_uninterruptible from _files_semtake pthread is never interrupted.

It goes more or less like this:

  1. low priority main wakes up just before FOC interrupt
  2. main gets fs/inode/fs_files.c semaphore (fs_getfilep -> _files_semtake)
  3. FOC interrupt occurs and wakes up FOC pthread
  4. pthread tries to get fs/inodefs_files.c semaphtore
  5. pthread waits for semaphore
  6. context switch to main
zouboan commented 2 years ago

Pretty cool @raiden00pl , Now the mechanism is completely clear, FOC main interrupted FOC pthread when pthread execute ret = foc_dev_params_set(&dev); and what resulted the execution of FOC over the time that next FOC interrupt arrived, because sval in foc_notifier is zero, the FOC interrupt won't nxsem_post(&dev->statesem) and the control of this cycle is lost until next FOC interrupt. I tried removed nxsem_wait_uninterruptible from _files_semtake , it indeed solve my problem of motor stuck at high speed. But of course, this is not the right way, maybe i should reduce FOC_NOTIFIER_FREQ to 5000 for safty.

raiden00pl commented 2 years ago

The problem is that both main and pthread belong to the same task group and therefore share the same struct filelist tg_filelist in struct task_group_s:

https://github.com/apache/incubator-nuttx/blob/dd70d29d4e919c47951926b34c5dc88785592e6f/include/nuttx/sched.h#L546-L549

There is a chance that an interrupt that wakes up the high priority pthread will occur exactly at the moment the low priority task is holding a semaphore to tg_filelist:

https://github.com/apache/incubator-nuttx/blob/dd70d29d4e919c47951926b34c5dc88785592e6f/fs/inode/fs_files.c#L375-L381

One of the solution can be protecting this code with a critical section. This should prevent interrupts when the semaphore is taken. But I have no idea if it is a correct solution and will there any side effects.

zouboan commented 2 years ago

@raiden00pl critical_section is a good idea, after all, the wakes up delay of high priority pthread caused by the critical_section in low priority task is short than the time interrupted by the low priority task when high priority pthread runing, but it's hard to operate, because the enter_critical_section() should to take effect immediately after return of semaphore was taken, otherwise, it may still be interrupted.

xiaoxiang781216 commented 2 years ago

Another solution is switch to read-write lock, so there is no the contention in the most case except open/close.

zouboan commented 2 years ago

Yes, both nx_vioctl and nxmq_receive just read tg_filelist, so ReadWriteLock is useful, but, does nuttx support read-write locks?

xiaoxiang781216 commented 2 years ago

nuttx implement pthread_rwlock_t, but we can't use pthread api directly inside kernel. It isn't hard to implement rwlock on top of sem_t.

zouboan commented 2 years ago

@xiaoxiang781216 Does Xiaomi has any plans to implement read-write-locks recently? If not, I will try to implement it

xiaoxiang781216 commented 2 years ago

Sorry, no plan yet.

zouboan commented 2 years ago

@xiaoxiang781216 I implemented a read-write lock, based on semaphore: https://github.com/zouboan/incubator-nuttx/blob/fft/sched/semaphore/rwlock.c and apply to fs_files.c https://github.com/zouboan/incubator-nuttx/blob/fft/fs/inode/fs_files.c It seems to be ok tested on hardware, but I don't know if the implementation is a bad implementation, and whether should I push to the master branch or not.

xiaoxiang781216 commented 2 years ago

could you upstream your change? So, the community can review it.

xiaoxiang781216 commented 1 year ago

@zouboan how about you change for rw lock?

pkarashchenko commented 1 year ago

Is enabling priority inheritance a good option here with leaving existing mutex approach?

raiden00pl commented 1 year ago

Priority inheritance has no effect here. This is a classic readers-writers problem where we have many readers with concurrent access to task group data. rwlock is the correct solution here.

pkarashchenko commented 1 year ago

@raiden00pl then I need to more detailed explanation here. I agree that reader-writer problem may be an issue here, but would like to understand all the timing details.

  1. I see that in fs_getfilep the semaphore/mutex is used to protect access to fl_files file list and obtaining the pointer to a file structure seems to be a very fast operation since only a simple math of / and % is used (fd / CONFIG_NFILE_DESCRIPTORS_PER_BLOCK and fd % CONFIG_NFILE_DESCRIPTORS_PER_BLOCK)
  2. As soon as a pointer to file structure is obtained the semaphore/mutex is released so if there is a high priority thread waiting on that semaphore the context switch will occur and control will be transferred to it.
  3. According to all the descriptions from all the above comments do I understand correctly that exactly the computation delay to calculate file index in the list plus a context switch are the key factor that introduce a delay that leads to a failure to complete the control in time?
raiden00pl commented 1 year ago

@pkarashchenko here is the context to this issue: https://lists.apache.org/thread/g7n5hvb2d33mqdf21v5j9hqtcthol9wk

A brief summary of what I remember:

  1. we have 2 threads that share task group data: a. L- low priority and low frequency FOC main task that handle various low priorty works, including sampling bus voltage or input from the user b. H - high priority and high frequency FOC control loop, that controls motor currents; the frequency of execution ranges from a few kHz to >10kHz, depending on the configuration

  2. H is synchronized with a semaphore - read operation from FOC driver is always blocking: https://github.com/apache/nuttx/blob/b00237ac01f3790f8d6f6ce04c3fb62fb9f39f71/drivers/motor/foc/foc_dev.c#L852 https://github.com/apache/nuttx/blob/b00237ac01f3790f8d6f6ce04c3fb62fb9f39f71/drivers/motor/foc/foc_dev.c#L656

  3. L takes fs/inode/fs_files.c semaphore.

  4. FOC interrupt happens exactly when L holds fs/inode/fs_files.c semaphore.

  5. H is awake and executes the control code

  6. H needs to perform some IO on fs (e.g. read encoder data or write new PWM settings to the FOC device) but L still holds fs/inode/fs_files.c semaphore.

  7. H passes execution to L and then back to H a moment later. We have to perform 2 context switches which is sufficient to mess up the controller.

162555557-7ea8c66f-82e3-4920-b49a-e16eea82264e

This phenomenon is very rare because the high priority task wake-up must happen exactly when fs/inode/fs_files.c is taken by the low priority task.

In my case, the event happened once in hundreds of thousands of correctly performed control cycles. For the motor controller, it's quite easy to detect because it manifests itself in strange noises coming from the motor. In other applications, this bug is either too rare to be observed or its consequences are too insignificant to be observable.

pkarashchenko commented 1 year ago

Just for my understanding, what if the timer (or any other peripheral) interrupt happens right after H is awakened and executing a control code? That will lead to two context switches even if the file list semaphore/mutex is free. I mean, will switching to rwlock implement a bulletproof solution?

xiaoxiang781216 commented 1 year ago

@raiden00pl how about switch to the critical section? we also found that semaphore is much slower than critical section.

raiden00pl commented 1 year ago

Just for my understanding, what if the timer (or any other peripheral) interrupt happens right after H is awakened and executing a control code? That will lead to two context switches even if the file list semaphore/mutex is free. I mean, will switching to rwlock implement a bulletproof solution?

A thread with a control loop must always be set to the highest priority. In that case interrupt won't cause full context switch. We should just handle interrupt and return to the control loop.

Currently we have to do much more work:

  1. handle interrupt,
  2. full context switch to the low priority task,
  3. small work in the low priority task,
  4. full context switch to the high priority task.

The easiest solution to this FOC specific problem is to simply reduce the frequency of the control loop. The case presented here concerns a situation where we want to squeeze the most out of MCU.

But if we consider hard real time applications in general, this issue can lead to serious bugs as we have very rare events with much higher jitter than normal. So it's worth solving anyway.

@raiden00pl how about switch to the critical section? we also found that semaphore is much slower than critical section.

This is probably the easiest solution, but I don't know if it won't have any negative effects, especially for SMP. rwlock would probably be useful elsewhere on the OS as well

pkarashchenko commented 1 year ago

@xiaoxiang781216 I'm not sure what exactly should be protected with a critical section. Based on what I read the FOC control sequence should be executed as non-interruptable portion of code after FOC interrupt. The only way to do this I think is to execute FOC control action from the FOC interrupt handler, otherwise interruption may happen (either file list mutex wait or other interrupt) right after FOC interrupt exits. Or do you suggest running the FOC control task under the critical section and interrupt only to wait on a semaphore posted by FOC interrupt (similar to a worker thread)? With such an approach change in file list protection is required (rwlock or critical section).

pkarashchenko commented 1 year ago

For the SMP case H and L can run in parallel on different cores, so critical section will not help and most probably spinlock will be required, so going with rwlock seems to be the best option.

raiden00pl commented 1 year ago

Protecting this line with a critical section fixed the problem for FOC from what I remember: https://github.com/apache/nuttx/blob/dd70d29d4e919c47951926b34c5dc88785592e6f/fs/inode/fs_files.c#L378

The question is, would replacing _files_semtake with a critical section not be enough?

Based on what I read the FOC control sequence should be executed as non-interruptable portion of code after FOC interrupt. The only way to do this I think is to execute FOC control action from the FOC interrupt handler, otherwise interruption may happen (either file list mutex wait or other interrupt) right after FOC interrupt exits.

It all depends on your application. Not breaking the control loop is not a requirement here, what is required is the execute the loop with a certain deadline - i.e. before start the next loop cycle. And here is the responsibility of the dev to properly design application, including selecting the appropriate frequency for the control loop. But this becomes difficult when in random, rare situations the jitter increases by an order of magnitude.

pkarashchenko commented 1 year ago

We can't use critical section since file list is growing dynamically by files_extend() that use kmm_realloc() that tries to take the heap protection mutex, so with critical section can be interrupted if other thread is holing heap mutex. That is why I think that rwlock might be a better solution.

zouboan commented 1 year ago

@xiaoxiang781216 sorry I did not continue with my work, I implemented a read-write lock, based on semaphore last year: https://github.com/zouboan/nuttx/commit/a9ff305f2d365462513c28f416134152f862c749

but I feel my implementation was something dumb, for example, the read lock implementation as follow:rwlock_read_lock flags = enter_critical_section(); ret = nxsem_wait_uninterruptible(&lock->read); if (ret < 0) { goto errout; } lock->reads++; if (lock->reads == 1) { ret = nxsem_wait_uninterruptible(&lock->write); if (ret < 0) { goto errout; } } ret = nxsem_post(&lock->read); leave_critical_section(flags); there need critical section protect between get read lock and get write lock, and at that time I thought why not use critical section on *filep = &list->fl_files[fd / CONFIG_NFILE_DESCRIPTORS_PER_BLOCK] [fd % CONFIG_NFILE_DESCRIPTORS_PER_BLOCK]; directly? maybe the community has a better implementation

raiden00pl commented 1 year ago

@zouboan could you prepare a PR anyway ? It will be easier to get feedback from the community that way.

zouboan commented 1 year ago

@raiden00pl OK, Thanks for your advice, let me have a check if there's conflicts with the master branch and prepare a PR

raiden00pl commented 10 months ago

Can we use rwlock_t from https://github.com/apache/nuttx/pull/10776 to solve this issue ? atomic_int is C11 feature I think, so probably it won't be available on all supported platforms.

I was testing rwlock_t from https://github.com/apache/nuttx/pull/9739 before but it was too slow. This new approach is faster and appears to solve the problem for FOC.

zouboan commented 10 months ago

Nice! Let me have a test.

raiden00pl commented 10 months ago

I came back to this problem.

Why not just add critical section like suggested before (only in this specific place):

  flags = enter_critical_section();             // <<<<<<---------------- must be here

  /* The descriptor is in a valid range to file descriptor... Get the
   * thread-specific file list.
   */

  /* And return the file pointer from the list */

  ret = nxmutex_lock(&list->fl_lock);
  if (ret < 0)
    {
      leave_critical_section(flags);              // <<<<<<<<<<<-------- here
      return ret;
    }

  *filep = &list->fl_files[fd / CONFIG_NFILE_DESCRIPTORS_PER_BLOCK]
                          [fd % CONFIG_NFILE_DESCRIPTORS_PER_BLOCK];

  /* if f_inode is NULL, fd was closed */

  if (!(*filep)->f_inode)
    {
      *filep = NULL;
      ret = -EBADF;
    }

  nxmutex_unlock(&list->fl_lock);
  leave_critical_section(flags);              // <<<<<<<<<<<-------- here

If the root of the problem is a potential interrupt occurring exactly between nxmutex_lock() and nxmutex_unlock(), then the easiest fix is to disable the interrupts at this point (must be before nxmutex_lock()).

With above change the maximum execution time of my control loop drops from 8864 cycles to 4615 cycles which looks like it solves the problem.

For SMP we probably should use spin_lock_irqxxx() here. rwlock would be ideal, but implementing an effective rwlock is not an easy task and rwlock spinlock that is already upstream will lead to deadlocks on non SMP targets.

zouboan commented 10 months ago

I tried use rwlock_t in #10776 to solve this issue, but unfortunately, there are some compilation errors:

CXX:  libcxxmini/libxx_delete.cxx In file included from /home/zouboan/d/FeedForward/nuttx/include/nuttx/spinlock.h:36,
                 from /home/zouboan/d/FeedForward/nuttx/include/nuttx/fs/fs.h:42,
                 from /home/zouboan/d/FeedForward/nuttx/include/nuttx/lib/lib.h:30,
                 from libcxxmini/libxx_delete.cxx:26:
/home/zouboan/gcc-arm-none-eabi-10-2020-q4-major/lib/gcc/arm-none-eabi/10.2.1/include/stdatomic.h:40:9: error: '_Atomic' does not name a type

because c++ file can't include stdatomic.h, which result spinlock.h can't be included by c++ file.

I tried solve this problem by:

#if defined(__cplusplus)
#include <atomic>
#  define Atomic(X) std::atomic< X >
#else
#include <stdatomic.h>
#endif

but it leads to another compilation errors:

CXX:  libcxxmini/libxx_delete.cxx In file included from /home/zouboan/d/FeedForward/nuttx/include/nuttx/fs/fs.h:42,
                 from /home/zouboan/d/FeedForward/nuttx/include/nuttx/lib/lib.h:30,
                 from libcxxmini/libxx_delete.cxx:26:
/home/zouboan/d/FeedForward/nuttx/include/nuttx/spinlock.h:37:10: fatal error: atomic: No such file or directory
   37 | #include <atomic>
raiden00pl commented 10 months ago

@zouboan rwlock from https://github.com/apache/nuttx/pull/10776 will work only for SMP, otherwise it may cause deadlocks under certain conditions. So unfortunately it won't help us either

zouboan commented 10 months ago

Why not just add critical section like suggested before (only in this specific place):

I think use spin_lock_irqsave and spin_unlock_irqrestore directly is reasonable, the wakes up delay of high priority pthread caused by the disable the interrupts in low priority task is short than the time context switch to the low priority task when high priority pthread runing, it achieves the similar goal but is not as Inefficient as #9739