apache / nuttx

Apache NuttX is a mature, real-time embedded operating system (RTOS)
https://nuttx.apache.org/
Apache License 2.0

Thread holding spinlock can be blocked. #9531

Open · patacongo opened 1 year ago

patacongo commented 1 year ago

@MasayukiIshikawa @masayuki2009 I was just looking at the implementation of spin_lock_irqsave().

spin_lock_irqsave() basically just:

  1. Calls irqsave() to disable local interrupts, and
  2. Calls spin_lock() to lock the spinlock.

The calling thread then holds the lock, and the code section that follows is protected.
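For reference, here is a minimal sketch of those two steps. It is illustrative only, not the actual NuttX source; up_irq_save(), spin_lock(), irqstate_t, and spinlock_t are the existing NuttX primitives, and my_spin_lock_irqsave() is a hypothetical name.

```c
#include <nuttx/irq.h>
#include <nuttx/spinlock.h>

/* Illustrative sketch of the two steps described above */

irqstate_t my_spin_lock_irqsave(FAR volatile spinlock_t *lock)
{
  irqstate_t flags = up_irq_save(); /* 1. Disable interrupts on this CPU */

  spin_lock(lock);                  /* 2. Spin until the lock is acquired */
  return flags;                     /* Saved state for the later restore */
}
```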

However, isn't it possible that another thread running on a different CPU could cause a higher priority thread to be started, and that this could cause the thread holding the spinlock to be blocked? It looks like that could happen and, if so, that would be bad. Am I missing something?

A simple solution would be to call sched_lock() from spin_lock_irqsave() while holding the spinlock and call sched_unlock() when the spinlock is unlocked. That is reasonably lightweight and should guarantee that the thread holding the spinlock runs on the CPU until it relinquishes the spinlock.
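A sketch of what that proposal might look like, using hypothetical "_pinned" wrapper names; sched_lock(), sched_unlock(), up_irq_save(), and up_irq_restore() are the existing NuttX calls:

```c
#include <nuttx/irq.h>
#include <nuttx/sched.h>
#include <nuttx/spinlock.h>

irqstate_t spin_lock_irqsave_pinned(FAR volatile spinlock_t *lock)
{
  irqstate_t flags = up_irq_save();

  sched_lock();          /* Keep this thread running on its CPU... */
  spin_lock(lock);
  return flags;
}

void spin_unlock_irqrestore_pinned(FAR volatile spinlock_t *lock,
                                   irqstate_t flags)
{
  spin_unlock(lock);
  sched_unlock();        /* ...until the spinlock is relinquished */
  up_irq_restore(flags);
}
```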

I suppose that issue would not be unique to spin_lock_irqsave() but would really apply to all uses of spinlocks.

This is similar to another issue that added a call to sched_lock() in enter_critical_section() because the thread holding the critical section was being suspended. In that case, there was no real, hard error, but the results could be confusing and counterintuitive. This seems like the same problem, but here the results could be a real problem in certain cases where keeping the spinlock locked indefinitely affects system performance.

@xiaoxiang781216 has more experience with that related issue. Please comment on this issue.

xiaoxiang781216 commented 1 year ago

We are trying to call sched_lock() in the critical section (and could do the same for the spinlock), but it isn't very easy because:

  1. sched_lock() calls enter_critical_section() internally.
  2. The code of sched_lock() is very long; calling it directly in spin_lock_irqsave() would reduce performance a lot.

> However, isn't it possible that another thread running on a different CPU could cause a higher priority thread to be started, and that this could cause the thread holding the spinlock to be blocked? It looks like that could happen and, if so, that would be bad. Am I missing something?

Even if a thread running on one CPU triggers scheduling on another CPU, the target CPU that called spin_lock_irqsave() will postpone the request until after spin_unlock_irqrestore() is called, since the IPI that triggers the schedule is also blocked.

patacongo commented 1 year ago
> 1. sched_lock() calls enter_critical_section() internally.

Perhaps that critical section in sched_lock() might be replaced with something similar to spin_lock_irqsave(). The problem is that logic like spin_setbit(), spin_clrbit(), and the call to nxsched_merge_prioritized() does require a critical section. So it is inescapable in the current design.

> 2. The code of sched_lock() is very long; calling it directly in spin_lock_irqsave() would reduce performance a lot.

Does enter_critical_section() now call sched_lock()? I don't see that call but I know we talked about that in the past.

Since spin_lock() already calls enter_critical_section(), my concern in this case will be fixed when the general issue is fixed.

Also, my preference would be to lose a bit of performance if that is what is necessary to avoid the hard performance bug that would occur if a thread is blocked indefinitely while holding a spinlock.

mu578 commented 1 year ago

Let's say you want to explore the fringe: you want to tune and create a kind of high-performance, lock-free system/mode within a specific and simple hardware environment, say one with two CPUs.

You could set a hard limit on spin cycles, then test clock timing for light, medium, and heavy jobs; each time you hit the hard limit, you log it.

That way, you could derive a policy/control for a thread going off to lalaland in that setup.

patacongo commented 1 year ago

Suspending the holder of a spinlock can cause a myriad of failures. Suspending the holder does not free the spinlock; the spinlock remains locked while the holder is blocked. In the worst case, this can cause deadlocks since the holder of the spinlock is blocked and may not have the priority to run again to release it.

Another spinlock performance issue: #1488

The point of this issue, @mu578 your comments, and #1488 is that better management of spinlocks is needed for them to really work as needed in SMP. And, as @xiaoxiang781216 has pointed out, there are some tricky design issues that would have to be addressed to accomplish that.

Perhaps we could do something as simple as adding an in-a-spinlock flag in the TCB, which would very temporarily lock the scheduler just like lockcount > 0? Very simple, but also kind of kludgey.
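A rough sketch of that idea follows. this_task() and the tcb_s flags field are existing NuttX constructs; the flag bit, its value, and the helper names are all hypothetical, and the scheduler would still need to be taught to honor the flag the way it honors lockcount > 0.

```c
#include <nuttx/sched.h>
#include <nuttx/spinlock.h>

/* Hypothetical flag bit; the value is illustrative and might collide
 * with real TCB_FLAG_* definitions.
 */

#define TCB_FLAG_IN_SPINLOCK (1 << 15)

static inline void spin_lock_flagged(FAR volatile spinlock_t *lock)
{
  FAR struct tcb_s *tcb = this_task();

  tcb->flags |= TCB_FLAG_IN_SPINLOCK;  /* Scheduler would keep this thread
                                        * in place, as with lockcount > 0 */
  spin_lock(lock);
}

static inline void spin_unlock_flagged(FAR volatile spinlock_t *lock)
{
  FAR struct tcb_s *tcb = this_task();

  spin_unlock(lock);
  tcb->flags &= ~TCB_FLAG_IN_SPINLOCK; /* Holder may be suspended again */
}
```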

TaiJuWu commented 1 year ago

I found that in the Zephyr project, taking a spinlock disables interrupts, and releasing the spinlock re-enables them. That can prevent the spinlock holder from being blocked. Maybe we could follow them? Reference: zephyr
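For illustration, typical usage of Zephyr's spinlock API looks roughly like this; k_spin_lock() disables local interrupts and returns a key that k_spin_unlock() uses to restore them (my_lock and protected_work() are example names):

```c
#include <zephyr/kernel.h>
#include <zephyr/spinlock.h>

static struct k_spinlock my_lock;

void protected_work(void)
{
  k_spinlock_key_t key = k_spin_lock(&my_lock); /* IRQs off on this CPU */

  /* ... critical section ... */

  k_spin_unlock(&my_lock, key);                 /* IRQ state restored */
}
```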

patacongo commented 1 year ago

> I found that in the Zephyr project, taking a spinlock disables interrupts, and releasing the spinlock re-enables them. That can prevent the spinlock holder from being blocked. Maybe we could follow them? Reference: zephyr

In SMP mode, disabling local interrupts will not prevent the thread from being blocked. But locking the scheduler with sched_lock() will.

TaiJuWu commented 1 year ago

> > I found that in the Zephyr project, taking a spinlock disables interrupts, and releasing the spinlock re-enables them. That can prevent the spinlock holder from being blocked. Maybe we could follow them? Reference: zephyr
>
> In SMP mode, disabling local interrupts will not prevent the thread from being blocked. But locking the scheduler with sched_lock() will.

There are two methods to enter the scheduler:

  1. A thread gives up its CPU by itself (e.g., via a mutex or event group).
  2. An interrupt forces the CPU to be given up (e.g., a timer, IPI, ...).

We only prevent scheduling from interrupts; the first case should be guaranteed by the programmer. If I am wrong, please correct me.

patacongo commented 1 year ago

> There are two methods to enter the scheduler:
>
> 1. A thread gives up its CPU by itself (e.g., via a mutex or event group).
>
> 2. An interrupt forces the CPU to be given up (e.g., a timer, IPI, ...).
>
> We only prevent scheduling from interrupts; the first case should be guaranteed by the programmer. If I am wrong, please correct me.

Also, a more generalized case of your 2) is:

  3. In SMP mode, a higher priority thread needs to run and the thread holding the spinlock becomes the lowest priority running thread. Then the thread holding the spinlock is blocked.

Disabling interrupts, of course, has bad effects on real time performance. Interrupts would have to be disabled throughout the time that the spinlock is held. So I don't like Zephyr's solution very much.

In SMP machines, different interrupts can be handled by different CPUs. So disabling local CPU interrupts doesn't work. NuttX does have enter/leave_critical_section() which will disable interrupts on all CPUs, but it is very expensive (it uses more spinlocks and some inter-CPU messaging).

sched_lock() works better than disabling interrupts. It prevents the thread from being blocked on the CPU without disabling interrupts. Interrupts can still occur, and kernel threads can still respond to interrupt events (but on different CPUs). So there should be no loss of real-time performance.

Giving up the CPU to momentarily service an interrupt is usually OK; when the interrupt returns, it will always return to the thread holding the spinlock. The thread stays in place if the scheduler is locked.

The only way that the thread could be blocked is, as you say in 1), if the thread holding the spinlock voluntarily blocks itself. That would be a very bad situation in a real-time system and should be avoided.
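To make the hazard concrete, here is a sketch of that anti-pattern; g_lock, g_sem, and bad_idea() are hypothetical names, and exact signatures may vary between NuttX versions:

```c
#include <nuttx/irq.h>
#include <nuttx/semaphore.h>
#include <nuttx/spinlock.h>

static spinlock_t g_lock = SP_UNLOCKED;
static sem_t g_sem;

void bad_idea(void)
{
  irqstate_t flags = spin_lock_irqsave(&g_lock);

  nxsem_wait(&g_sem);  /* BAD: may suspend this thread while g_lock is
                        * still held; other CPUs then spin indefinitely */

  spin_unlock_irqrestore(&g_lock, flags);
}
```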

patacongo commented 1 year ago

> sched_lock() works better than disabling interrupts. It prevents the thread from being blocked on the CPU without disabling interrupts. Interrupts can still occur, and kernel threads can still respond to interrupt events (but on different CPUs). So there should be no loss of real-time performance.

Hmmm... this might not be true. I would need to review the code. At least at one time, the logic locked all threads in place on all CPUs, so new ready-to-run threads couldn't run but would go into the pending task list until the scheduler was unlocked.

mu578 commented 1 year ago

> Suspending the holder of a spinlock can cause a myriad of failures. Suspending the holder does not free the spinlock; the spinlock remains locked while the holder is blocked. In the worst case, this can cause deadlocks since the holder of the spinlock is blocked and may not have the priority to run again to release it.
>
> Another spinlock performance issue: #1488
>
> The point of this issue, @mu578 your comments, and #1488 is that better management of spinlocks is needed for them to really work as needed in SMP. And, as @xiaoxiang781216 has pointed out, there are some tricky design issues that would have to be addressed to accomplish that.
>
> Perhaps we could do something as simple as adding an in-a-spinlock flag in the TCB, which would very temporarily lock the scheduler just like lockcount > 0? Very simple, but also kind of kludgey.

That was my point: indeed, start with a narrowed environment, then experiment with different isolated strategies. I don't think one can solve this kind of model with a purely theoretical approach; it is too practically driven. Maybe two different types of scheduling for high- and low-priority groups would be a solution.