embassy-rs / nrf-softdevice

Apache License 2.0
264 stars 79 forks source link

Fix critical-section impl so it disables all irqs. #163

Closed Dirbaio closed 1 year ago

Dirbaio commented 1 year ago

It was disabling irqs only in ICER[0], but there's more at ICER[1]. Softdevice irqs are always in ICER[0], so ICER[1] doesn't need any masking, but we still need to disable and restore all.

peter9477 commented 1 year ago

As noted in the matrix channel, have been running with this patch on two of our lab units for 15h each, without further failures of our "SPI task stall" issue, when previously we couldn't run for more than somewhere between a few seconds and a few minutes, or about an hour at most.

We've also gone through a function by function analysis of the scenario that must have led to our stall issue, and confirmed that the issue we've been observing would definitely have been a possible outcome of the problem this patches, where the RTC1 timer interrupt was basically being interrupted in the middle of its critical section (which shouldn't be possible) by the SPIM2 interrupt (which wasn't properly disabled before this patch), resulting in a second critical section being entered as though it was a nested critical section in the first one, leading to the race condition which let the embassy task that was doing SPIM2 work end up with the RUN_QUEUED flag set despite not being actually in the run queue, at which point it was in the twilight zone and would never be polled again.

It appears a stress test case could be developed with just one task waking on an embassy Timer::after (maybe a few times a second) while a second task is doing rapid SPIM2 writes many times a second, and as long as their timing isn't synchronized then eventually (I'm guessing within a few minutes at most, for the right combination) the second task would get stalled. If the second task updated a loop counter and the first task read it, it could report the failure as soon as it saw the counter stop incrementing.