Xilinx / dma_ip_drivers

Xilinx QDMA IP Drivers
https://xilinx.github.io/dma_ip_drivers/

[XDMA] BUG: scheduling while atomic in engine_service_poll #229

Open · eniv opened 1 year ago

eniv commented 1 year ago

Hello, I have recently updated to kernel 6.2.0 (EDIT: this also happens on an older kernel, 5.19). Running XDMA in polled mode causes a kernel crash in engine_service_poll (libxdma.c). This is the relevant section from dmesg:

[  351.813899] Call Trace:
[  351.813900]  <TASK>
[  351.813901]  dump_stack_lvl+0x48/0x70
[  351.813905]  dump_stack+0x10/0x20
[  351.813909]  __schedule_bug+0x64/0x80
[  351.813913]  __schedule+0x48f/0x5f0
[  351.813916]  ? raw_spin_rq_unlock+0x10/0x40
[  351.813919]  ? finish_task_switch.isra.0+0x84/0x290
[  351.813922]  schedule+0x68/0x110
[  351.813926]  engine_service_poll+0x94/0x150 [xdma]
[  351.813935]  xdma_thread_cmpl_status_proc+0x24/0x40 [xdma]
[  351.813943]  xthread_main+0x151/0x250 [xdma]
[  351.813952]  ? __pfx_autoremove_wake_function+0x10/0x10
[  351.813956]  ? __pfx_xthread_main+0x10/0x10 [xdma]
[  351.813965]  kthread+0xeb/0x120
[  351.813968]  ? __pfx_kthread+0x10/0x10
[  351.813973]  ret_from_fork+0x29/0x50
[  351.813976]  </TASK>
[  351.814395] BUG: scheduling while atomic: cmpl_status_th5/3296/0x00000000

This doesn't happen if the driver is loaded in interrupt mode. Looks like a blocking call inside a critical section?

eniv commented 1 year ago

It seems like the issue comes from the schedule() call in engine_service_wb_monitor:

/*
 * Define NUM_POLLS_PER_SCHED to limit how much time is spent
 * in the scheduler
 */
if (sched_limit != 0) {
    if ((sched_limit % NUM_POLLS_PER_SCHED) == 0)
        schedule();
}
sched_limit++;
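That schedule() is the blocking call: schedule() may sleep, so it must never run in atomic context. A minimal sketch of the failure mode, assuming lock_thread() takes a plain spinlock (which the spinlock_t in struct xdma_kthread suggests):

#include <linux/spinlock.h>
#include <linux/sched.h>

static DEFINE_SPINLOCK(demo_lock);   /* stand-in for the poll thread's lock */

static void atomic_sleep_bug(void)
{
    spin_lock(&demo_lock);   /* preemption disabled: atomic context begins */
    schedule();              /* sleeps -> "BUG: scheduling while atomic" */
    spin_unlock(&demo_lock);
}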
paulmnt commented 1 year ago

Hello, thank you for reaching out. I have not run into this issue yet and I have only scratched the surface of this driver in order to port it to more recent versions of Linux.

Xilinx has been ignoring PRs on this for a while, but I think it would still be best to log the issue on the main repository. Happy to take a look at your commit, if that can help.

eniv commented 1 year ago

Hi, I gave up on Xilinx support. Either they don't care, or the person who wrote the driver left and was never replaced.

Anyway, after I saw that there is ongoing work to completely redo the driver in the mainline kernel, starting with 6.3, I decided to hold off on digging deeper.

Also, keep in mind that this only happens when streaming in poll mode (you have to insert the kernel module with poll_mode=1, i.e. insmod xdma.ko poll_mode=1, as explained here).

cbryant203 commented 1 year ago

I have created a PR against Paul's fork which works for me. The problem is that the polling threads take lock_thread() and then run through the work queue, calling the fproc() function pointer while still holding the lock. That pointer leads to xdma_thread_cmpl_status_proc(), which ends up in engine_service_wb_monitor(), where it decides to call schedule() if it has waited too long. Holding a lock across such a complex operation is bad design, especially one that goes through a callback, since the called function is not supposed to rely on internal details of the caller.
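A simplified sketch of that loop (stand-in types; paraphrased from xdma_thread.c, not a verbatim copy of the driver):

#include <linux/list.h>
#include <linux/spinlock.h>

struct xthread {                          /* stand-in for struct xdma_kthread */
    spinlock_t lock;                      /* taken by lock_thread() */
    struct list_head work_list;
    int (*fproc)(struct list_head *item); /* -> xdma_thread_cmpl_status_proc() */
};

static void poll_loop_broken(struct xthread *thp)
{
    struct list_head *pos, *next;

    spin_lock(&thp->lock);                         /* lock_thread() */
    list_for_each_safe(pos, next, &thp->work_list)
        thp->fproc(pos);  /* -> engine_service_poll()
                           *    -> engine_service_wb_monitor()
                           *       -> schedule() under the spinlock */
    spin_unlock(&thp->lock);                       /* unlock_thread() */
}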

My simple fix is to drop the lock while processing each work item. However, I cannot be certain this is entirely correct. The work item itself should be fine, and the work list is protected by the lock, but if there is some way that pending work can get deleted from the work list, then we could have the list A->B->C with the thread working on A while holding a pointer to B in its next pointer (the one kept by list_for_each_safe()). If something then deletes B, the thread accesses the deleted item when it resumes and follows that pointer.
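In sketch form, reusing the stand-in types above (my paraphrase, not the exact PR diff):

static void poll_loop_fixed(struct xthread *thp)
{
    struct list_head *pos, *next;

    spin_lock(&thp->lock);
    list_for_each_safe(pos, next, &thp->work_list) {
        spin_unlock(&thp->lock);   /* fproc() may now sleep safely */
        thp->fproc(pos);
        spin_lock(&thp->lock);     /* re-take before the loop reads 'next';
                                    * if something freed 'next' while the
                                    * lock was dropped, following it is a
                                    * use-after-free (the caveat above) */
    }
    spin_unlock(&thp->lock);
}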

A better design would be for work items to be dequeued as each is ready to work on, but it's not clear to me whether a work item is processed only once, or whether we might need to traverse the list, find that some item cannot be dealt with yet, and leave it on the list for later. Needless to say, the source code is totally lacking in documentation of this sort of high-level information, so I have been trying to reverse-engineer the invariants.
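For illustration, the dequeue-first variant might look like this (same stand-in types, and assuming each item is processed exactly once):

static void poll_loop_dequeue(struct xthread *thp)
{
    struct list_head *pos;

    spin_lock(&thp->lock);
    while (!list_empty(&thp->work_list)) {
        pos = thp->work_list.next;
        list_del_init(pos);          /* dequeue under the lock */
        spin_unlock(&thp->lock);
        thp->fproc(pos);             /* item is private to this thread now;
                                      * no stale list pointers to worry about */
        spin_lock(&thp->lock);
    }
    spin_unlock(&thp->lock);
}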

eniv commented 1 year ago

I tested your suggestion with multiple C2H & H2C streams and it seems to work OK. However, I admit that it is also tricky for me to identify whether there is an asynchronous code path that would affect the work_list or even the work_item (I am not 100% convinced that the work_item is safe, as you mention).


cbryant203 commented 1 year ago

My reasoning about work_item access being safe is as follows. Assuming the interrupt modes are correct (OK, maybe that's the weak point): in interrupt mode, a work queue item invokes engine_service_work(), which calls engine_service(). In poll mode, the thread calls xdma_thread_cmpl_status_proc(), which calls engine_service_poll(), which calls engine_service_wb_monitor() and engine_service(). While the thread uses lock_thread(), the interrupt mode never takes this lock, so it must be unnecessary. Both variants use engine->lock, suggesting that lock is intended to be sufficient.
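The interrupt-mode path in sketch form, for comparison (paraphrased; the real engine_service_work() uses spin_lock_irqsave() and engine_service() takes extra arguments, but the locking shape is the point):

#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct xdma_engine_sketch {     /* reduced stand-in for struct xdma_engine */
    spinlock_t lock;            /* engine->lock, the lock both modes share */
    struct work_struct work;    /* set up by INIT_WORK() in engine_init() */
};

static void engine_service_sketch(struct xdma_engine_sketch *engine)
{
    /* stands in for engine_service(), the core routine both paths reach */
}

static void engine_service_work_sketch(struct work_struct *work)
{
    struct xdma_engine_sketch *engine =
        container_of(work, struct xdma_engine_sketch, work);

    spin_lock(&engine->lock);          /* lock_thread() never appears here */
    engine_service_sketch(engine);
    spin_unlock(&engine->lock);
}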

However, as I mentioned, the weak point in this argument is the assumption that the interrupt modes are correct. For example, I find it very suspicious that INIT_WORK() is used in engine_init(), but engine_destroy() does not use flush_workqueue() or destroy_workqueue(), making me wonder whether there is a bug whereby work might still be on the work queue when the engine is being deleted.
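If that suspicion is right, the conservative fix would be to drain the work in engine_destroy() before the engine memory goes away; a sketch reusing the stand-in struct above (my suggestion, not existing driver code):

static void engine_drain_work(struct xdma_engine_sketch *engine)
{
    /* cancel_work_sync() removes the work item if it is still queued and
     * waits for a running engine_service_work() to finish, so the engine
     * cannot be freed out from under the handler. */
    cancel_work_sync(&engine->work);
}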

eniv commented 1 year ago

That makes some sense; I like your line of thinking. It seems like lock_thread() and unlock_thread() only serialize the functions implemented in xdma_thread.c, and those are all functions that are called from libxdma in poll mode only. The looping calls between the two files are really a terrible design practice.