Deniz-Eren / dev-can-linux

Porting of Linux CAN-bus drivers to QNX
GNU General Public License v2.0
4 stars 1 forks source link

Rare race condition found impacting file close #66

Closed Deniz-Eren closed 1 month ago

Deniz-Eren commented 1 month ago

Describe the bug

During aggressive open/close scenario when active message reads/writes are happening, a rare race-condition found to lock up the driver causing a crash.

Rare race condition found in void* rx_loop (void* arg) impacting cleaning up blocked clients during file close.

To Reproduce

After running 'ctest' repeated, the issue is replicated:

ctest

Test project /root/userhome/Repos/outlook/dev-can-linux/build
    Start 1: ssh-driver-baud-tests
1/6 Test #1: ssh-driver-baud-tests ............   Passed    0.72 sec
    Start 2: ssh-driver-integrity-tests
2/6 Test #2: ssh-driver-integrity-tests .......   Passed    5.14 sec
    Start 3: ssh-driver-io-tests
3/6 Test #3: ssh-driver-io-tests ..............   Passed   43.11 sec
    Start 4: ssh-driver-raw-tests

DRIVER STUCK

Top shows CPU runaway:

CPU states: 82.0% user, 0.0% kernel
CPU  0 Idle: 16.8% 
CPU  1 Idle: 19.0% 
Memory: 4095M total, 3892M avail, page size 4K

    PID   TID PRI STATE    HH:MM:SS    CPU  COMMAND             
84344854     4  10 Run       0:00:05  47.33% dev-can-linux       
184333     2  10 Rcv       1:04:09  30.90% io-pkt-v6-hc        
84344854    34  10 Rcv       0:00:58   2.10% dev-can-linux       
155658     1  10 Rcv       0:23:29   0.79% devc-pty            
647187     1  10 SigW      0:12:40   0.68% sshd                
143369     4  10 Rcv       0:00:12   0.15% pipe                
84344854    39  10 Rply      0:00:00   0.04% dev-can-linux       
        1    15  10 Run       0:00:00   0.00% kernel              
88072222     1  10 Rply      0:00:00   0.00% top                 
1191959     1  10 SigW      0:00:01   0.00% sshd                

            Min        Max       Average 
CPU 0 idle:   16%        24%        19% 
CPU 1 idle:   16%        23%        19% 
Mem Avail:   3892MB      3895MB      3893MB  
Processes:    30         30         30    
Threads:     112        112        112    

Checking pidin we see tests/driver/raw/driver-raw-tests is still running:

#pidin arg
    pid Arguments
84344854 dev-can-linux -Ex -vvv 
88043549 /root/userhome/Repos/outlook/dev-can-linux/build/tests/driver/raw/driver-raw-tests 

Log of driver 'dev-can-linux -Ex -vvv' shows:

CAN_DEVCTL_RX_FRAME_RAW_: _RESMGR_NOREPLY
io_devctl -> id: 0
CAN_DEVCTL_RX_FRAME_RAW_: _RESMGR_NOREPLY
io_devctl -> id: 0
CAN_DEVCTL_RX_FRAME_RAW_: _RESMGR_NOREPLY
io_devctl -> id: 0
...
CAN_DEVCTL_RX_FRAME_RAW_: _RESMGR_NOREPLY
io_devctl -> id: 0
CAN_DEVCTL_RX_FRAME_RAW_: _RESMGR_NOREPLY
io_devctl -> id: 0
CAN_DEVCTL_RX_FRAME_RAW_: _RESMGR_NOREPLY
io_devctl -> id: 0

After CTRL-C shows:

Shutdown IRQ loop
Shutdown program
io_devctl -> id: 0
Shutting down adv_pci
Removing card
Removing device adv_pci-can0
io_devctl -> id: 0
io_devctl -> id: 0
io_devctl -> id: 0
io_devctl -> id: 0
io_devctl -> id: 0
unregister_netdev: adv_pci-can0
netif_stop_queue
free_irq; irq: 258
cancel_delayed_work_sync (3c52da2e10)
netif_tx exit: adv_pci-can0
timer_loop shutdown
io_devctl -> id: 0
io_close_ocb -> id: 0
can_ocb_free -> /dev/can0/rx0

Still stuck.

Further attempt to stop with 'kill -s SIGKILL ' shows:

Killed 

When in stuck state, and we test using candump and cansend as follows, the result is a core dump when cansend is run.

candump -u0,rx0 
/dev/can0/rx0 TS: 162329081ms [EFF] 1234 [2] AB CD 00 00 00 00 00 00
devctl CAN_DEVCTL_RX_FRAME_RAW_BLOCK: No such process

cansend -u0,tx0 -m0x1234,1,0xABCD

Then driver prints the following logs during this stuck state test:

io_devctl -> id: 0
io_open -> (id: 1, rcvid: 10)
can_ocb_calloc -> /dev/can0/tx0 (id: 1)
io_devctl -> id: 1
netif_stop_queue
CAN_DEVCTL_TX_FRAME_RAW; /dev/can0/tx0 TS: 0ms [EFF] 1234 [2] AB CD 00 00 00 00 00 00
netif_rx; adv_pci-can0 [EFF] 1234 [2] AB CD  0  0  0  0  0  0
netif_wake_queue
io_close_ocb -> id: 1
can_ocb_free -> /dev/can0/tx0
io_devctl -> id: 0
CAN_DEVCTL_RX_FRAME_RAW_BLOCK; /dev/can0/rx0 TS: 162329081ms [EFF] 1234 [2] AB CD 00 00 00 00 00 00
io_devctl -> id: 0
CAN_DEVCTL_RX_FRAME_RAW_BLOCK; /dev/can0/rx0 TS: 162329081ms [EFF] 1234 [2] AB CD 00 00 00 00 00 00
io_devctl -> id: 0

Process 84344854 (dev-can-linux) terminated SIGSEGV code=1 fltno=11 ip=00000049c7d7f8a8(/opt/bin/dev-can-linux@so.3.0+0x000000000000deb3) mapaddr=00000000000118a8. ref=0000002228943ae8
Memory fault (core dumped)

Expected behavior

Should close file without incident.

Screenshots

NA

Platform (please complete the following information)

Driver (please complete the following information)

Additional context

NA