enjoy-digital / litex

Build your hardware, easily!

liteuart kernel hang #1399

Open staticfloat opened 2 years ago

staticfloat commented 2 years ago

When using the liteuart driver, we have run into a few reproducible kernel hangs. Opening multiple litex_term processes against a single device usually triggers one, which is understandable. However, we have also seen the same hang when user A attempts to communicate with the device while user B can communicate just fine, which is much more puzzling.

The kernel hang is surfaced in dmesg as the following message:

[ 1307.929783] CPU: 6 PID: 20668 Comm: litex_term Tainted: G           OE     5.15.0-46-generic #49-Ubuntu                                                                                   
[ 1307.929785] Hardware name: Supermicro C9Z390-CG/C9Z390-CG, BIOS 1.2 11/18/2019                                                                                                            
[ 1307.929785] RIP: 0010:liteuart_start_tx+0x56/0xb0 [liteuart]                               
[ 1307.929789] Code: 48 63 c8 83 c0 01 25 ff 0f 00 00 0f b6 0c 0e 89 82 74 01 00 00 83 87 e4 00 00 00 01 eb 02 f3 90 48 8b 47 10 48 83 c0 04 8b 00 <84> c0 75 f0 48 8b 47 10 89 08 8b 82 74 01 00 00 39 82 70 01 00 00
[ 1307.929790] RSP: 0018:ffffa39ec4597c18 EFLAGS: 00000086                                    
[ 1307.929791] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 000000000000000a                                                                                                             
[ 1307.929791] RDX: ffff8b0018a14600 RSI: ffff8b0125609000 RDI: ffff8b0054c08428                                                                                                             
[ 1307.929792] RBP: ffffa39ec4597c30 R08: ffff8b000ec31608 R09: ffff8b000ec31608                                                                                                             
[ 1307.929792] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b0054c08428                                                                                                             
[ 1307.929793] R13: ffff8b001198e801 R14: 0000000000000000 R15: ffff8b0018a14600                                                                                                             
[ 1307.929793] FS:  00007f9dc9f21640(0000) GS:ffff8b0f2dd80000(0000) knlGS:0000000000000000                                                                                                  
[ 1307.929794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033                              
[ 1307.929795] CR2: 00003d04f1a37048 CR3: 0000000211644003 CR4: 00000000003706e0                                                                                                             
[ 1307.929795] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000                                                                                                             
[ 1307.929796] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400                                                                                                             
[ 1307.929796] Call Trace:                     
[ 1307.929797]  <TASK>                         
[ 1307.929798]  ? __uart_start.isra.0+0x5e/0x70                                               
[ 1307.929801]  uart_write+0x101/0x1d0         
[ 1307.929803]  n_tty_write+0x216/0x3b0                                                       
[ 1307.929804]  ? __wake_up_pollfree+0x50/0x50                                                
[ 1307.929806]  do_tty_write+0x12d/0x270                                                      
[ 1307.929807]  ? __cond_resched+0x1a/0x50                                                    
[ 1307.929809]  ? eraser+0x4d0/0x4d0           
[ 1307.929810]  file_tty_write.constprop.0+0x93/0xc0                                          
[ 1307.929811]  tty_write+0x11/0x20            
[ 1307.929812]  new_sync_write+0x114/0x1b0                                                    
[ 1307.929814]  vfs_write+0x1d5/0x270          
[ 1307.929816]  ksys_write+0x67/0xf0           
[ 1307.929817]  __x64_sys_write+0x19/0x20                                                     
[ 1307.929817]  do_syscall_64+0x59/0xc0                                                       
[ 1307.929819]  ? syscall_exit_to_user_mode+0x27/0x50                                         
[ 1307.929820]  ? do_syscall_64+0x69/0xc0                                                     
[ 1307.929821]  entry_SYSCALL_64_after_hwframe+0x61/0xcb                                      
[ 1307.929823] RIP: 0033:0x7f9dcb4daa6f                                                       
[ 1307.929824] Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 19 c0 f7 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 5c c0 f7 ff 48
[ 1307.929824] RSP: 002b:00007f9dc9f20320 EFLAGS: 00000293 ORIG_RAX: 0000000000000001                                                                                                        
[ 1307.929825] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f9dcb4daa6f                                                                                                             
[ 1307.929826] RDX: 0000000000000001 RSI: 00007f9dca92f590 RDI: 0000000000000003                                                                                                             
[ 1307.929826] RBP: 00007f9dc9f215c0 R08: 0000000000000000 R09: 0000000000000000                                                                                                             
[ 1307.929827] R10: 00007f9dca92f590 R11: 0000000000000293 R12: 0000000000000003                                                                                                             
[ 1307.929827] R13: 00007f9dca92f590 R14: 0000000000000003 R15: 00005650dc9fcde0                                                                                                             
[ 1307.929828]  </TASK>

This is using the 2022.04 release of litex, communicating with the board defined in the xtrx_julia repository.
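
One possible reading of the trace, offered as an assumption rather than a confirmed diagnosis: the faulting RIP is inside liteuart_start_tx and RAX holds 0x00000000ffffffff, which would be consistent with a loop that polls a status register whose value never clears. The Python sketch below only models that behaviour. `StuckBus`, `HealthyBus`, `wait_tx_ready`, and `TXFULL_OFFSET` are hypothetical names for illustration and are not taken from the driver source.

```python
TXFULL_OFFSET = 0x04  # hypothetical register offset, for illustration only


class StuckBus:
    """Hypothetical bus whose reads always return all-ones."""

    def read32(self, offset):
        return 0xFFFFFFFF


class HealthyBus:
    """Hypothetical bus whose TX-full flag clears after a few polls."""

    def __init__(self, busy_polls):
        self.busy_polls = busy_polls

    def read32(self, offset):
        if self.busy_polls > 0:
            self.busy_polls -= 1
            return 1
        return 0


def wait_tx_ready(bus, max_polls):
    """Poll the low byte of the status register.

    Returns the number of polls used, or None if the flag never cleared
    within max_polls. An unbounded version of this loop would spin forever
    on StuckBus, which is the kind of hang the trace suggests.
    """
    for polls in range(1, max_polls + 1):
        if bus.read32(TXFULL_OFFSET) & 0xFF == 0:
            return polls
    return None


if __name__ == "__main__":
    print("healthy:", wait_tx_ready(HealthyBus(busy_polls=3), max_polls=1000))
    print("stuck:  ", wait_tx_ready(StuckBus(), max_polls=1000))
```

Bounding the poll count turns a silent hang into a reportable timeout, which is the design point of the sketch.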

Dolu1990 commented 2 years ago

Hi, I don't know if it is related, but I'm currently trying to run Debian on NaxRiscv and hit a somewhat similar issue with liteuart. Boot goes well for about 60 seconds, and then, when it tries to bind liteuart0, things go wrong: generally, the Linux timer gets called endlessly while the time base does not advance.

Not sure if my issue is related to a NaxRiscv bug (as it is quite fresh) or a driver bug.

Dolu1990 commented 2 years ago

(Note: from memory, my stack trace is quite different.)

staticfloat commented 2 years ago

Yes, I think it's likely different. In my case, we may have tracked it down to an old liteuart module version being used with newer gateware: because the register mapping can change with new gateware, we believe a mismatch between the two could cause this kind of issue.
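
A quick way to check for that kind of mismatch is to compare the CSR maps of the two gateware builds. The sketch below is a minimal example, assuming each build exports a CSV with rows of the form `csr_register,<name>,<address>,<size>,<mode>`. The helper names and the sample addresses are hypothetical and only illustrate the comparison.

```python
import csv
import io


def load_csr_registers(text):
    """Return {register_name: address} from CSR CSV text."""
    regs = {}
    for row in csv.reader(io.StringIO(text)):
        # Skip blank lines, comment lines, and non-register rows.
        if len(row) < 3 or row[0].strip() != "csr_register":
            continue
        regs[row[1].strip()] = int(row[2], 16)
    return regs


def diff_csr_maps(old, new):
    """Return {name: (old_addr, new_addr)} for registers whose address
    differs, or that exist in only one of the two maps (None if absent)."""
    names = sorted(set(old) | set(new))
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}


# Hypothetical sample data: the addresses are made up for illustration.
OLD_CSV = """\
#--------------------------------------------------------------
csr_register,uart_rxtx,0xf0001000,1,rw
csr_register,uart_txfull,0xf0001004,1,ro
csr_register,uart_rxempty,0xf0001008,1,ro
"""

NEW_CSV = """\
csr_register,uart_rxtx,0xf0002000,1,rw
csr_register,uart_txfull,0xf0002004,1,ro
csr_register,uart_rxempty,0xf0002008,1,ro
"""


def fmt(addr):
    return "absent" if addr is None else f"0x{addr:08x}"


if __name__ == "__main__":
    moved = diff_csr_maps(load_csr_registers(OLD_CSV), load_csr_registers(NEW_CSV))
    for name, (old_addr, new_addr) in moved.items():
        print(f"{name}: {fmt(old_addr)} -> {fmt(new_addr)}")
```

Any register reported here has a different address in the two builds, so a kernel module built against the old map would be accessing the wrong location on the new gateware.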