xuchenCN opened this issue 1 year ago
I am hitting the same thing with Ubuntu 22.04, kernel 5.15.0-48-generic, NVIDIA 510.85.02 on an NVIDIA Corporation GP108 [GeForce GT 1030]. The process does not respond to any signals.
[<0>] uvm_spin_loop+0xd3/0x150 [nvidia_uvm]
[<0>] uvm_tracker_wait+0xce/0x190 [nvidia_uvm]
[<0>] uvm_page_table_range_vec_clear_ptes+0x230/0x350 [nvidia_uvm]
[<0>] uvm_va_range_destroy+0x281/0x490 [nvidia_uvm]
[<0>] destroy_va_ranges.part.0+0x63/0x80 [nvidia_uvm]
[<0>] uvm_user_channel_detach+0x9a/0xd0 [nvidia_uvm]
[<0>] uvm_va_space_detach_all_user_channels+0xa6/0x120 [nvidia_uvm]
[<0>] uvm_va_space_destroy+0x1ed/0x690 [nvidia_uvm]
[<0>] uvm_release.constprop.0+0x42/0xa0 [nvidia_uvm]
[<0>] uvm_release_entry.part.0.isra.0+0x7a/0xa0 [nvidia_uvm]
[<0>] uvm_release_entry+0x2a/0x30 [nvidia_uvm]
[<0>] __fput+0x9f/0x260
[<0>] ____fput+0xe/0x20
[<0>] task_work_run+0x6d/0xb0
[<0>] do_exit+0x21b/0x3c0
[<0>] do_group_exit+0x3b/0xb0
[<0>] get_signal+0x150/0x900
[<0>] arch_do_signal_or_restart+0xde/0x100
[<0>] exit_to_user_mode_loop+0xc4/0x160
[<0>] exit_to_user_mode_prepare+0xa0/0xb0
[<0>] irqentry_exit_to_user_mode+0x9/0x20
[<0>] irqentry_exit+0x1d/0x30
[<0>] sysvec_reschedule_ipi+0x78/0xe0
[<0>] asm_sysvec_reschedule_ipi+0x1a/0x20
It looks like the same issue I'm currently seeing on PyTorch CI. We are using 525.85.05. The curious thing is that this only happens on the NVIDIA A10G in the g5 instances (https://aws.amazon.com/ec2/instance-types/g5/) used by our CI.
[47056.277731] NMI backtrace for cpu 12
[47056.277731] CPU: 12 PID: 30404 Comm: python Tainted: P OE 4.14.252-195.483.amzn2.x86_64 #1
[47056.277732] Hardware name: Amazon EC2 g5.4xlarge/, BIOS 1.0 10/16/2017
[47056.277732] task: ffff888594df2600 task.stack: ffffc9000e484000
[47056.277733] RIP: 0010:pvclock_clocksource_read+0x29/0xb0
[47056.277733] RSP: 0018:ffffc9000e4878c8 EFLAGS: 00000216
[47056.277734] RAX: 000077d74027b748 RBX: 00000000866c27da RCX: 0000000000000000
[47056.277734] RDX: 00000000ffffffff RSI: 0000000000000004 RDI: ffff88901435b300
[47056.277734] RBP: 0000000000000001 R08: 00003beba013db26 R09: 0000000000000200
[47056.277735] R10: ffffc9000e487800 R11: ffff888fd217f7c8 R12: 000000000162dd2c
[47056.277735] R13: ffffffff829b3580 R14: ffff8881903be420 R15: 0000000000000018
[47056.277735] FS: 00007fc8d76a1700(0000) GS:ffff888fd4300000(0000) knlGS:0000000000000000
[47056.277736] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[47056.277736] CR2: 000000c0017fd000 CR3: 000000000200a000 CR4: 00000000003406e0
[47056.277736] Call Trace:
[47056.277736] kvm_clock_get_cycles+0x1a/0x20
[47056.277737] getrawmonotonic64+0x3e/0xd0
[47056.277737] uvm_spin_loop+0x25/0xd0 [nvidia_uvm]
[47056.277737] uvm_tracker_wait+0x86/0x1d0 [nvidia_uvm]
[47056.277738] uvm_page_table_range_vec_clear_ptes+0x204/0x2f0 [nvidia_uvm]
[47056.277738] ? uvm_va_range_destroy+0x2c6/0x450 [nvidia_uvm]
[47056.277738] uvm_va_range_destroy+0x2c6/0x450 [nvidia_uvm]
[47056.277738] ? os_acquire_spinlock+0xe/0x20 [nvidia]
[47056.277739] ? _nv038527rm+0xc/0x20 [nvidia]
[47056.277739] destroy_va_ranges.part.4+0x4a/0x60 [nvidia_uvm]
[47056.277739] uvm_user_channel_detach+0x99/0x110 [nvidia_uvm]
[47056.277740] uvm_gpu_va_space_detach_all_user_channels+0x3b/0x60 [nvidia_uvm]
[47056.277740] uvm_va_space_detach_all_user_channels+0x3e/0x80 [nvidia_uvm]
[47056.277740] uvm_va_space_destroy+0x17e/0x420 [nvidia_uvm]
[47056.277741] uvm_release.isra.4+0x75/0x90 [nvidia_uvm]
[47056.277741] uvm_release_entry+0x68/0x90 [nvidia_uvm]
[47056.277741] __fput+0xd2/0x210
[47056.277741] task_work_run+0x8a/0xb0
[47056.277742] do_exit+0x390/0xb90
[47056.277742] ? hrtimer_cancel+0x15/0x20
[47056.277742] ? futex_wait+0x1d7/0x260
[47056.277742] do_group_exit+0x3a/0xa0
[47056.277742] get_signal+0x13f/0x790
[47056.277743] do_signal+0x36/0x610
[47056.277743] ? do_futex+0x378/0x4f0
[47056.277743] ? __check_object_size+0xb4/0x190
[47056.277743] ? __audit_syscall_exit+0x231/0x2b0
[47056.277743] exit_to_usermode_loop+0x85/0xc0
[47056.277744] do_syscall_64+0x101/0x110
[47056.277744] entry_SYSCALL_64_after_hwframe+0x41/0xa6
[47056.277744] RIP: 0033:0x7fca19eda065
[47056.277745] RSP: 002b:00007fc8d76a0ad0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[47056.277746] RAX: fffffffffffffdfc RBX: 00007fc9e0844aa8 RCX: 00007fca19eda065
[47056.277747] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007fc9e0844ad0
[47056.277747] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff
[47056.277747] R10: 00007fc8d76a0cf0 R11: 0000000000000246 R12: 00007fc9e0844ad8
[47056.277748] R13: 00007fc9e0844ad0 R14: 00007fc8d76a0cf0 R15: 00007fc9e0844acc
[47056.277748] Code: 00 00 55 53 48 83 ec 08 8b 17 89 d6 83 e6 fe 0f ae e8 0f 31 48 c1 e2 20 48 8b 5f 10 0f b6 6f 1d 48 09 d0 0f be 57 1c 48 2b 47 08 <89> d1 49 89 c0 f7 d9 49 d3 e8 89 d1 48 d3 e0 85 d2 8b 57 18 49
[47056.278326] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.017 msecs
[47056.279280] NMI backtrace for cpu 15
[47056.279281] CPU: 15 PID: 25819 Comm: python Tainted: P OE 4.14.252-195.483.amzn2.x86_64 #1
[47056.279282] Hardware name: Amazon EC2 g5.4xlarge/, BIOS 1.0 10/16/2017
[47056.279282] task: ffff888204e74c00 task.stack: ffffc9000cfe4000
[47056.279282] RIP: 0010:cpuacct_charge+0x68/0x80
[47056.279283] RSP: 0018:ffffc9000cfe7878 EFLAGS: 00000016
[47056.279283] RAX: ffff888fd43d64f0 RBX: ffff888204e74c80 RCX: 0000000000000000
[47056.279284] RDX: ffffffff82065520 RSI: 00000000000000be RDI: ffff888204e74c00
[47056.279284] RBP: 00000000000000be R08: 0000000000000000 R09: 0000000000000200
[47056.279284] R10: ffffc9000cfe7880 R11: ffff888fd217fda8 R12: ffff888ee4a02200
[47056.279285] R13: ffff888ee4a02200 R14: ffff888204e755d0 R15: ffff888204e74c00
[47056.279285] FS: 00007f4023c6b080(0000) GS:ffff888fd43c0000(0000) knlGS:0000000000000000
[47056.279286] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[47056.279286] CR2: 000000c001707000 CR3: 00000001bea28000 CR4: 00000000003406e0
[47056.279286] Call Trace:
[47056.279286] update_curr+0xe1/0x1a0
[47056.279287] pick_next_task_fair+0x98/0x540
[47056.279287] ? kvm_sched_clock_read+0x1a/0x30
[47056.279287] __schedule+0x15e/0x890
[47056.279287] schedule+0x28/0x80
[47056.279288] uvm_spin_loop+0x9d/0xd0 [nvidia_uvm]
[47056.279288] uvm_tracker_wait+0x86/0x1d0 [nvidia_uvm]
[47056.279288] uvm_page_table_range_vec_clear_ptes+0x204/0x2f0 [nvidia_uvm]
[47056.279289] ? uvm_ext_gpu_map_destroy+0xf1/0x1e0 [nvidia_uvm]
[47056.279289] uvm_ext_gpu_map_destroy+0xf1/0x1e0 [nvidia_uvm]
[47056.279289] uvm_va_range_destroy+0xb7/0x450 [nvidia_uvm]
[47056.279290] ? os_acquire_spinlock+0xe/0x20 [nvidia]
[47056.279290] ? _nv013138rm+0xbe/0x100 [nvidia]
[47056.279290] uvm_api_free+0x16d/0x270 [nvidia_uvm]
[47056.279291] uvm_ioctl+0x65e/0x14b0 [nvidia_uvm]
[47056.279291] ? _nv039439rm+0x61/0xb0 [nvidia]
[47056.279291] ? _nv011410rm+0x52/0xa0 [nvidia]
[47056.279291] ? os_acquire_spinlock+0xe/0x20 [nvidia]
[47056.279292] ? _nv038527rm+0xc/0x20 [nvidia]
[47056.279292] ? _nv043294rm+0xde/0x1e0 [nvidia]
[47056.279292] ? rm_ioctl+0x63/0xb0 [nvidia]
[47056.279292] ? __switch_to_asm+0x41/0x70
[47056.279293] ? __switch_to_asm+0x35/0x70
[47056.279293] ? __switch_to_asm+0x41/0x70
[47056.279293] ? __switch_to_asm+0x35/0x70
[47056.279293] ? __switch_to_asm+0x41/0x70
[47056.279293] ? ptep_set_access_flags+0x23/0x30
[47056.279294] ? uvm_unlocked_ioctl+0x2e/0x50 [nvidia_uvm]
[47056.279294] uvm_unlocked_ioctl+0x2e/0x50 [nvidia_uvm]
[47056.279294] uvm_unlocked_ioctl_entry+0x80/0xb0 [nvidia_uvm]
[47056.279294] do_vfs_ioctl+0xa4/0x630
[47056.279295] ? __audit_syscall_entry+0xbc/0x110
[47056.279295] ? syscall_trace_enter+0x1df/0x2e0
[47056.279295] SyS_ioctl+0x74/0x80
[47056.279295] do_syscall_64+0x67/0x110
[47056.279296] entry_SYSCALL_64_after_hwframe+0x41/0xa6
[47056.279296] RIP: 0033:0x7f4022dc4217
[47056.279296] RSP: 002b:00007ffcfda5e0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[47056.279297] RAX: ffffffffffffffda RBX: 000000003f58a4f0 RCX: 00007f4022dc4217
[47056.279297] RDX: 00007ffcfda5e0d0 RSI: 0000000000000022 RDI: 0000000000000004
[47056.279298] RBP: 000000003f58a4f0 R08: 0000000000000000 R09: 0000000000000000
[47056.279298] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[47056.279298] R13: 00007ffcfda5e0d0 R14: 0000000000000004 R15: 00000000aba63300
[47056.279298] Code: 97 71 f3 7e 48 c1 e1 03 48 01 34 08 48 8b 92 b0 00 00 00 48 85 d2 74 1f 48 8b 82 b8 00 00 00 65 48 03 05 74 71 f3 7e 48 01 34 08 <48> 8b 92 b0 00 00 00 48 85 d2 75 e1 c3 90 66 2e 0f 1f 84 00 00
[47074.375707] sysrq: Show backtrace of all active CPUs
Similar situation with NVIDIA driver 525.78.01, Debian 11, Linux 5.10.0-20-amd64.
Thank you for the bug report. I have filed NVIDIA internal bug 4121956 to investigate this.
Hi, I'm hitting the same problem. The stack:
The version is 470.130, kernel 5.10.
@johnhubbard Very nice. Thanks for your support; looking forward to your reply.
I have the same problem and hope it can be fixed as soon as possible.
Thanks for following up on this issue internally @johnhubbard, looking forward to a fix.
This issue regularly breaks our CI workflow testing on NVIDIA, because Docker fails to stop and kill the running container. What we suspect is that the container can get stuck in this state if the running process/job is cancelled while it is in progress (we are using GitHub Actions), leaving a defunct process with a stack trace like https://gist.github.com/apartridge/c514d612b276fde6fdd6de047b94e90f. Let us know if we can help with any debugging to understand more.
Let us also know if you become aware of any workarounds; currently our only recourse is to reboot the machine.
This is not specific to the open kernel modules. The bug is being investigated and tracked internally, but we'll probably stop reporting on it here, because this site is for issues that are unique to the open kernel modules.
I will note that this type of signature is well correlated with a GPU that has stopped responding, though.
Hi @johnhubbard, has there been any progress on this?
NVIDIA Open GPU Kernel Modules Version
515.48.07
Does this happen with the proprietary driver (of the same version) as well?
Yes
Operating System and Version
CentOS Linux release 7.9.2009 (Core)
Kernel Release
3.10.0-1160.el7.x86_64
Hardware: GPU
Tesla P100-PCIE-16GB
Describe the bug
Executing a PyTorch program hangs in uvm_spin_loop() and the process cannot be killed (even with kill -9).
stack
Following the code, I found a TODO in uvm_spin_loop():
// TODO: Bug 1710855: Also check fatal_signal_pending() here if the caller can handle it.
Is this a known bug, and might it be fixed in a newer version?
To Reproduce
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
No response