NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

Please fix the known bug in uvm_spin_loop() #456

Open xuchenCN opened 1 year ago

xuchenCN commented 1 year ago

NVIDIA Open GPU Kernel Modules Version

515.48.07

Does this happen with the proprietary driver (of the same version) as well?

Yes

Operating System and Version

CentOS Linux release 7.9.2009 (Core)

Kernel Release

3.10.0-1160.el7.x86_64

Hardware: GPU

Tesla P100-PCIE-16GB

Describe the bug

Execute pytorch program hangs on uvm_spin_loop() and can't kill it (even kill -9)

Stack:

[<ffffffffc33d41c2>] uvm_spin_loop+0xb2/0xe0 [nvidia_uvm]
[<ffffffffc3418ab3>] wait_for_entry_with_spin+0x63/0x170 [nvidia_uvm]
[<ffffffffc341912c>] uvm_tracker_wait_for_entry+0x4c/0x70 [nvidia_uvm]
[<ffffffffc3416f2c>] uvm_push_end_and_wait+0x4c/0x70 [nvidia_uvm]
[<ffffffffc33efa42>] channel_pool_add+0x512/0x8e0 [nvidia_uvm]
[<ffffffffc33eff14>] channel_manager_create_pools+0x104/0x1a0 [nvidia_uvm]
[<ffffffffc33f124c>] uvm_channel_manager_create+0xcc/0x360 [nvidia_uvm]
[<ffffffffc33e340b>] init_gpu+0x6cb/0xc40 [nvidia_uvm]
[<ffffffffc33e4dec>] add_gpu+0x7bc/0xdb0 [nvidia_uvm]
[<ffffffffc33e559a>] uvm_gpu_retain_by_uuid+0x1ba/0x230 [nvidia_uvm]
[<ffffffffc33e91ed>] uvm_va_space_register_gpu+0x3d/0x500 [nvidia_uvm]
[<ffffffffc33e68cc>] uvm_api_register_gpu+0x4c/0x70 [nvidia_uvm]
[<ffffffffc33d7da7>] uvm_ioctl+0xed7/0x1790 [nvidia_uvm]
[<ffffffffc33d869c>] uvm_unlocked_ioctl+0x3c/0x60 [nvidia_uvm]
[<ffffffffc33d87a4>] uvm_unlocked_ioctl_entry+0x64/0xd0 [nvidia_uvm]
[<ffffffffa38632e0>] do_vfs_ioctl+0x3a0/0x5b0
[<ffffffffa3863591>] SyS_ioctl+0xa1/0xc0
[<ffffffffa3d93f92>] system_call_fastpath+0x25/0x2a

Following the code, I found a TODO in uvm_spin_loop():

// TODO: Bug 1710855: Also check fatal_signal_pending() here if the caller can handle it.

This appears to be a known bug; could it be fixed in a new version?
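To make the TODO concrete, here is a rough sketch of the pattern I think it describes (this is only my own illustration, not the actual uvm_spin_loop() code; the example_spin_wait() helper, its arguments, and the timeout value are all hypothetical). The point is that a wait loop which also checks fatal_signal_pending() can return early and let a SIGKILL terminate the process, provided the callers can handle the error:

/*
 * Hypothetical sketch only -- not the real uvm_spin_loop(). It shows a
 * bounded poll loop that also gives up when the task has a fatal signal
 * (SIGKILL) pending, so the process is not left unkillable if the GPU
 * never completes the tracked work.
 */
#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/jiffies.h>
#include <linux/sched/signal.h>
#include <linux/types.h>

/* wait_done() stands in for "the tracker entry has completed". */
static int example_spin_wait(bool (*wait_done)(void *data), void *data,
                             unsigned long timeout_ms)
{
    unsigned long deadline = jiffies + msecs_to_jiffies(timeout_ms);

    while (!wait_done(data)) {
        /* Let SIGKILL interrupt the wait instead of spinning forever. */
        if (fatal_signal_pending(current))
            return -EINTR;

        if (time_after(jiffies, deadline))
            return -ETIMEDOUT;

        /* Yield between polls so the loop does not burn a CPU. */
        usleep_range(50, 100);
    }

    return 0;
}

Of course, every caller up the stack (uvm_tracker_wait_for_entry(), uvm_push_end_and_wait(), ...) would have to propagate and clean up after -EINTR, which is presumably why the TODO says "if the caller can handle it".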

To Reproduce

        # Fragment from a training method; it assumes `import torch` and
        # `import torch.nn as nn`, and that self.model, self.head and conf
        # are defined elsewhere.
        self.model.train()
        conf.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = nn.DataParallel(self.model, device_ids=[0, 1])
        self.model.to(conf.device)

        # The second DataParallel wrap below is what triggers the hang:
        self.head = nn.DataParallel(self.head, device_ids=[0, 1])
        self.head.to(conf.device)

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

No response

knzivid commented 1 year ago

I am hitting the same thing with Ubuntu 22.04, kernel 5.15.0-48-generic, and NVIDIA 510.85.02 on an NVIDIA Corporation GP108 [GeForce GT 1030]. The process does not respond to any signals.

[<0>] uvm_spin_loop+0xd3/0x150 [nvidia_uvm]
[<0>] uvm_tracker_wait+0xce/0x190 [nvidia_uvm]
[<0>] uvm_page_table_range_vec_clear_ptes+0x230/0x350 [nvidia_uvm]
[<0>] uvm_va_range_destroy+0x281/0x490 [nvidia_uvm]
[<0>] destroy_va_ranges.part.0+0x63/0x80 [nvidia_uvm]
[<0>] uvm_user_channel_detach+0x9a/0xd0 [nvidia_uvm]
[<0>] uvm_va_space_detach_all_user_channels+0xa6/0x120 [nvidia_uvm]
[<0>] uvm_va_space_destroy+0x1ed/0x690 [nvidia_uvm]
[<0>] uvm_release.constprop.0+0x42/0xa0 [nvidia_uvm]
[<0>] uvm_release_entry.part.0.isra.0+0x7a/0xa0 [nvidia_uvm]
[<0>] uvm_release_entry+0x2a/0x30 [nvidia_uvm]
[<0>] __fput+0x9f/0x260
[<0>] ____fput+0xe/0x20
[<0>] task_work_run+0x6d/0xb0
[<0>] do_exit+0x21b/0x3c0
[<0>] do_group_exit+0x3b/0xb0
[<0>] get_signal+0x150/0x900
[<0>] arch_do_signal_or_restart+0xde/0x100
[<0>] exit_to_user_mode_loop+0xc4/0x160
[<0>] exit_to_user_mode_prepare+0xa0/0xb0
[<0>] irqentry_exit_to_user_mode+0x9/0x20
[<0>] irqentry_exit+0x1d/0x30
[<0>] sysvec_reschedule_ipi+0x78/0xe0
[<0>] asm_sysvec_reschedule_ipi+0x1a/0x20

huydhn commented 1 year ago

It looks like the same issue I'm currently seeing on PyTorch CI. We are currently using 525.85.05. The curious thing is that this only happens on the NVIDIA A10G (https://aws.amazon.com/ec2/instance-types/g5/) instances used by our CI.

[47056.277731] NMI backtrace for cpu 12
[47056.277731] CPU: 12 PID: 30404 Comm: python Tainted: P           OE   4.14.252-195.483.amzn2.x86_64 #1
[47056.277732] Hardware name: Amazon EC2 g5.4xlarge/, BIOS 1.0 10/16/2017
[47056.277732] task: ffff888594df2600 task.stack: ffffc9000e484000
[47056.277733] RIP: 0010:pvclock_clocksource_read+0x29/0xb0
[47056.277733] RSP: 0018:ffffc9000e4878c8 EFLAGS: 00000216
[47056.277734] RAX: 000077d74027b748 RBX: 00000000866c27da RCX: 0000000000000000
[47056.277734] RDX: 00000000ffffffff RSI: 0000000000000004 RDI: ffff88901435b300
[47056.277734] RBP: 0000000000000001 R08: 00003beba013db26 R09: 0000000000000200
[47056.277735] R10: ffffc9000e487800 R11: ffff888fd217f7c8 R12: 000000000162dd2c
[47056.277735] R13: ffffffff829b3580 R14: ffff8881903be420 R15: 0000000000000018
[47056.277735] FS:  00007fc8d76a1700(0000) GS:ffff888fd4300000(0000) knlGS:0000000000000000
[47056.277736] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[47056.277736] CR2: 000000c0017fd000 CR3: 000000000200a000 CR4: 00000000003406e0
[47056.277736] Call Trace:
[47056.277736]  kvm_clock_get_cycles+0x1a/0x20
[47056.277737]  getrawmonotonic64+0x3e/0xd0
[47056.277737]  uvm_spin_loop+0x25/0xd0 [nvidia_uvm]
[47056.277737]  uvm_tracker_wait+0x86/0x1d0 [nvidia_uvm]
[47056.277738]  uvm_page_table_range_vec_clear_ptes+0x204/0x2f0 [nvidia_uvm]
[47056.277738]  ? uvm_va_range_destroy+0x2c6/0x450 [nvidia_uvm]
[47056.277738]  uvm_va_range_destroy+0x2c6/0x450 [nvidia_uvm]
[47056.277738]  ? os_acquire_spinlock+0xe/0x20 [nvidia]
[47056.277739]  ? _nv038527rm+0xc/0x20 [nvidia]
[47056.277739]  destroy_va_ranges.part.4+0x4a/0x60 [nvidia_uvm]
[47056.277739]  uvm_user_channel_detach+0x99/0x110 [nvidia_uvm]
[47056.277740]  uvm_gpu_va_space_detach_all_user_channels+0x3b/0x60 [nvidia_uvm]
[47056.277740]  uvm_va_space_detach_all_user_channels+0x3e/0x80 [nvidia_uvm]
[47056.277740]  uvm_va_space_destroy+0x17e/0x420 [nvidia_uvm]
[47056.277741]  uvm_release.isra.4+0x75/0x90 [nvidia_uvm]
[47056.277741]  uvm_release_entry+0x68/0x90 [nvidia_uvm]
[47056.277741]  __fput+0xd2/0x210
[47056.277741]  task_work_run+0x8a/0xb0
[47056.277742]  do_exit+0x390/0xb90
[47056.277742]  ? hrtimer_cancel+0x15/0x20
[47056.277742]  ? futex_wait+0x1d7/0x260
[47056.277742]  do_group_exit+0x3a/0xa0
[47056.277742]  get_signal+0x13f/0x790
[47056.277743]  do_signal+0x36/0x610
[47056.277743]  ? do_futex+0x378/0x4f0
[47056.277743]  ? __check_object_size+0xb4/0x190
[47056.277743]  ? __audit_syscall_exit+0x231/0x2b0
[47056.277743]  exit_to_usermode_loop+0x85/0xc0
[47056.277744]  do_syscall_64+0x101/0x110
[47056.277744]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
[47056.277744] RIP: 0033:0x7fca19eda065
[47056.277745] RSP: 002b:00007fc8d76a0ad0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[47056.277746] RAX: fffffffffffffdfc RBX: 00007fc9e0844aa8 RCX: 00007fca19eda065
[47056.277747] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007fc9e0844ad0
[47056.277747] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff
[47056.277747] R10: 00007fc8d76a0cf0 R11: 0000000000000246 R12: 00007fc9e0844ad8
[47056.277748] R13: 00007fc9e0844ad0 R14: 00007fc8d76a0cf0 R15: 00007fc9e0844acc
[47056.277748] Code: 00 00 55 53 48 83 ec 08 8b 17 89 d6 83 e6 fe 0f ae e8 0f 31 48 c1 e2 20 48 8b 5f 10 0f b6 6f 1d 48 09 d0 0f be 57 1c 48 2b 47 08 <89> d1 49 89 c0 f7 d9 49 d3 e8 89 d1 48 d3 e0 85 d2 8b 57 18 49
[47056.278326] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.017 msecs
[47056.279280] NMI backtrace for cpu 15
[47056.279281] CPU: 15 PID: 25819 Comm: python Tainted: P           OE   4.14.252-195.483.amzn2.x86_64 #1
[47056.279282] Hardware name: Amazon EC2 g5.4xlarge/, BIOS 1.0 10/16/2017
[47056.279282] task: ffff888204e74c00 task.stack: ffffc9000cfe4000
[47056.279282] RIP: 0010:cpuacct_charge+0x68/0x80
[47056.279283] RSP: 0018:ffffc9000cfe7878 EFLAGS: 00000016
[47056.279283] RAX: ffff888fd43d64f0 RBX: ffff888204e74c80 RCX: 0000000000000000
[47056.279284] RDX: ffffffff82065520 RSI: 00000000000000be RDI: ffff888204e74c00
[47056.279284] RBP: 00000000000000be R08: 0000000000000000 R09: 0000000000000200
[47056.279284] R10: ffffc9000cfe7880 R11: ffff888fd217fda8 R12: ffff888ee4a02200
[47056.279285] R13: ffff888ee4a02200 R14: ffff888204e755d0 R15: ffff888204e74c00
[47056.279285] FS:  00007f4023c6b080(0000) GS:ffff888fd43c0000(0000) knlGS:0000000000000000
[47056.279286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[47056.279286] CR2: 000000c001707000 CR3: 00000001bea28000 CR4: 00000000003406e0
[47056.279286] Call Trace:
[47056.279286]  update_curr+0xe1/0x1a0
[47056.279287]  pick_next_task_fair+0x98/0x540
[47056.279287]  ? kvm_sched_clock_read+0x1a/0x30
[47056.279287]  __schedule+0x15e/0x890
[47056.279287]  schedule+0x28/0x80
[47056.279288]  uvm_spin_loop+0x9d/0xd0 [nvidia_uvm]
[47056.279288]  uvm_tracker_wait+0x86/0x1d0 [nvidia_uvm]
[47056.279288]  uvm_page_table_range_vec_clear_ptes+0x204/0x2f0 [nvidia_uvm]
[47056.279289]  ? uvm_ext_gpu_map_destroy+0xf1/0x1e0 [nvidia_uvm]
[47056.279289]  uvm_ext_gpu_map_destroy+0xf1/0x1e0 [nvidia_uvm]
[47056.279289]  uvm_va_range_destroy+0xb7/0x450 [nvidia_uvm]
[47056.279290]  ? os_acquire_spinlock+0xe/0x20 [nvidia]
[47056.279290]  ? _nv013138rm+0xbe/0x100 [nvidia]
[47056.279290]  uvm_api_free+0x16d/0x270 [nvidia_uvm]
[47056.279291]  uvm_ioctl+0x65e/0x14b0 [nvidia_uvm]
[47056.279291]  ? _nv039439rm+0x61/0xb0 [nvidia]
[47056.279291]  ? _nv011410rm+0x52/0xa0 [nvidia]
[47056.279291]  ? os_acquire_spinlock+0xe/0x20 [nvidia]
[47056.279292]  ? _nv038527rm+0xc/0x20 [nvidia]
[47056.279292]  ? _nv043294rm+0xde/0x1e0 [nvidia]
[47056.279292]  ? rm_ioctl+0x63/0xb0 [nvidia]
[47056.279292]  ? __switch_to_asm+0x41/0x70
[47056.279293]  ? __switch_to_asm+0x35/0x70
[47056.279293]  ? __switch_to_asm+0x41/0x70
[47056.279293]  ? __switch_to_asm+0x35/0x70
[47056.279293]  ? __switch_to_asm+0x41/0x70
[47056.279293]  ? ptep_set_access_flags+0x23/0x30
[47056.279294]  ? uvm_unlocked_ioctl+0x2e/0x50 [nvidia_uvm]
[47056.279294]  uvm_unlocked_ioctl+0x2e/0x50 [nvidia_uvm]
[47056.279294]  uvm_unlocked_ioctl_entry+0x80/0xb0 [nvidia_uvm]
[47056.279294]  do_vfs_ioctl+0xa4/0x630
[47056.279295]  ? __audit_syscall_entry+0xbc/0x110
[47056.279295]  ? syscall_trace_enter+0x1df/0x2e0
[47056.279295]  SyS_ioctl+0x74/0x80
[47056.279295]  do_syscall_64+0x67/0x110
[47056.279296]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
[47056.279296] RIP: 0033:0x7f4022dc4217
[47056.279296] RSP: 002b:00007ffcfda5e0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[47056.279297] RAX: ffffffffffffffda RBX: 000000003f58a4f0 RCX: 00007f4022dc4217
[47056.279297] RDX: 00007ffcfda5e0d0 RSI: 0000000000000022 RDI: 0000000000000004
[47056.279298] RBP: 000000003f58a4f0 R08: 0000000000000000 R09: 0000000000000000
[47056.279298] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[47056.279298] R13: 00007ffcfda5e0d0 R14: 0000000000000004 R15: 00000000aba63300
[47056.279298] Code: 97 71 f3 7e 48 c1 e1 03 48 01 34 08 48 8b 92 b0 00 00 00 48 85 d2 74 1f 48 8b 82 b8 00 00 00 65 48 03 05 74 71 f3 7e 48 01 34 08 <48> 8b 92 b0 00 00 00 48 85 d2 75 e1 c3 90 66 2e 0f 1f 84 00 00
[47074.375707] sysrq: Show backtrace of all active CPUs

nvidia-bug-report.log.gz

Davidrjx commented 1 year ago

Similar situation with NVIDIA driver 525.78.01 on a Debian 11 (Linux 5.10.0-20-amd64) image.

johnhubbard commented 1 year ago

Thank you for the bug report. I have filed NVIDIA internal bug 4121956 to investigate this.

chentao-kernel commented 1 year ago

Hi, I hit the same problem. The stack trace is in the attached image.

chentao-kernel commented 1 year ago

The driver version is 470.130, kernel 5.10.

Davidrjx commented 1 year ago

> Thank you for the bug report. I have filed NVIDIA internal bug 4121956 to investigate this.

@johnhubbard Very nice, thanks for your support. Looking forward to your reply.

Iamleos commented 1 year ago

I have the same problem and am looking forward to a fix as soon as possible.

apartridge commented 1 year ago

Thanks for following up on this issue internally @johnhubbard, looking forward to a fix.

This issue regularly breaks our CI workflow testing on NVIDIA, because Docker fails to stop and kill the running container. What we suspect is that the container gets stuck in this state when the running process/job is cancelled mid-run (we are using GitHub Actions), ending up with a defunct process with a stack trace like https://gist.github.com/apartridge/c514d612b276fde6fdd6de047b94e90f. Let us know if we can help with any debugging to understand more.

Let us also know if you become aware of any workarounds; currently our only recourse is to reboot the machine.
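In case it helps anyone else hitting this in CI: below is a rough, untested sketch (just an idea, not something we have deployed) of how a runner could at least detect the stuck state by scanning /proc/<pid>/stack for frames in uvm_spin_loop before deciding to reboot the machine. It needs root, and the pid_is_stuck() helper is purely illustrative:

/*
 * Untested idea: scan /proc/<pid>/stack for tasks parked in uvm_spin_loop
 * so a CI runner can flag the hung state (and schedule a reboot) instead
 * of waiting on a "docker kill" that never completes. Requires root.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static int pid_is_stuck(const char *pid)
{
    char path[64], line[256];
    FILE *f;
    int stuck = 0;

    snprintf(path, sizeof(path), "/proc/%s/stack", pid);
    f = fopen(path, "r");
    if (!f)
        return 0; /* no permission, or the task already exited */

    while (fgets(line, sizeof(line), f)) {
        if (strstr(line, "uvm_spin_loop")) {
            stuck = 1;
            break;
        }
    }
    fclose(f);
    return stuck;
}

int main(void)
{
    DIR *proc = opendir("/proc");
    struct dirent *ent;

    if (!proc)
        return 1;

    while ((ent = readdir(proc)) != NULL) {
        /* Only purely numeric directory names are PIDs. */
        if (strspn(ent->d_name, "0123456789") != strlen(ent->d_name))
            continue;
        if (pid_is_stuck(ent->d_name))
            printf("PID %s appears stuck in uvm_spin_loop\n", ent->d_name);
    }
    closedir(proc);
    return 0;
}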

johnhubbard commented 12 months ago

This is not specific to the open kernel modules. The bug is being investigated and tracked internally, but we'll probably stop reporting on it here, because this site is for issues that are unique to the open kernel modules.

I will note that this type of signature is well correlated with a GPU that has stopped responding, though.

NetEase-FuXi commented 12 months ago

> Thank you for the bug report. I have filed NVIDIA internal bug 4121956 to investigate this.

Hi @johnhubbard, is there any progress on this?