NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

Timeout waiting for RPC from GSP! #446

Open ghost opened 1 year ago

ghost commented 1 year ago

NVIDIA Open GPU Kernel Modules Version

525.85.05

Does this happen with the proprietary driver (of the same version) as well?

I cannot test this

Operating System and Version

Arch Linux

Kernel Release

Linux [HOSTNAME] 6.1.8-arch1-1 #1 SMP PREEMPT_DYNAMIC Tue, 24 Jan 2023 21:07:04 +0000 x86_64 GNU/Linux

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 3050 Laptop GPU (UUID: GPU-071149ae-386e-0017-3b5b-7ea80801f725)

Describe the bug

When I open an OpenGL application, like Yamagi Quake II, at a certain point the whole system freezes and runs at about 1 FPS. I generally have to REISUB when this happens.

To Reproduce

  1. Open Yamagi Quake II
  2. Change workspace, open pavucontrol to select a new audio sink for the game, switch back

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

Related: #272

mdrasheek commented 1 year ago

Hi everyone, we have identified the issue related to our case with the help of the Nvidia enterprise support team. In our machines, all the GPU cards are connected via NVLink, and as per the Nvidia doc here

All GPUs directly connected to each other through NVLink must be assigned to the same VM

So, after removing the NVLinks, it's working as expected. Hope this helps someone.
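
Before physically removing links, a quick way to verify whether and how the GPUs in a box are NVLink-connected (a minimal sketch, assuming a reasonably recent driver) is:

nvidia-smi nvlink --status   # per-GPU NVLink lane status; inactive/empty output means no links in use
nvidia-smi topo -m           # topology matrix; NV# entries mark NVLink-connected GPU pairs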

paulraines68 commented 1 year ago

@mdrasheek We were running on a box with 8 GPUs and NVLinks. We are not doing VMs but running SLURM jobs that allocate cores/memory via cgroups and somehow isolate the assigned GPUs as well. My guess is that this is similar to what the VMs do.

Still, we don't want to remove the NVLinks, as jobs that allocate multiple GPUs should still be able to take advantage of them.

mdrasheek commented 1 year ago

@paulraines68 As you have mentioned, you are using driver version 530.x, which is not a datacenter driver. Check the supported products of driver 530.x on this page; you will not find the A100. You should use the data center drivers, in which they say the Xid 119 errors are fixed. Refer here to learn more about datacenter drivers.

If you want to install using yum, you can refer to the official installation instructions here.
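
As a rough sketch of that route on a RHEL-like system (the repo URL and module stream name are assumptions; check NVIDIA's datacenter driver installation docs and the dnf module listing for what actually applies to your distro and driver branch):

# Add NVIDIA's CUDA repository (RHEL 8 x86_64 shown)
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
# See which nvidia-driver module streams are available, then install a data center branch
sudo dnf module list nvidia-driver
sudo dnf module install nvidia-driver:535-dkms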

alpiquero commented 12 months ago

I can confirm the issue on our machines running the 535 driver version. Disabling GSP works around the problem.
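
For reference, "disabling GSP" here means turning off GSP firmware offload via the nvidia module parameter; a minimal sketch (as far as I know this only takes effect with the proprietary kernel modules, since the open kernel modules require GSP firmware):

# /etc/modprobe.d/nvidia-gsp.conf (file name is arbitrary)
options nvidia NVreg_EnableGpuFirmware=0

Rebuild the initramfs afterwards (e.g. sudo update-initramfs -u on Debian/Ubuntu, or sudo dracut --force on RHEL-like systems) and reboot so the parameter takes effect.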

mdrasheek commented 11 months ago

I can confirm the issue on our machines running the 535 driver version. Disabling GSP works around the problem.

You can check which cards are interconnected with NVLink using the command nvidia-smi topo -m. For example, if GPU0 and GPU1 are connected with NV8 and GPU2 and GPU3 are connected with NV12, then allocating one job to GPU0 & GPU1 and another job to GPU2 & GPU3 shouldn't cause the issue. But when you mix them up (like allocating a job to GPU1 and GPU2) you might have an issue.

Hope this helps!
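
For illustration only (not real output), a nvidia-smi topo -m matrix for the pairing described above might look roughly like this; the NV# cells are the NVLink-connected pairs, and SYS means the GPUs only reach each other over PCIe/system interconnect:

        GPU0   GPU1   GPU2   GPU3
GPU0     X     NV8    SYS    SYS
GPU1    NV8     X     SYS    SYS
GPU2    SYS    SYS     X     NV12
GPU3    SYS    SYS    NV12    X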

Qubitium commented 11 months ago

Just encountered this bug on an A100 PCIE 80GB with all the latest 525/530/535 official Linux drivers. Nvidia, this is a disgrace of support for a $10K, or $13K (due to price gouging), GPU. Please fix this.

mckbrchill commented 11 months ago

Hi everyone, we have identified the issue related to our case with the help of the Nvidia enterprise support team. In our machines, all the GPU cards are connected via NVLink, and as per the Nvidia doc here

All GPUs directly connected to each other through NVLink must be assigned to the same VM

So, after removing the NVLinks, it's working as expected. Hope this helps someone.

Hi @mdrasheek! I'm facing the same issue on an A100: timeout from RPC, and when I disable it, torch.cuda.is_available() takes 3 minutes to execute and other torch code is very slow on first initialization. Does removing NVLinks help to fix the pytorch speed? And maybe you've found some other workarounds? Thanks!

mdrasheek commented 10 months ago

Hi everyone, we have identified the issue related to our case with the help of the Nvidia enterprise support team. In our machines, all the GPU cards are connected via NVLink, and as per the Nvidia doc here: All GPUs directly connected to each other through NVLink must be assigned to the same VM. So, after removing the NVLinks, it's working as expected. Hope this helps someone.

Hi @mdrasheek! I'm facing the same issue on an A100: timeout from RPC, and when I disable it, torch.cuda.is_available() takes 3 minutes to execute and other torch code is very slow on first initialization. Does removing NVLinks help to fix the pytorch speed? And maybe you've found some other workarounds? Thanks!

Yes, when we remove the NVLinks we are able to use the cards in separate VMs.

Or check the NVLink connectivity using nvidia-smi topo -m. If 2 cards are interconnected using the same NVLink, you can use those 2 cards in a single VM.

A sample screenshot from a 4-GPU VM is given below:

[screenshot: nvidia-smi topo -m output showing the NVLink topology]

You can see that GPU 0 and 1 are connected with one NVLink and GPU 2 and 3 are connected with another NVLink. So, the following deployment options will work:

  1. VM launched with GPU 0 and 1
  2. VM launched with GPU 2 and 3
  3. VM with all 4 cards.

Or, as I mentioned earlier, if you are using a single GPU for a single process, you can enable MIG, create MIG slices, and use them accordingly (a sketch of the commands is below).
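
A minimal sketch of that MIG flow (the same commands quoted later in this thread; profile IDs and slice sizes depend on the GPU model, so check the -lgip listing for your card):

sudo nvidia-smi -i 0 -mig 1         # enable MIG mode on GPU 0 (may require idling the GPU and a reset)
sudo nvidia-smi mig -lgip           # list the GPU instance profiles available on this card
sudo nvidia-smi mig -i 0 -cgi 0 -C  # create the full-size GPU instance (profile 0) plus its compute instance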

mckbrchill commented 10 months ago

Or check the NVLink connectivity using nvidia-smi topo -m. If 2 cards are interconnected using the same NVLink, you can use those 2 cards in a single VM.

I rent a VM with a single GPU from a cloud provider, and the nvidia-smi topo -m check shows that the connectivity for my GPU is 'Self' (X). Unfortunately, I'm still experiencing a 3+ minute delay on the first torch.cuda.is_available() in a process, although I can see in nvidia-smi -i 0 -q that the GPU Virtualization Mode is Pass-Through. The cloud works such that my VM is not always assigned to one specific GPU; it varies which GPU from the A100 pool is free at the time. I'm not sure, but maybe there is something in how it's organized on the cloud provider side, and I just can't directly see whether there is an NVLink. I also tested on another cloud provider's VM and there were no problems with GSP timeout errors. The difference in terms of GPU between these VMs: on the problematic VM there is an A100-SXM4-80GB and the GPU Virtualization Mode is Pass-Through; on the second one I had an A100-PCIE-40GB and the GPU Virtualization Mode was None.

I haven't tried MIG slices yet; maybe that will help. Thanks for your answer!

mdrasheek commented 10 months ago

Or check the NVLink connectivity using nvidia-smi topo -m. If 2 cards are interconnected using the same NVLink, you can use those 2 cards in a single VM.

I rent a VM with a single GPU from a cloud provider, and the nvidia-smi topo -m check shows that the connectivity for my GPU is 'Self' (X). Unfortunately, I'm still experiencing a 3+ minute delay on the first torch.cuda.is_available() in a process, although I can see in nvidia-smi -i 0 -q that the GPU Virtualization Mode is Pass-Through. The cloud works such that my VM is not always assigned to one specific GPU; it varies which GPU from the A100 pool is free at the time. I'm not sure, but maybe there is something in how it's organized on the cloud provider side, and I just can't directly see whether there is an NVLink. I also tested on another cloud provider's VM and there were no problems with GSP timeout errors. The difference in terms of GPU between these VMs: on the problematic VM there is an A100-SXM4-80GB and the GPU Virtualization Mode is Pass-Through; on the second one I had an A100-PCIE-40GB and the GPU Virtualization Mode was None.

I haven't tried MIG slices yet; maybe that will help. Thanks for your answer!

SXM4 cards mostly come in bundles of 4 or 8 connected with NVLinks, and it is possible that your cloud provider gives you 1 card from such a bundle. You have to go with MIG, and it should help you overcome the lag issue.

oavner commented 10 months ago

For all the k8s users out there looking for a GitOps way to fix this: we're using the gpu-operator as a chart dependency in our own repo and added the following ConfigMap to the templates deployed by our chart:

apiVersion: v1
kind: ConfigMap
data:
  nvidia.conf: |
    NVreg_EnableGpuFirmware=0
metadata:
  name: kernel-module-params

We also set the chart value driver.kernelModuleConfig.name to kernel-module-params in order to update the cluster-policy with the new ConfigMap name.

It might require a node reboot, but it works perfectly.
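
For context, the corresponding values override would look something like the sketch below (assuming values are passed straight through to the gpu-operator subchart; if it sits under an alias in your umbrella chart, nest accordingly):

# values.yaml (gpu-operator values)
driver:
  kernelModuleConfig:
    name: kernel-module-params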

neon-ninja commented 9 months ago

I want to highlight a point here: after enabling MIG and creating a full MIG slice, the GPU works without lag.

nvidia-smi -mig 1
nvidia-smi mig -cgi 0 -C

The above commands create slice 0, which is the entire 80GB MIG slice. But with MIG mode disabled, the lag comes back. I hope this may lead the way to a resolution for someone.

Thanks, this worked for me

robkooper commented 5 months ago

I noticed the following, running in OpenStack with 4 A100 GPUs. I created 2 VMs, each with a GPU. I installed conda and pytorch for testing. When testing, I open a python terminal and run the following code; do not close python, leave it open.

import torch
torch.cuda.is_available()

With both machines on driver 470, everything works; it does not matter in what order you do this.

With both machines on driver 535, the first machine to run the python code has access to the GPU; the second machine fails.

With machine 1 on driver 470 and machine 2 on driver 535: if you start on machine 1 (470) and then machine 2 (535), the second machine cannot get access to the GPU. If you first do machine 2 (535) and then machine 1 (470), both machines have access to the GPU.

It took a while to find the exact sequence, but hopefully this helps somebody figure out what the issue is. It feels like driver 535 is locking all the GPUs, and driver 470 either does not see the lock or just ignores it.

hansbogert commented 3 months ago

The RPC timeout is just a red herring, in my opinion. Something is simply wrong in our setup; removing the timeout just makes the environment wait longer and eventually report success. However, every invocation of a CUDA kernel takes 5 minutes in my case in a Kubernetes cluster.

During those 5 minutes, the host application is at 100% CPU, doing ... well, I don't know; strace didn't give anything meaningful.
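
A simple way to check whether the stall is in the first CUDA/GSP initialization rather than in the application itself (a sketch, assuming PyTorch is installed in the environment) is to time a bare query:

time python -c "import torch; print(torch.cuda.is_available())"

If this alone takes minutes while nvidia-smi in another shell shows the GPU essentially idle, the delay is most likely in driver/GSP initialization rather than in the CUDA kernels themselves.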

jelmd commented 3 months ago

... However, every invocation of a CUDA kernel takes 5 minutes in my case in a Kubernetes cluster.

During those 5 minutes, the host application is at 100% CPU, doing ... well, I don't know; strace didn't give anything meaningful.

FWIW: We do not use Kubernetes but LXC, and observed similarly strange behavior with H100 PCIe based boxes; sometimes 1 or 2 of the 4 GPUs were not even usable (launch took forever). Often a bare-metal reboot of the box got it back to normal behavior, but not always. A distro upgrade from Ubuntu 20.04.6 to 22.04.4 fixed this problem (it upgraded the nv driver from 535.86.10 to 535.161.08 as well). No problems anymore (for ~8 weeks).

jelmd commented 3 months ago

PS: In case you cannot upgrade (or if it does not help), booting with the kernel params amd_iommu=pt pci=realloc=off might help.
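
On a GRUB-based distro, a sketch of making those parameters persistent (file locations differ with other bootloaders):

# /etc/default/grub -- append to the existing parameters
GRUB_CMDLINE_LINUX_DEFAULT="... amd_iommu=pt pci=realloc=off"

# then regenerate the GRUB config and reboot
sudo update-grub                                   # Debian/Ubuntu
# or: sudo grub2-mkconfig -o /boot/grub2/grub.cfg  # RHEL-like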

mtijanic commented 3 months ago

As mentioned in https://github.com/NVIDIA/open-gpu-kernel-modules/issues/446#issuecomment-1445748798, "Timeout waiting for RPC from GSP" is about as generic an error as you can get, and any internal GSP issue can lead to the same message being printed. nvidia-bug-report.log will contain more info, and we can use that to see whether two such errors are related or not, and maybe pinpoint the root cause. So, please attach it the next time you run into said issue.

hansbogert commented 3 months ago

This is definitely NVLink related, which also explains why enabling MIG 'solves' it, because MIG does not use NVLink.

Disabling NVLink via a kernel module parameter 'solved' it for my single VM / single GPU workloads, i.e., use

NVreg_NvLinkDisable=1

However, I still have the issue where I have 1 VM / 2 GPUs, and there I'd rather not remove the NVLink capability, as I assume performance would be impacted.

So my question becomes: is it really a hard requirement to have

All GPUs directly connected to each other through NVLink must be assigned to the same VM

That would be cumbersome, because having 1 big VM makes maintenance a whole lot more difficult for us mere mortals who only have 1 GPU node.

save-buffer commented 3 months ago

Hi, I seem to be running into this issue on my desktop with an NVIDIA 3060 (driver 555.42.02). I have a backtrace from dmesg here in case it's helpful. The problem reproduces consistently; the following appears in dmesg every time I try to run nvidia-smi.

[   22.275339] NVRM: _kgspLogXid119: ********************************* GSP Timeout **********************************
[   22.275344] NVRM: _kgspLogXid119: Note: Please also check logs above.
[   22.275346] NVRM: nvAssertFailedNoLog: Assertion failed: expectedFunc == pHistoryEntry->function @ kernel_gsp.c:1826
[   22.275355] NVRM: GPU at PCI:0000:83:00: GPU-5d732ef6-0814-bfa9-536d-461dab42db2c
[   22.275357] NVRM: Xid (PCI:0000:83:00): 119, pid=849, name=nv_open_q, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 4097 (GSP_INIT_DONE) (0x0 0x0).
[   22.275360] NVRM: GPU0 GSP RPC buffer contains function 4108 (UCODE_LIBOS_PRINT) and data 0x0000000000000000 0x0000000000000000.
[   22.275362] NVRM: GPU0 RPC history (CPU -> GSP):
[   22.275363] NVRM:     entry function                   data0              data1              ts_start           ts_end             duration actively_polling
[   22.275364] NVRM:      0    73   SET_REGISTRY          0x0000000000000000 0x0000000000000000 0x000619a5fc23b6e6 0x0000000000000000          y
[   22.275366] NVRM:     -1    72   GSP_SET_SYSTEM_INFO   0x0000000000000000 0x0000000000000000 0x000619a5fc23b6e0 0x0000000000000000           
[   22.275368] NVRM: GPU0 RPC event history (CPU <- GSP):
[   22.275369] NVRM:     entry function                   data0              data1              ts_start           ts_end             duration during_incomplete_rpc
[   22.275370] NVRM:      0    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x000619a5fc31df0f 0x000619a5fc31df10      1us y
[   22.275372] NVRM:     -1    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x000619a5fc31dd90 0x000619a5fc31dd90          y
[   22.275374] NVRM:     -2    4128 GSP_POST_NOCAT_RECORD 0x0000000000000002 0x0000000000000027 0x000619a5fc31aecc 0x000619a5fc31aecd      1us y
[   22.275375] NVRM:     -3    4098 GSP_RUN_CPU_SEQUENCER 0x000000000000060a 0x0000000000003fe2 0x000619a5fc30fa9e 0x000619a5fc310cd3   4661us y
[   22.275377] NVRM:     -4    4128 GSP_POST_NOCAT_RECORD 0x0000000000000005 0x00000000014eb9bc 0x000619a5fc2c87e6 0x000619a5fc2c87e6          y
[   22.275379] NVRM:     -5    4128 GSP_POST_NOCAT_RECORD 0x0000000000000005 0x00000000014eb9bc 0x000619a5fc2c8476 0x000619a5fc2c847a      4us y
[   22.275381] CPU: 14 PID: 849 Comm: nv_open_q Tainted: G           OE      6.8.11_1 #1
[   22.275384] Hardware name: To Be Filled By O.E.M. TRX50 WS/TRX50 WS, BIOS 7.09 01/23/2024
[   22.275385] Call Trace:
[   22.275386]  <TASK>
[   22.275388]  dump_stack_lvl+0x64/0x80
[   22.275395]  _kgspRpcRecvPoll+0x42d/0x4c0 [nvidia]
[   22.275503]  kgspWaitForRmInitDone_IMPL+0x31/0x102 [nvidia]
[   22.275573]  kgspBootstrap_TU102+0x1f7/0x330 [nvidia]
[   22.275629]  kgspInitRm_IMPL+0x767/0x1150 [nvidia]
[   22.275683]  ? ia32_sys_call+0x420/0x1d20
[   22.275687]  ? rm_get_uefi_console_status+0x32/0x40 [nvidia]
[   22.275764]  RmInitAdapter+0x114a/0x1c90 [nvidia]
[   22.275822]  ? srso_alias_return_thunk+0x5/0xfbef5
[   22.275825]  ? srso_alias_return_thunk+0x5/0xfbef5
[   22.275827]  ? preempt_count_add+0x6e/0xa0
[   22.275830]  ? srso_alias_return_thunk+0x5/0xfbef5
[   22.275832]  ? _raw_spin_lock_irqsave+0x27/0x60
[   22.275834]  ? srso_alias_return_thunk+0x5/0xfbef5
[   22.275837]  ? srso_alias_return_thunk+0x5/0xfbef5
[   22.275839]  rm_init_adapter+0xad/0xc0 [nvidia]
[   22.275899]  nv_open_device+0x205/0xa00 [nvidia]
[   22.275951]  nvidia_open_deferred+0x38/0xa0 [nvidia]
[   22.276002]  _main_loop+0x90/0x150 [nvidia]
[   22.276058]  ? __pfx__main_loop+0x10/0x10 [nvidia]
[   22.276108]  kthread+0xf4/0x130
[   22.276111]  ? __pfx_kthread+0x10/0x10
[   22.276113]  ret_from_fork+0x31/0x50
[   22.276115]  ? __pfx_kthread+0x10/0x10
[   22.276117]  ret_from_fork_asm+0x1b/0x30
[   22.276121]  </TASK>
[   22.276122] NVRM: _kgspLogXid119: ********************************************************************************
[   22.276125] NVRM: nvCheckOkFailedNoLog: Check failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from rpcRecvPoll(pGpu, pRpc, NV_VGPU_MSG_EVENT_GSP_INIT_DONE) @ kernel_gsp.c:4106
[   22.276127] NVRM: nvAssertOkFailedNoLog: Assertion failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from kgspWaitForRmInitDone(pGpu, pKernelGsp) @ kernel_gsp_tu102.c:489
[   22.276139] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
[   22.277771] NVRM: GPU 0000:83:00.0: RmInitAdapter failed! (0x62:0x65:1860)
[   22.279008] NVRM: GPU 0000:83:00.0: rm_init_adapter failed, device minor number 0
[   22.439133] nvidia-uvm: Loaded the UVM driver, major device number 506.

My /etc/modprobe.d/nvidia.conf contains

options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
options nvidia NVreg_NvLinkDisable=1
options nvidia NVreg_EnableGpuFirmware=0

hansbogert commented 3 months ago

@save-buffer how is this related to the previous comments? The stack trace is different and your desktop GPU does not have NVLink.

save-buffer commented 3 months ago

The thread was about GSP being frozen, no?

neon-ninja commented 2 months ago

Upgrading to nvidia driver version 555.42.02 seems to fix this problem, without needing MIG.