NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

Version difference with the proprietary driver causes csgo to crash #272

Closed Sumandora closed 2 years ago

Sumandora commented 2 years ago

NVIDIA Open GPU Kernel Modules Version

515.43.04-7 installed via https://archlinux.org/packages/extra/x86_64/nvidia-open/

Does this happen with the proprietary driver (of the same version) as well?

No

Operating System and Version

Arch Linux / KDE Plasma

Kernel Release

Linux arch 5.17.9-arch1-1 #1 SMP PREEMPT Wed, 18 May 2022 17:30:11 +0000 x86_64 GNU/Linux

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 2060

Describe the bug

After csgo has been running for a while, it crashes. The crash also freezes the entire Xorg server.

To Reproduce

I'm not able to trigger the crash on demand, but I have never made it past the freeze time, so reproducing it is very easy.

Bug Incidence

Always

nvidia-bug-report.log.gz

I'm sadly unable to provide this, due to having switched to the proprietary driver for now. But I'm able to provide the following: The csgo crash:

[  283.236321] csgo_linux64[1476]: segfault at 0 ip 00007faae5c9396b sp 00007faaca50a140 error 6 in client_client.so[7faae4e00000+1dfa000]
[  283.236343] Code: c7 83 c8 1f 01 00 00 00 00 00 44 8b a3 a8 0d 01 00 4c 8d ab c0 0c 01 00 e8 a2 78 ab ff 8b 70 20 48 8b 08 49 63 d4 48 c1 e2 06 <89> 34 11 48 8b 08 48 c7 44 11 08 00 00 00 00 44 89 60 20 48 8d 83

A bunch of errors before csgo finally crashing:

[  274.920300] NVRM: Xid (PCI:0000:29:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[  274.920302] NVRM _issueRpcAndWait: rpcRecvPoll timedout for fn 10!
[  274.920302] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ rpc.c:200
[  274.920304] NVRM rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d0003f; hObject=0xbeee502d; paramsStatus=0x00000000; status=0x00000065
[  274.920305] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:805
[  274.920308] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:155
[  274.920312] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:995
[  274.920319] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 32dccf22f4de01 >= 32dccf22c202bc
[  274.920320] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[  274.920321] NVRM: Xid (PCI:0000:29:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[  274.920323] NVRM _issueRpcAndWait: rpcRecvPoll timedout for fn 10!
[  274.920324] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ rpc.c:200
[  274.920325] NVRM rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d0003f; hObject=0xbeee3901; paramsStatus=0x00000000; status=0x00000065
[  274.920326] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:805
[  274.920384] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:155
[  274.920388] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:995
[  274.920396] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 32dccf22f4de01 >= 32dccf22c202bc
[  274.920397] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[  274.920398] NVRM: Xid (PCI:0000:29:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[  274.920400] NVRM _issueRpcAndWait: rpcRecvPoll timedout for fn 10!
[  274.920401] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ rpc.c:200
[  274.920402] NVRM rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d0003f; hObject=0xbeee0100; paramsStatus=0x00000000; status=0x00000065
[  278.942591] NVRM _kgspProcessRpcEvent: Unexpected RPC function 0xa
[  278.942595] NVRM _kgspProcessRpcEvent: Unexpected RPC function 0xa
[  278.942597] NVRM _kgspProcessRpcEvent: Unexpected RPC function 0xa
[  278.942598] NVRM _kgspProcessRpcEvent: Unexpected RPC function 0xa
[  278.942599] NVRM _kgspProcessRpcEvent: Unexpected RPC function 0xa
[  278.942601] NVRM _kgspProcessRpcEvent: Unexpected RPC function 0xa
[  282.922540] NVRM _kgspRpcDrainOneEvent: Failed to process received event 4100: status=0x1f
[  282.922546] NVRM _issueRpcAndWait: rpcRecvPoll failed with status 0x0000001f for fn 4100!
[  282.922547] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ rpc.c:200
[  282.922565] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:0:0:0x0000001f
[  282.928478] NVRM _kgspProcessRpcEvent: Unexpected RPC function 0xa
[  282.959818] NVRM _kgspProcessRpcEvent: Unexpected RPC function 0x4c

More Info

I looked at the disassembly of the code region which crashed the game. There were 3 MOV instructions. The third being the one which crashed the game. It looked like a class or array which is being written to, but the method does not have any strings nearby so I'm unsure about what is actually happening.
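The segfault line quoted above can be decoded a little further without a disassembler. The sketch below is only an illustration: the function and field names are made up for this example. It computes the offset of the faulting instruction inside the mapped region of client_client.so and unpacks the x86 page-fault error code (bit 0: page present, bit 1: write access, bit 2: user mode). The offset is relative to the start of the mapping reported by the kernel, which matches the file offset only if that mapping starts at the beginning of the file.

```python
import re

# Pattern for the kernel's user-space segfault report, as it appears in dmesg.
# All numeric fields, including the error code, are printed in hexadecimal.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<module>[^\[\s]+)"
    r"\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return a dict describing a dmesg segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "module": m["module"],
        "fault_address": int(m["addr"], 16),
        "module_offset": ip - base,      # offset of the instruction in the mapping
        "page_present": bool(err & 1),   # bit 0: fault on a present page
        "write": bool(err & 2),          # bit 1: the access was a write
        "user_mode": bool(err & 4),      # bit 2: the fault happened in user mode
    }


line = (
    "[  283.236321] csgo_linux64[1476]: segfault at 0 ip 00007faae5c9396b "
    "sp 00007faaca50a140 error 6 in client_client.so[7faae4e00000+1dfa000]"
)
info = decode_segfault(line)
print(hex(info["module_offset"]), info)
```

For this line the result is a user-mode write to address 0 on a non-present page, which is consistent with the third MOV being a store through a null pointer.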

The only "fix" I know of is to switch to the proprietary driver.

LethalManBoob commented 2 years ago

I don't know if this warrants its own thread, but the open source DKMS drivers for NVIDIA cause OpenGL games to soft-lock my whole PC. The whole PC runs at about one frame per minute and I need to hard shut down. The normal drivers have no such issue. RTX 3070, Arch Linux, Ryzen 3800X.

I'll give more detailed info if needed, but I'm not installing it again and risking issues with my hardware due to the restarts.

TheBill2001 commented 2 years ago

cause opengl games to softlock my whole pc

Seems to be the same problem. I also get error messages similar to @Sumandora's while playing Factorio.

Sumandora commented 2 years ago

I don't know if this warrants its own thread, but the open source DKMS drivers for NVIDIA cause OpenGL games to soft-lock my whole PC. The whole PC runs at about one frame per minute and I need to hard shut down. The normal drivers have no such issue. RTX 3070, Arch Linux, Ryzen 3800X.

I'll give more detailed info if needed, but I'm not installing it again and risking issues with my hardware due to the restarts.

Restarting is not necessary. Press Ctrl+Alt+Backspace. This hotkey "zaps" the entire Xorg server, practically restarting it. Then run dmesg and check whether you got similar errors. If so, it is likely that you have the same problem.
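For anyone who wants to check their own dmesg output for the same signature, here is a minimal sketch. It assumes the log text has already been captured into a string (for example by saving the output of dmesg to a file), and the helper name `find_gsp_timeouts` is made up for this example. It only looks for the Xid 119 "Timeout waiting for RPC from GSP" lines shown in the original report.

```python
import re

# Matches Xid report lines of the form quoted in this thread, e.g.
# "NVRM: Xid (PCI:0000:29:00): 119, pid=..., Timeout waiting for RPC from GSP! ..."
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+),\s*(?P<rest>.*)"
)


def find_gsp_timeouts(log_text):
    """Return (pci_address, line) pairs for every Xid 119 GSP RPC timeout."""
    hits = []
    for line in log_text.splitlines():
        m = XID_RE.search(line)
        if m and m["code"] == "119" and "Timeout waiting for RPC from GSP" in m["rest"]:
            hits.append((m["pci"], line.strip()))
    return hits


# Two lines taken from the original report: one Xid line and one unrelated line.
sample = """\
[  274.920300] NVRM: Xid (PCI:0000:29:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[  274.920302] NVRM _issueRpcAndWait: rpcRecvPoll timedout for fn 10!
"""
hits = find_gsp_timeouts(sample)
for pci, line in hits:
    print(pci, "->", line)
```

An empty result does not rule out the bug; it only means this particular message is absent from the text that was scanned.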

amrit1711 commented 2 years ago

@Sumandora Thanks for reporting the issue. I have filed bug 3664227 internally for tracking purposes. I will try to reproduce the issue locally first and will get back to you if any further information is required.

rnd-ash commented 2 years ago

I've noticed the same with other Source engine games such as Black Mesa and TF2.

[21155.483662] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 32efcc9f0bedc9 >= 32efcc9ed91284
[21155.483666] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[21155.483672] NVRM: Xid (PCI:0000:01:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[21155.483676] NVRM _issueRpcAndWait: rpcRecvPoll timedout for fn 10!
[21155.483677] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ rpc.c:200
[21155.483681] NVRM rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d000cd; hObject=0xbeee4901; paramsStatus=0x00000000; status=0x00000065
[21155.483684] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:805
[21155.483703] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:155
[21155.483709] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:995
[21155.483722] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 32efcc9f0bedc9 >= 32efcc9ed91284
[21155.483722] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[21155.483723] NVRM: Xid (PCI:0000:01:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[21155.483725] NVRM _issueRpcAndWait: rpcRecvPoll timedout for fn 10!
[21155.483726] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ rpc.c:200
[21155.483728] NVRM rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d000cd; hObject=0xbeeea0b5; paramsStatus=0x00000000; status=0x00000065
[21155.483728] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:805
[21155.483732] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:155
[21155.483735] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:995
[21155.483847] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 32efcc9f0bedc9 >= 32efcc9ed91284
[21155.483849] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[21155.483852] NVRM: Xid (PCI:0000:01:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[21155.483854] NVRM _issueRpcAndWait: rpcRecvPoll timedout for fn 10!
[21155.483855] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ rpc.c:200
[21155.483857] NVRM rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d00008; hObject=0xb00f000e; paramsStatus=0x00000000; status=0x00000065
[21155.483864] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 32efcc9f0bedc9 >= 32efcc9ed91284
[21155.483865] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[21155.483866] NVRM: Xid (PCI:0000:01:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[21155.483868] NVRM _issueRpcAndWait: rpcRecvPoll timedout for fn 10!
[21155.483868] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ rpc.c:200
[21155.483870] NVRM rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d000cd; hObject=0xbeee9097; paramsStatus=0x00000000; status=0x00000065
[21155.483871] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:805
[21155.483881] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:155
[21155.483884] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:995
[21155.483891] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 32efcc9f0bedc9 >= 32efcc9ed91284
[21155.483892] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[21155.483893] NVRM: Xid (PCI:0000:01:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[21155.483894] NVRM _issueRpcAndWait: rpcRecvPoll timedout for fn 10!
[21155.483894] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ rpc.c:200
[21155.483896] NVRM rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d000cd; hObject=0xbeee0404; paramsStatus=0x00000000; status=0x00000065
[21155.483905] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 32efcc9f0bedc9 >= 32efcc9ed91284
[21155.483905] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[21155.483906] NVRM: Xid (PCI:0000:01:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[21155.483907] NVRM _issueRpcAndWait: rpcRecvPoll timedout for fn 10!
[21155.483908] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ rpc.c:200
[21155.483910] NVRM rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d000cd; hObject=0xbeee502d; paramsStatus=0x00000000; status=0x00000065
[21155.483910] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:805
[21155.483914] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:155
[21155.483916] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:995
[21155.483921] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 32efcc9f0bedc9 >= 32efcc9ed91284
[21155.483922] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[21155.483922] NVRM: Xid (PCI:0000:01:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[21155.483923] NVRM _issueRpcAndWait: rpcRecvPoll timedout for fn 10!
[21155.483924] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ rpc.c:200
[21155.483925] NVRM rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d000cd; hObject=0xbeee3901; paramsStatus=0x00000000; status=0x00000065
[21155.483926] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:805
[21155.483984] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:155
[21155.483987] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:995
[21155.483994] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 32efcc9f0bedc9 >= 32efcc9ed91284
[21155.483995] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[21155.483996] NVRM: Xid (PCI:0000:01:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[21155.483997] NVRM _issueRpcAndWait: rpcRecvPoll timedout for fn 10!
[21155.483997] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ rpc.c:200
[21155.483999] NVRM rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d000cd; hObject=0xbeee0100; paramsStatus=0x00000000; status=0x00000065
[21162.086847] GpuWatchdog[42469]: segfault at 0 ip 00007f8cc115431f sp 00007f8cb6ffd6a0 error 6 in libcef.so[7f8cbcf19000+6f5e000]
[21162.086855] Code: 89 de e8 74 30 8e fe 80 7d cf 00 79 09 48 8b 7d b8 e8 e5 d5 d1 02 41 8b 84 24 e0 00 00 00 89 45 b8 48 8d 7d b8 e8 e1 51 dc fb <c7> 04 25 00 00 00 00 37 13 00 00 48 83 c4 38 5b 41 5c 41 5d 41 5e
[21162.086891] audit: type=1701 audit(1654003774.386:330): auid=1000 uid=1000 gid=1000 ses=3 pid=42465 comm="GpuWatchdog" exe="/home/ashcon/.local/share/Steam/ubuntu12_64/steamwebhelper" sig=11 res=1
[21162.091176] audit: type=1334 audit(1654003774.392:331): prog-id=44 op=LOAD
[21162.091226] audit: type=1334 audit(1654003774.392:332): prog-id=45 op=LOAD
[21162.091240] audit: type=1334 audit(1654003774.392:333): prog-id=46 op=LOAD
[21162.124166] audit: type=1130 audit(1654003774.426:334): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@2-42545-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[21162.629143] audit: type=1131 audit(1654003774.929:335): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@2-42545-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[21162.753905] audit: type=1334 audit(1654003775.056:336): prog-id=0 op=UNLOAD
[21162.753910] audit: type=1334 audit(1654003775.056:337): prog-id=0 op=UNLOAD
[21162.753911] audit: type=1334 audit(1654003775.056:338): prog-id=0 op=UNLOAD
[21163.555733] NVRM nvCheckFailedNoLog: Check failed: pKernelChannel != NULL @ kernel_gsp.c:274
[21163.555736] NVRM _kgspRpcDrainOneEvent: Failed to process received event 4100: status=0x21
[21163.555739] NVRM _issueRpcAndWait: rpcRecvPoll failed with status 0x00000021 for fn 4100!
[21163.555740] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ rpc.c:200
[21163.555758] NVRM rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d000d8; hObject=0xbeef0013; paramsStatus=0x00000001; status=0x00000021
[21163.555760] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ mem.c:171
[21163.591402] NVRM _kgspProcessRpcEvent: Unexpected RPC function 0xa
[21163.591404] NVRM _kgspProcessRpcEvent: Unexpected RPC function 0xa
[21163.592269] NVRM _kgspProcessRpcEvent: Unexpected RPC function 0xa
[21163.593361] NVRM _kgspProcessRpcEvent: Unexpected RPC function 0xa
[21163.594128] NVRM _kgspProcessRpcEvent: Unexpected RPC function 0xa
[21163.594337] NVRM _kgspProcessRpcEvent: Unexpected RPC function 0xa
[21163.594413] NVRM _kgspProcessRpcEvent: Unexpected RPC function 0xa
[21163.594503] NVRM _kgspProcessRpcEvent: Unexpected RPC function 0xa
[21163.594600] NVRM _kgspProcessRpcEvent: Unexpected RPC function 0xa
[21163.611528] NVRM serverFreeResourceTree: hObject 0xbeee0400 not found for client 0xc1d000cd
[21163.611537] NVRM serverFreeResourceTree: hObject 0xbeee0401 not found for client 0xc1d000cd
[21163.611547] NVRM serverFreeResourceTree: hObject 0xbeee0402 not found for client 0xc1d000cd
[21163.611556] NVRM serverFreeResourceTree: hObject 0xbeee0403 not found for client 0xc1d000cd

mtijanic commented 2 years ago

For some reason, the RPC to clean up some resources does not finish in time. This probably means that GSP got stuck in an infinite loop or otherwise crashed, but it could be that it just needs to do way too much to free it.

Could you maybe reproduce with NVreg_RmMsg=":" to capture more debug logs?

Also, do you perhaps know if in the minute leading up to the crash there was a graphics-heavy process shutdown? Or maybe a particular change in the game, such as a level reload? Something that could trigger a cleanup of a large resource tree. One way we have seen this happen in the past is if a GPU-superheavy process gets a SIGKILL and the driver attempts to reap all the allocations.

LethalManBoob commented 2 years ago

In my case I can say that it was completely random, and it happened with everything from Counter-Strike: Source to Cruelty Squad.

Sumandora commented 2 years ago

For some reason, the RPC to clean up some resources does not finish in time. This probably means that GSP got stuck in an infinite loop or otherwise crashed, but it could be that it just needs to do way too much to free it.

Could you maybe reproduce with NVreg_RmMsg=":" to capture more debug logs?

Also, do you perhaps know if in the minute leading up to the crash there was a graphics-heavy process shutdown? Or maybe a particular change in the game, such as a level reload? Something that could trigger a cleanup of a large resource tree. One way we have seen this happen in the past is if a GPU-superheavy process gets a SIGKILL and the driver attempts to reap all the allocations.

I can't test until I get home, but in my experience crashes happen as soon as I get transferred from spectator to actual player. I also got a crash as soon as I got shot. So that could possibly be it. After a few rather unorthodox attempts to learn more about the assembly referenced in dmesg, I found that the crash apparently happened when the freeze time effect, or something like it, was loaded. I am not sure I analysed that correctly, though: my approach was to take a source leak of csgo and search for the bytes, but they were not there, maybe because of compiler or version differences (the leak is from 2018). I tried to find a connection between the versions regardless. So tl;dr: it could very well be that.

Posting full crashlog when home

amrit1711 commented 2 years ago

@Sumandora I also tried reproducing the issue on the configuration below but did not observe any crash after playing CSGO for an hour: Arch Linux + 5.18.1-arch1-1 + NVIDIA GeForce RTX 2080 + Driver 515.48.07 with --no-kernel-modules + DELL U2412M

Can you please confirm the repro frequency and share an nvidia bug report? Also let me know if you use any specific game or display settings.

Sumandora commented 2 years ago

I used the logging argument. I also increased my Kernel Log size to 64M, because my first attempt failed.
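The thread does not say how these two settings were applied. One common way is the kernel command line, where `log_buf_len=64M` sizes the kernel log buffer and a module parameter can be written as `module.parameter=value`. The sketch below assumes that route; the example command line is hypothetical, and only the `64M` and `:` values come from this thread. It parses a command-line string to confirm both settings are present.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Bare flags such as "quiet" map to None. Quoted values that contain
    spaces are not handled by this simple split.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


# Hypothetical command line for illustration only.
example = "quiet log_buf_len=64M nvidia.NVreg_RmMsg=:"
params = parse_cmdline(example)
print("log buffer size:", params.get("log_buf_len"))
print("RmMsg setting:", params.get("nvidia.NVreg_RmMsg"))
```

On a running system the same function could be fed the contents of /proc/cmdline instead of the example string.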

Full crash log: https://gist.github.com/Sumandora/9d86eb639d56116c2ee32207b49a0a44 I have included 3000 lines +/- to the actual segfault.

@Sumandora I also tried reproducing the issue on the configuration below but did not observe any crash after playing CSGO for an hour: Arch Linux + 5.18.1-arch1-1 + NVIDIA GeForce RTX 2080 + Driver 515.48.07 with --no-kernel-modules + DELL U2412M

Can you please confirm the repro frequency and share an nvidia bug report? Also let me know if you use any specific game or display settings.

Here is a bit more system information, which I already provided in the csgo issue tracker issue: https://gist.github.com/Sumandora/3d5d858d62d08addbd36a7ea293adc74 I use Btrfs, KDE Plasma, a Ryzen 5 2600, an RTX 2060 and 16 GB of memory. I have two monitors, both of them 4K. My Arch Linux, like the game, is installed on an SSD.

Some of these things might have an impact... But yes, the bug is still reproducible and is not fixed.

mtijanic commented 2 years ago

Thanks! From just before the crash:

[  140.879819] NVRM RmFreeUnusedClients: freeing abandoned client 0xc1d0001a
[  140.879821] NVRM rmapiFreeClientListWithSecInfo: Nv01FreeClientList: numClients: 1

means that some process has closed its /dev/nvidia* fd (either via close() or just by dying and OS reaping the fd) without going through the proper teardown process; and for some reason this breaks GSP.

I have included 3000 lines +/- to the actual segfault.

Could you maybe upload the full log file? It's very verbose and the 6k does not provide enough context to see what client 0xc1d0001a is, or what it has allocated that is now problematic to free/reap. With the full log we can get an audit trail for the client in question.
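As a rough illustration of what such an audit trail could look like, the sketch below filters a list of log lines down to those mentioning one client handle. It is only a plain text filter with a made-up helper name, not the tooling actually used on the full log.

```python
import re


def client_audit_trail(log_lines, client_handle):
    """Return, in order, the log lines that mention the given RM client handle."""
    # The trailing word boundary keeps 0xc1d0001a from matching a longer
    # handle that merely starts with the same digits.
    pattern = re.compile(re.escape(client_handle) + r"\b", re.IGNORECASE)
    return [line for line in log_lines if pattern.search(line)]


# Lines quoted in this thread; only the first one refers to client 0xc1d0001a.
sample = [
    "[  140.879819] NVRM RmFreeUnusedClients: freeing abandoned client 0xc1d0001a",
    "[  140.879821] NVRM rmapiFreeClientListWithSecInfo: Nv01FreeClientList: numClients: 1",
    "[  161.064792] NVRM rmapiFreeWithSecInfo: Nv01Free: client:0xc1d0005e object:0xbeee0100",
]
trail = client_audit_trail(sample, "0xc1d0001a")
for line in trail:
    print(line)
```

With only a few thousand lines of context the trail is short, which is why the complete log from boot is needed to see what the client allocated.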

Sumandora commented 2 years ago

dmesg.tar.gz I had to increase my kernel log length even more, because 64 megabytes also wasn't enough to capture the log from the beginning. It is now 1 gigabyte. Here is the complete log you requested. The file is about 17 MB compressed and 170 MB uncompressed.

mtijanic commented 2 years ago

Thank you, this is super helpful! Seems like this time it was different, and it's not a reap of an unused client, but a more controlled free, but still same problem:

[  161.064792] NVRM rmapiFreeWithSecInfo: Nv01Free: client:0xc1d0005e object:0xbeee0100
[  161.067693] NVRM rmapiUnmapFromCpuWithSecInfo: Nv04UnmapMemory: client:0xc1d00018 device:0xbeef0004 memory:0xcaf0004b pLinearAddr:144929000 flags:0x0
[  163.075327] NVRM threadStateYieldCpuIfNecessary: Yielding
[  165.069620] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 32dcb567d964d7 >= 32dcb567a68992
[  165.069624] NVRM _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
[  165.069629] NVRM: GPU at PCI:0000:29:00: GPU-f37e16b7-7bf3-5af6-0ca7-1d588cffdd1c
[  165.069631] NVRM: Xid (PCI:0000:29:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
[  165.069634] NVRM _issueRpcAndWait: rpcRecvPoll timedout for fn 10!
[  165.069635] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ rpc.c:200
[  165.069638] NVRM rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d0005e; hObject=0xbeeea0b5; paramsStatus=0x00000000; status=0x00000065

It is trying to release a DMA channel, which is probably still in use, and for some reason GSP is unable to stop it within the allotted 4s timeslice.
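The 4 s figure can be sanity-checked against the excerpt itself by subtracting the dmesg timestamp of the Nv01Free call from that of the Xid 119 report. A minimal sketch, with a made-up helper name and using only lines quoted above:

```python
import re

# dmesg prefixes each line with "[seconds.microseconds]" since boot.
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def timestamp(line):
    """Return the dmesg timestamp of a line in seconds, or None if absent."""
    m = TS_RE.match(line)
    return float(m.group(1)) if m else None


free_line = (
    "[  161.064792] NVRM rmapiFreeWithSecInfo: Nv01Free: "
    "client:0xc1d0005e object:0xbeee0100"
)
xid_line = (
    "[  165.069631] NVRM: Xid (PCI:0000:29:00): 119, pid='<unknown>', "
    "name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0)."
)

gap = timestamp(xid_line) - timestamp(free_line)
print(f"free issued to timeout reported: {gap:.6f} s")  # about 4.005 s
```

The gap of roughly 4.005 s lines up with the "Timeout was set to: 4000 msecs" message in the same excerpt.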

I'll pass this on into the internal bug to the engineers more familiar with these code paths.

PAR2020 commented 2 years ago

@Sumandora, can we get a nvidia-bug-report.log on this one with the open source driver? More data is needed from it to close with our devs. @mtijanic, would it be helpful to get all rm messages also for Nikita?

Thanks guys!

Sumandora commented 2 years ago

@Sumandora, can we get a nvidia-bug-report.log on this one with the open source driver? More data is needed from it to close with our devs. @mtijanic, would it be helpful to get all rm messages also for Nikita?

Thanks guys!

nvidia-bug-report.log.gz Does this help? ^^ If I have to do this again with the verbose logging, tell me and I'm going to do it as fast as possible

DonielMoins commented 2 years ago

I am having the same issue here as well, where can I find the log files to help out?

dylif commented 2 years ago

I am having the same issue here: Kernel: 5.17.7-arch1-1 Driver: nvidia-open 515.43.04

Seems like something in the kernel module is locking up whenever I run an OpenGL game (e.g. Minecraft), as there are a lot of reports of timeouts. Strangely, running a Vulkan game (GTA 5 through Proton and DXVK) works perfectly.

This does not happen with the proprietary kernel modules; both OpenGL and Vulkan games work perfectly.

Please note I have tried adding the nvidia-drm.modeset=1 kernel parameter which changed nothing.

Attached is a snippet of a log that I think is of interest. nvidia.log

Hopefully this helps

andrewathalye commented 2 years ago

I can reproduce the exact same symptoms and errors on 5.15-41-gentoo with NVIDIA-open 515.48.07. I too have only observed it in CSGO Linux native and Minecraft - every other game I've played has not caused it, and neither have Wayland-native EGL applications that I tried.

T-D3V commented 2 years ago

I have this issue too when trying to launch Minecraft right now:

NVRM: Xid (PCI:0000:09:00): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).

I'm using the current linux-zen kernel and nvidia-open-dkms on Arch Linux.

Can confirm works like a charm with closed-source drivers.

Sumandora commented 2 years ago

Just in case somebody is curious: the new version (515.57) hasn't fixed the problem.

frznfngrs commented 2 years ago

I am also experiencing this error with RTX 3090 with ML/AI workloads, 515.43.04 CUDA 11.7

PAR2020 commented 2 years ago

Hi all - we have a fix and are getting it integrated. Will let you know which driver release it's in.

Sumandora commented 2 years ago

Hi all - we have a fix and are getting it integrated. Will let you know which driver release it's in.

Sounds awesome. I'm hyped.

frznfngrs commented 2 years ago

Any update on this?

ngolovliov-nv commented 2 years ago

Does the latest driver release, 515.65, address the issue for you?

andrewathalye commented 2 years ago

The latest driver release appears to resolve crashing issues for me. I have not yet tested all apps that formerly crashed, but several no longer do.

Sumandora commented 2 years ago

Does the latest driver release, 515.65, address the issue for you?

I just tested it, it works. No more crashes. So I guess this can be closed. Thanks for patching the problems, now we can go crazy, gaming on the kernel-open drivers :+1:

bashirmindee commented 1 year ago

I re-experienced the issue with nvidia driver 515.65.01 and cuda 11.7 running TF2:

ubuntu@ip-172-31-72-137:~$ grep -i nvrm /var/log/syslog
Dec 14 09:15:17 ip-172-31-72-137 kernel: [1028046.649510] NVRM: Xid (PCI:0000:00:1e): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_CONTROL (0x0 0x0).
Dec 14 09:15:23 ip-172-31-72-137 kernel: [1028052.125825] NVRM: Xid (PCI:0000:00:1e): 120, pid='<unknown>', name=<unknown>, GSP Error: Task 1 raised error code 0xd for reason 0x0 at 0x5b119d4 (0 more errors skipped)

ghost commented 1 year ago

I'm still experiencing this on 525.85.05