NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

gpuHandleSanityCheckRegReadError_GM107: Possible bad register read #688

Open taochenlove opened 3 months ago

taochenlove commented 3 months ago

NVIDIA Open GPU Kernel Modules Version

560.28.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

Ubuntu 22.04 LTS

Kernel Release

5.15.0-25-generic

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

NVIDIA A100-PCIE-40GB

Describe the bug

When running nvidia-smi, the following errors are printed:

(base) root@D11DJ-3410-01:~/chenct# nvidia-smi -L
[12919.827336] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88158, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827348] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88174, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827489] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x889d4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827624] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e2c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827628] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e30, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827632] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e34, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827636] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e38, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827639] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e3c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827642] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e40, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827646] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e44, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827649] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e48, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827652] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e4c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827656] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e50, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827659] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e54, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827662] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e58, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827665] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e5c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827668] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e60, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827671] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e64, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827674] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e68, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827677] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e6c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827680] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e70, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827682] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e74, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827685] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e78, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827689] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e7c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827691] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e80, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827694] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e84, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827697] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e88, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827699] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e8c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827703] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e90, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827705] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e94, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827708] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e98, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827711] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e9c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827714] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ea0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827717] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ea4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827720] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ea8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827722] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88eac, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827726] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88eb0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827729] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88eb4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827731] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88eb8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827734] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ebc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827737] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ec0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827740] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ec4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827743] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ec8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827746] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ecc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827749] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ed0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827752] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ed4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827755] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ed8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827758] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88edc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827761] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88fe4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
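
A log this repetitive is easier to compare across systems once it is condensed. The following is a minimal Python sketch, not part of the driver or of any NVIDIA tool, that pulls the fields out of messages in the format shown above and collapses the addresses into contiguous ranges. The names `parse_bad_reads` and `group_ranges` are hypothetical, and the 4-byte stride is an assumption based only on the address spacing visible in this log.

```python
import re

# Message format as it appears in this report; other driver versions may
# word it differently, so treat this pattern as an assumption.
PATTERN = re.compile(
    r"gpuHandleSanityCheckRegReadError_(\w+):\s+Possible bad register read:\s+"
    r"addr:\s+(0x[0-9a-fA-F]+),\s+regvalue:\s+(0x[0-9a-fA-F]+),\s+"
    r"error code:\s+(\w+)"
)


def parse_bad_reads(text):
    """Return (chip, addr, regvalue, error_code) tuples found in text."""
    return [
        (chip, int(addr, 16), int(value, 16), code)
        for chip, addr, value, code in PATTERN.findall(text)
    ]


def group_ranges(addrs, stride=4):
    """Collapse addresses into (first, last) runs spaced exactly `stride` apart."""
    runs = []
    for addr in sorted(set(addrs)):
        if runs and addr - runs[-1][1] == stride:
            runs[-1][1] = addr  # extend the current run
        else:
            runs.append([addr, addr])  # start a new run
    return [(first, last) for first, last in runs]


# Three lines copied from the log above, used as sample input.
SAMPLE = """\
[12919.827624] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e2c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827628] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e30, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827761] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88fe4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
"""

reads = parse_bad_reads(SAMPLE)
for first, last in group_ranges(addr for _, addr, _, _ in reads):
    print(f"{first:#x}-{last:#x}")
# prints:
# 0x88e2c-0x88e30
# 0x88fe4-0x88fe4
```

Passing the full text of a saved dmesg capture to `parse_bad_reads` instead of `SAMPLE` would give the same kind of summary for a whole boot.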

To Reproduce

Install with "./cuda_12.6.0_560.28.03_linux.run -m=kernel-open". The errors appear after the installation completes.

Bug Incidence

Always

nvidia-bug-report.log.gz

none

More Info

No response

gauravjuvekar commented 3 months ago

Tracked internally as Bug 4290269

drastx commented 2 months ago

I am seeing the same error code NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR on GH200. Could these be related? The system runs Ubuntu 24.04 with NVIDIA's ghvirt 6.5.3-based kernel and driver 550.

[ 5.868764] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920bc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.870053] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920c0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.871339] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920c4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.872554] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920c8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.873768] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920cc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.875016] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920d0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.876228] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920e4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.877451] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920e8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.878672] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920ec, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.879958] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920f0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.881140] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920f4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.882292] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920f8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.883533] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920fc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR

gauravjuvekar commented 2 months ago

Yes, this is the same bug which affects release 550 and later.

apoorvemohan commented 1 month ago

We are seeing the following error on an AMD-based A100 40GB system with NVIDIA driver 550 and CUDA 12.4 (Ubuntu 22.04 LTS).

[   37.333506] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88158,  regvalue: 0xbadf5040,  error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR

cc: @mengmeiye

levipereira commented 1 month ago

Same bug Ubuntu 22.04 LTS

using NVIDIA-Linux-x86_64-560.35.03.run

GPU RTX 4090

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 7 3700X 8-Core Processor

jwatte commented 6 days ago

Is this actually a problem, though? I see this in dmesg on bootup on many 8xH100 nodes (with quad xeon host) but it seems to work after this.

NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920f0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
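
One way to check whether the messages really are confined to bootup is to look at the span of their dmesg timestamps. The snippet below is a rough Python sketch under that assumption: it only recognizes lines carrying a `[seconds.micros]` prefix like the timestamped logs earlier in this thread, and the function name `bad_read_window` is hypothetical. It says nothing about whether the underlying register reads are harmful.

```python
import re

# Timestamped form of the message, as seen in dmesg output in this thread.
STAMPED = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+NVRM:\s+gpuHandleSanityCheckRegReadError_\w+:"
)


def bad_read_window(text):
    """Return (count, first_seconds, last_seconds), or None if nothing matched."""
    stamps = [float(s) for s in STAMPED.findall(text)]
    if not stamps:
        return None
    return len(stamps), min(stamps), max(stamps)


# Two lines copied from the GH200 log earlier in the thread.
SAMPLE = """\
[ 5.868764] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920bc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.883533] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920fc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
"""

print(bad_read_window(SAMPLE))
# prints: (2, 5.868764, 5.883533)
```

If the last timestamp stays within the first seconds of uptime across a long-running node, the messages are at least not recurring during normal operation.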