Open taochenlove opened 3 months ago
Tracked internally as Bug 4290269
I am seeing the same error code NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR on GH200. Could these be related? Ubuntu 24.04 with NVIDIA's ghvirt 6.5.3-based kernel, driver 550.
[ 5.868764] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920bc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.870053] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920c0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.871339] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920c4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.872554] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920c8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.873768] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920cc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.875016] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920d0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.876228] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920e4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.877451] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920e8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.878672] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920ec, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.879958] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920f0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.881140] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920f4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.882292] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920f8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[ 5.883533] NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920fc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
Yes, this is the same bug which affects release 550 and later.
We are seeing the following error on an AMD system with an A100 40GB GPU, NVIDIA driver 550, and CUDA 12.4 (Ubuntu 22.04 LTS).
[ 37.333506] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88158, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
cc: @mengmeiye
Same bug Ubuntu 22.04 LTS
using NVIDIA-Linux-x86_64-560.35.03.run
GPU RTX 4090
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 7 3700X 8-Core Processor
Is this actually a problem, though? I see this in dmesg on bootup on many 8xH100 nodes (with quad xeon host) but it seems to work after this.
NVRM: gpuHandleSanityCheckRegReadError_GH100: Possible bad register read: addr: 0x920f0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
NVIDIA Open GPU Kernel Modules Version
560.28.03
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
Operating System and Version
Ubuntu 22.04 LTS
Kernel Release
5.15.0-25-generic
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
Hardware: GPU
NVIDIA A100-PCIE-40GB
Describe the bug
When running nvidia-smi, the following errors are printed:
(base) root@D11DJ-3410-01:~/chenct# nvidia-smi -L
[12919.827336] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88158, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827348] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88174, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827489] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x889d4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827624] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e2c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827628] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e30, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827632] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e34, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827636] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e38, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827639] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e3c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827642] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e40, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827646] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e44, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827649] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e48, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827652] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e4c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827656] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e50, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827659] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e54, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827662] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e58, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827665] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e5c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827668] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e60, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827671] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e64, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827674] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e68, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827677] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e6c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827680] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e70, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827682] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e74, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827685] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e78, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827689] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e7c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827691] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e80, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827694] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e84, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827697] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e88, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827699] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e8c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827703] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e90, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827705] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e94, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827708] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e98, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827711] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88e9c, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827714] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ea0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827717] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ea4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827720] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ea8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827722] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88eac, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827726] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88eb0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827729] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88eb4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827731] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88eb8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827734] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ebc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827737] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ec0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827740] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ec4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827743] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ec8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827746] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ecc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827749] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ed0, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827752] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ed4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827755] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88ed8, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827758] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88edc, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
[12919.827761] NVRM: gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x88fe4, regvalue: 0xbadf5040, error code: NV_PPRIV_SYS_PRI_ERROR_CODE_FECS_PRI_CLIENT_ERR
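A flood like the one above is easier to compare between machines when it is condensed into unique register addresses. The following is a minimal Python sketch, not an NVIDIA tool: it assumes only the line format quoted in this issue, the helper names `summarize` and `contiguous_ranges` are hypothetical, and the 4-byte stride used to group neighbouring addresses is an assumption about register width.

```python
import re
from collections import Counter

# Matches the "Possible bad register read" lines quoted in this issue.
# \s+ is used between tokens so that entries wrapped across lines still match.
LINE_RE = re.compile(
    r"NVRM:\s+(?P<func>gpuHandleSanityCheckRegReadError_\w+):\s+"
    r"Possible\s+bad\s+register\s+read:\s+"
    r"addr:\s+(?P<addr>0x[0-9a-fA-F]+),\s+"
    r"regvalue:\s+(?P<value>0x[0-9a-fA-F]+),\s+"
    r"error\s+code:\s+(?P<code>\w+)"
)


def summarize(log_text):
    """Return (Counter keyed by (func, regvalue, error code), sorted unique addresses)."""
    counts = Counter()
    addrs = set()
    for m in LINE_RE.finditer(log_text):
        counts[(m["func"], m["value"], m["code"])] += 1
        addrs.add(int(m["addr"], 16))
    return counts, sorted(addrs)


def contiguous_ranges(addrs, stride=4):
    """Group sorted addresses into (first, last) runs whose neighbours differ by `stride`.

    The default stride of 4 bytes is an assumption, not something stated in the logs.
    """
    ranges = []
    for a in addrs:
        if ranges and a - ranges[-1][1] == stride:
            ranges[-1][1] = a  # extend the current run
        else:
            ranges.append([a, a])  # start a new run
    return [(lo, hi) for lo, hi in ranges]
```

Passing saved `dmesg` output to `summarize` and the resulting address list to `contiguous_ranges` gives a short list of address runs that can be pasted into a report in place of the full log.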
To Reproduce
Install with "./cuda_12.6.0_560.28.03_linux.run -m=kernel-open". The errors appear after installing with this command.
Bug Incidence
Always
nvidia-bug-report.log.gz
none
More Info
No response