NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

"NVRM RmInitAdapter: Cannot initialize GSP firmware RM" error found #673

Open jacksonsshen opened 2 weeks ago

jacksonsshen commented 2 weeks ago

NVIDIA Open GPU Kernel Modules Version

520.56.06

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

Ubuntu 20.04.6 LTS

Kernel Release

5.10.14

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

NVIDIA GeForce RTX 3080

Describe the bug

We deployed an Ubuntu machine with the NVIDIA Open GPU Kernel Modules driver, version 520. The machine frequently reports errors during GPU initialization. The errors are as follows:

NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
2024-07-02 18:46:08.681559 kernel:[ 21.766727] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
2024-07-02 18:46:08.681562 kernel:[ 21.766734] NVRM nvAssertFailedNoLog: Assertion failed: rmStatus == NV_OK @ osinit.c:1982

ecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 1731.589314] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 1731.589317] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 1731.589323] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 1731.591779] NVRM: GPU 0000:86:00.0: RmInitAdapter failed! (0x63:0xffff:1684)
[ 1731.593977] NVRM: GPU 0000:86:00.0: rm_init_adapter failed, device minor number 0
[ 1731.777872] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 1731.777876] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 1731.800951] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 1731.800957] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 1731.800963] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 1731.800965] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 1731.800970] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 1731.803388] NVRM: GPU 0000:af:00.0: RmInitAdapter failed! (0x63:0xffff:1684)
[ 1731.805517] NVRM: GPU 0000:af:00.0: rm_init_adapter failed, device minor number 1
[ 1731.989155] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 1731.989160] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 1732.012716] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 1732.012722] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
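
Since the failure is intermittent and affects more than one GPU (0000:86:00.0 and 0000:af:00.0 in the log above), it may help to count how often each device hits `RmInitAdapter failed!` across boots. The following is only a rough sketch, not part of the driver or of any NVIDIA tooling: the function name `count_init_failures` and the regular expression are assumptions made for illustration, based solely on the message format seen in this report.

```python
import re
from collections import Counter

# Assumed message format, taken from the log in this report, e.g.:
#   NVRM: GPU 0000:86:00.0: RmInitAdapter failed! (0x63:0xffff:1684)
INIT_FAILED = re.compile(
    r"NVRM: GPU (?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"RmInitAdapter failed!"
)


def count_init_failures(log_text):
    """Map each PCI address to its number of 'RmInitAdapter failed!' messages."""
    return Counter(m.group("bdf") for m in INIT_FAILED.finditer(log_text))


if __name__ == "__main__":
    # Two lines copied from the kernel log in this report.
    sample = (
        "[ 1731.591779] NVRM: GPU 0000:86:00.0: RmInitAdapter failed! (0x63:0xffff:1684)\n"
        "[ 1731.803388] NVRM: GPU 0000:af:00.0: RmInitAdapter failed! (0x63:0xffff:1684)\n"
    )
    for bdf, count in sorted(count_init_failures(sample).items()):
        print(bdf, count)
```

In practice the input would be saved kernel log text (for example, the output of `dmesg` written to a file) rather than the inline sample.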

To Reproduce

Install the 520.56.06 open-source NVIDIA driver and boot the machine.

Bug Incidence

Sometimes

nvidia-bug-report.log.gz

More Info

No response

ptr1337 commented 2 weeks ago

I think you should also try this with newer versions, since 520 is no longer supported.

There are:

Branches.