NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

IOMMU error on ppc64le #683

Closed riptl closed 3 months ago

riptl commented 3 months ago

NVIDIA Open GPU Kernel Modules Version

448d5cc65624d3aa69015efa0d3fb50fd9729f41

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

Fedora Linux 40 (Server Edition)

Kernel Release

Linux p9l1 6.9.8-200.fc40.ppc64le #1 SMP Fri Jul 5 15:53:24 UTC 2024 ppc64le GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

RTX 2060 (can't run nvidia-smi)

Describe the bug

[ 3661.083900] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 3661.211951] NVRM: memdescMapIommu: 0x8000000d2ff0000-0x8000000d2ff0fff is not addressable by GPU 0x100 [0x0-0x7fffffffffff]
[ 3661.212689] NVRM: nvCheckOkFailedNoLog: Check failed: Address not valid [NV_ERR_INVALID_ADDRESS] (0x0000001E) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1348
[ 3661.213413] NVRM: memdescMapIommu: 0x8000000d2ff0000-0x8000000d2ff0fff is not addressable by GPU 0x100 [0x0-0x7fffffffffff]
[ 3661.214125] NVRM: nvCheckOkFailedNoLog: Check failed: Address not valid [NV_ERR_INVALID_ADDRESS] (0x0000001E) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1348
[ 3661.214851] NVRM: nvAssertFailedNoLog: Assertion failed: pKernelMemorySystem->sysmemFlushBuffer != 0 @ kern_mem_sys_gm107.c:382
[ 3661.215725] NVRM: memdescMapIommu: 0x8000000d2ff0000 is not addressable by GPU 0x100 [0x0-0x7fffffffffff]
[ 3661.216452] NVRM: nvCheckOkFailedNoLog: Check failed: Address not valid [NV_ERR_INVALID_ADDRESS] (0x0000001E) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1348
[ 3661.217192] NVRM: nvAssertOkFailedNoLog: Assertion failed: Address not valid [NV_ERR_INVALID_ADDRESS] (0x0000001E) returned from nvStatus @ message_queue_cpu.c:241
[ 3661.217936] NVRM: _kgspInitRpcInfrastructure: GspMsgQueueInit failed
[ 3661.218664] NVRM: kgspConstructEngine_IMPL: init RPC infrastructure failed
[ 3661.219455] NVRM: osInitNvMapping: *** Cannot attach gpu
[ 3661.220187] NVRM: RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[ 3661.220923] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x1e:744)
[ 3661.222069] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 3661.224317] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[ 3661.236232] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
     *-pci:0
          description: PCI bridge
          product: POWER9 Host Bridge (PHB4)
          vendor: IBM
          physical id: 100
          bus info: pci@0000:00:00.0
          version: 00
          slot: UIO Slot1
          width: 32 bits
          clock: 33MHz
          capabilities: pci pm pciexpress normal_decode bus_master cap_list
          resources: memory:600c000000000-600c07fefffff ioport:6000000000000(size=273803182080)
        *-display
             description: VGA compatible controller
             product: TU106 [GeForce RTX 2060 Rev. A]
             vendor: NVIDIA Corporation
             physical id: 0
             bus info: pci@0000:01:00.0
             version: a1
             slot: UIO Slot1
             width: 64 bits
             clock: 33MHz
             capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
             configuration: driver=nvidia latency=0
             resources: iomemory:600000-5fffff iomemory:600000-5fffff irq:43 memory:600c000000000-600c000ffffff memory:6000000000000-600000fffffff memory:6000010000000-6000011ffffff memory:600c001000000-600c00107ffff

To Reproduce

Build the modules on ppc64le with 64 KiB pages, then load nvidia-drm.ko.

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

This is on ppc64le with 64 KiB pages.
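
The `memdescMapIommu` lines in the log above come down to a range check: the DMA address handed to the driver lies outside the window the GPU reports as addressable, `[0x0-0x7fffffffffff]`. The following is a minimal sketch of that arithmetic, not the driver's actual code. The helper names `is_gpu_addressable` and `highest_set_bit` are hypothetical, and the constants are copied from the log.

```python
# Upper bound of the GPU-addressable window, copied from the log line
# "is not addressable by GPU 0x100 [0x0-0x7fffffffffff]".
GPU_ADDR_MAX = 0x7FFFFFFFFFFF


def is_gpu_addressable(start: int, end: int, limit: int = GPU_ADDR_MAX) -> bool:
    """Return True if the inclusive range [start, end] fits in [0, limit]."""
    return 0 <= start <= end <= limit


def highest_set_bit(addr: int) -> int:
    """Return the index of the most significant set bit (bit 0 is the LSB)."""
    return addr.bit_length() - 1


# DMA range reported in the first memdescMapIommu message.
dma_start = 0x8000000D2FF0000
dma_end = 0x8000000D2FF0FFF

print(f"GPU window width : {GPU_ADDR_MAX.bit_length()} bits")
print(f"highest bit set  : {highest_set_bit(dma_start)}")
print(f"addressable      : {is_gpu_addressable(dma_start, dma_end)}")
```

The window is 47 bits wide, while the reported address has bit 59 set, so the check fails no matter what the low bits are. A high bit like this would be consistent with a platform DMA bypass window rather than an ordinary translated address, but nothing in this report confirms that.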

riptl commented 3 months ago

Please note that the Makefiles for ppc64le are broken in various ways, so building nvidia-drm.ko required some changes. I can submit patches to fix them if there is community/maintainer interest.

riptl commented 3 months ago

Debug log:

[411161.437562] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[411161.438400] NVRM: GPU 0000:01:00.0: Opening GPU with minor number 0
[411161.443030] NVRM: GPU 0000:01:00.0: RmInitAdapter
[411161.443035] NVRM: GPU 0000:01:00.0: RmSetupRegisters for 0x10de:0x1f08
[411161.443040] NVRM: GPU 0000:01:00.0: pci config info:
[411161.443043] NVRM: GPU 0000:01:00.0:    registers look  like: 0x600c000000000 0x1000000
NVRM: GPU 0000:01:00.0:    fb        looks like: 0x6000000000000 0x10000000
NVRM: GPU 0000:01:00.0: Successfully mapped framebuffer and registers
[411161.443065] NVRM: GPU 0000:01:00.0: final mappings:
[411161.443068] NVRM: GPU 0000:01:00.0:     regs: 0x600c000000000 0x1000000 0x00000000d5906fea
[411161.565307] NVRM: VM: nv_alloc_pages: 1 pages, nodeid -1
[411161.565927] NVRM: VM:    contig 1  cache_type 1
[411161.571021] NVRM: VM: nv_alloc_contig_pages: 1 pages
[411161.572415] NVRM: VM: nv_alloc_pages:3790: 0x0000000058493b7b, 1 page(s), count = 1, page_table = 0x00000000e1bcb11b
[411161.572442] NVRM: memdescMapIommu: 0x800000d87910000-0x800000d87910fff is not addressable by GPU 0x100 [0x0-0x7fffffffffff]
[411161.572449] NVRM: VM: nv_free_pages: 0x1
[411161.572451] NVRM: VM: nv_free_pages:3813: 0x0000000058493b7b, 1 page(s), count = 1, page_table = 0x00000000e1bcb11b
[411161.572456] NVRM: VM: nv_free_contig_pages: 1 pages
[411161.572464] NVRM: nvCheckOkFailedNoLog: Check failed: Address not valid [NV_ERR_INVALID_ADDRESS] (0x0000001E) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1348
[411161.572472] NVRM: VM: nv_alloc_pages: 1 pages, nodeid -1
[411161.572474] NVRM: VM:    contig 1  cache_type 1
[411161.572501] NVRM: VM: nv_alloc_contig_pages: 1 pages
[411161.572508] NVRM: VM: nv_alloc_pages:3790: 0x000000002495e64f, 1 page(s), count = 1, page_table = 0x00000000b61a286d
[411161.572518] NVRM: memdescMapIommu: 0x800000d0c430000-0x800000d0c430fff is not addressable by GPU 0x100 [0x0-0x7fffffffffff]
[411161.572521] NVRM: VM: nv_free_pages: 0x1
[411161.572523] NVRM: VM: nv_free_pages:3813: 0x000000002495e64f, 1 page(s), count = 1, page_table = 0x00000000b61a286d
[411161.572527] NVRM: VM: nv_free_contig_pages: 1 pages
[411161.572532] NVRM: nvCheckOkFailedNoLog: Check failed: Address not valid [NV_ERR_INVALID_ADDRESS] (0x0000001E) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1348
[411161.572537] NVRM: nvAssertFailedNoLog: Assertion failed: pKernelMemorySystem->sysmemFlushBuffer != 0 @ kern_mem_sys_gm107.c:382
[411161.572695] NVRM: VM: nv_alloc_pages: 9 pages, nodeid -1
[411161.572698] NVRM: VM:    contig 0  cache_type 0
[411161.572701] NVRM: VM: nv_alloc_system_pages: 9 order0 pages, 0 order
[411161.572744] NVRM: VM: nv_alloc_pages:3790: 0x000000002495e64f, 9 page(s), count = 1, page_table = 0x00000000578cc9ba
[411161.572763] NVRM: memdescMapIommu: 0x800000d0c430000 is not addressable by GPU 0x100 [0x0-0x7fffffffffff]
[411161.572769] NVRM: VM: nv_free_pages: 0x9
[411161.572771] NVRM: VM: nv_free_pages:3813: 0x000000002495e64f, 9 page(s), count = 1, page_table = 0x00000000578cc9ba
[411161.572775] NVRM: VM: nv_free_system_pages: 9 pages
[411161.572781] NVRM: nvCheckOkFailedNoLog: Check failed: Address not valid [NV_ERR_INVALID_ADDRESS] (0x0000001E) returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1348
[411161.572785] NVRM: nvAssertOkFailedNoLog: Assertion failed: Address not valid [NV_ERR_INVALID_ADDRESS] (0x0000001E) returned from nvStatus @ message_queue_cpu.c:241
[411161.572792] NVRM: _kgspInitRpcInfrastructure: GspMsgQueueInit failed
[411161.572795] NVRM: kgspConstructEngine_IMPL: init RPC infrastructure failed
[411161.572928] NVRM: osInitNvMapping: *** Cannot attach gpu
[411161.572934] NVRM: RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[411161.572938] NVRM: GPU 0000:01:00.0: Tearing down registers
[411161.572946] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x1e:744)
[411161.573203] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[411161.574653] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[411161.575401] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device

aritger commented 3 months ago

Thank you for your report. Unfortunately, we don't currently plan to support ppc64le with the open-gpu-kernel-modules. Closing; sorry.