ROCm / ROCK-Kernel-Driver

AMDGPU Driver with KFD used by the ROCm project. Also contains the current Linux Kernel that matches this base driver

ROCK-1.5 fails to boot #20

Closed jvesely closed 7 years ago

jvesely commented 7 years ago

Trying to boot the recent ROCK-1.5 kernel results in a string of error messages:

AMD-Vi: Event logged
AMD-Vi: Completion-Wait loop timed out

Most devices failed to operate (ATA link errors, disk storage errors, ...). I used the Fedora config and enabled DC during configuration. Note that the stock Fedora kernel fails the same way on this machine [0]. The ROCK kernel produces a few extra errors in dmesg during startup (might be unrelated):

[    0.020470] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: 10 CPUID: 0
[    0.020476] [Firmware Bug]: CPU0: Using firmware package id 1 instead of 0
[    0.020479] Last level iTLB entries: 4KB 512, 2MB 1024, 4MB 512
[    0.020480] Last level dTLB entries: 4KB 1024, 2MB 1024, 4MB 512, 1GB 0
[    0.021477] Freeing SMP alternatives memory: 32K (ffffffffa1197000 - ffffffffa119f000)
[    0.025234] ftrace: allocating 31008 entries in 122 pages
[    0.037531] smpboot: APIC(10) Converting physical 1 to logical package 0
[    0.037533] smpboot: Max logical packages: 2
[    0.037940] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.149160] smpboot: CPU0: AMD FX-9800P RADEON R7, 12 COMPUTE CORES 4C+8G (family: 0x15, model: 0x65, stepping: 0x1)
[    0.149164] Performance Events: Fam15h core perfctr, AMD PMU driver.
[    0.149169] ... version:                0
[    0.149169] ... bit width:              48
[    0.149170] ... generic registers:      6
[    0.149170] ... value mask:             0000ffffffffffff
[    0.149171] ... max period:             00007fffffffffff
[    0.149171] ... fixed-purpose events:   0
[    0.149171] ... event mask:             000000000000003f
[    0.150000] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
[    0.150137] x86: Booting SMP configuration:
[    0.150139] .... node  #0, CPUs:      #1
[    0.150300] [Firmware Bug]: CPU1: APIC id mismatch. Firmware: 11 CPUID: 1
[    0.150302] [Firmware Bug]: CPU1: Using firmware package id 1 instead of 0
[    0.162359]  #2
[    0.162575] [Firmware Bug]: CPU2: APIC id mismatch. Firmware: 12 CPUID: 2
[    0.162576] [Firmware Bug]: CPU2: Using firmware package id 1 instead of 0
[    0.175608]  #3
[    0.175804] [Firmware Bug]: CPU3: APIC id mismatch. Firmware: 13 CPUID: 3
[    0.175805] [Firmware Bug]: CPU3: Using firmware package id 1 instead of 0

The machine is an Acer Aspire E 15 (E5-553G-F55F) using the latest BIOS update (2017/04/25, v1.16).

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1448121

jvesely commented 7 years ago

The machine boots OK when using iommu=soft on the kernel command line, but there are other errors:

[  522.886053] [drm] ring test on 0 succeeded in 16 usecs
[  522.886221] [drm] ring test on 9 succeeded in 14 usecs
[  522.886291] [drm] ring test on 1 succeeded in 4 usecs
[  523.128525] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 2 test failed (scratch(0xC040)=0xCAFEDEAD)
[  523.363125] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 3 test failed (scratch(0xC040)=0xCAFEDEAD)
[  523.591879] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 4 test failed (scratch(0xC040)=0xCAFEDEAD)
[  523.778308] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 5 test failed (scratch(0xC040)=0xCAFEDEAD)
[  524.006917] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 6 test failed (scratch(0xC040)=0xCAFEDEAD)
[  524.235767] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 7 test failed (scratch(0xC040)=0xCAFEDEAD)
[  524.464100] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 8 test failed (scratch(0xC040)=0xCAFEDEAD)
[  524.464163] [drm:amdgpu_resume [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -22
[  524.464225] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_resume failed (-22).
[  524.464439] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  524.464557] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  524.471415] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  524.471493] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
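
To double-check that the iommu=soft workaround mentioned above is actually in effect on a running system, the kernel command line can be inspected. The following is only a minimal sketch: the helper name `iommu_mode` is hypothetical, and returning the last `iommu=` occurrence is a simplifying assumption, not a statement about how the kernel itself resolves repeated parameters. On a live machine the string would typically come from reading `/proc/cmdline`.

```python
def iommu_mode(cmdline: str):
    """Return the value of the last iommu= token in cmdline, or None.

    Hypothetical helper for illustration; it only splits on whitespace
    and does not handle quoted arguments.
    """
    mode = None
    for token in cmdline.split():
        # Matches "iommu=soft" but not e.g. "amd_iommu=off".
        if token.startswith("iommu="):
            mode = token[len("iommu="):]
    return mode


# Example with a made-up command line (not taken from the affected machine):
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro iommu=soft quiet"
print(iommu_mode(example))
```

A result of `None` would suggest the parameter was never passed, for example because the bootloader configuration was not regenerated.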

jvesely commented 7 years ago

ROCK-1.4 boots ok on the same machine

jvesely commented 7 years ago

I can confirm that patch [0] from [1] applied on top of rock 1.5.x fixes the issue for me.

[0] https://patchwork.freedesktop.org/patch/146519/
[1] https://bugs.freedesktop.org/show_bug.cgi?id=101029

I still see the following in dmesg when I try to use the dGPU, but otherwise the machine boots, and both OpenGL and OpenCL work, at least on the iGPU:

[  158.110741] [drm] PCIE GART of 32768M enabled (table at 0x0000000000040000).
[  158.113244] amdgpu: [powerplay] can't get the mac of 5
[  158.127298] amdgpu 0000:07:00.0: CU info asic_type [0xa] not supported
[  158.128583] [drm] ring test on 0 succeeded in 14 usecs
[  158.128741] [drm] ring test on 9 succeeded in 13 usecs
[  158.128809] [drm] ring test on 1 succeeded in 5 usecs
[  158.372989] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 2 test failed (scratch(0xC040)=0xCAFEDEAD)
[  158.613855] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 3 test failed (scratch(0xC040)=0xCAFEDEAD)
[  158.841997] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 4 test failed (scratch(0xC040)=0xCAFEDEAD)
[  159.071920] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 5 test failed (scratch(0xC040)=0xCAFEDEAD)
[  159.300660] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 6 test failed (scratch(0xC040)=0xCAFEDEAD)
[  159.530630] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 7 test failed (scratch(0xC040)=0xCAFEDEAD)
[  159.717313] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 8 test failed (scratch(0xC040)=0xCAFEDEAD)
[  159.717344] [drm:amdgpu_resume [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -22
[  159.717371] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_resume failed (-22).
[  159.717492] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  159.717585] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
[  165.542874] amdgpu: [powerplay] VI should always have 2 performance levels
[  173.943442] amdgpu 0000:07:00.0: GPU pci config reset
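
When comparing logs like the ones in this thread across kernel versions, it can help to reduce the ring-test lines to a list of passing and failing rings. The sketch below is an illustration only: the function name `summarize_ring_tests` is hypothetical, and the regular expressions assume the exact message wording seen in the dmesg output above, which may differ in other driver versions.

```python
import re

# Patterns modeled on the amdgpu messages quoted in this thread (assumption:
# other kernel versions may word these messages differently).
RING_OK = re.compile(r"ring test on (\d+) succeeded")
RING_FAIL = re.compile(r"ring (\d+) test failed")


def summarize_ring_tests(lines):
    """Return (passed, failed) ring numbers found in dmesg lines, sorted."""
    passed, failed = [], []
    for line in lines:
        m = RING_OK.search(line)
        if m:
            passed.append(int(m.group(1)))
            continue
        m = RING_FAIL.search(line)
        if m:
            failed.append(int(m.group(1)))
    return sorted(passed), sorted(failed)


# A few lines copied from the log above.
sample = [
    "[  158.128583] [drm] ring test on 0 succeeded in 14 usecs",
    "[  158.128741] [drm] ring test on 9 succeeded in 13 usecs",
    "[  158.372989] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: "
    "ring 2 test failed (scratch(0xC040)=0xCAFEDEAD)",
]
print(summarize_ring_tests(sample))  # prints ([0, 9], [2])
```

Feeding it the full output of `dmesg` (one line per list element) gives a quick view of which rings regress between two kernels.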

jvesely commented 7 years ago

fixed in rock-1.5.1 by this commit:

d2b8bb37105658bdbe089b4dada08602c7a95aa8

Author: Arindam Nath <arindam.nath@amd.com> 2017-03-27 02:17:07
Committer: Felix Kuehling <Felix.Kuehling@amd.com> 2017-05-18 14:08:11

iommu/amd: flush IOTLB for specific domains only

still broken in 1.6.x