ROCm / HIP

HIP: C++ Heterogeneous-Compute Interface for Portability
https://rocmdocs.amd.com/projects/HIP/
MIT License

kernel oops when running hip kernel with dev branch ROCR/ROCK #15

Closed mattmacy closed 8 years ago

mattmacy commented 8 years ago

I was able to do the tutorial on gpuopen.com, but found that hipGetDeviceCount only returned 1, so the examples would only run on my primary GPU, a GTX 980Ti. I also have an R9 Nano and an R9 Fury. The kfd driver exports 3 nodes under topology, so the runtime should let me talk to them. I'm running Ubuntu 15. I was hoping to instrument hip_hcc.cpp to see what it was doing right here:

/*
  * Build a table of valid compute devices.
  */
 auto accs = hc::accelerator::get_all();
 int deviceCnt = 0;
 for (int i=0; i<accs.size(); i++) {
     if (! accs[i].get_is_emulated()) {
         deviceCnt++;
     }
 };
 -
 +    printf("actual device count is %d\n", deviceCnt);
 // Make sure the hip visible devices are within the deviceCnt range
 for (int i = 0; i < g_hip_visible_devices.size(); i++) {
     if(g_hip_visible_devices[i] >= deviceCnt){
         // Make sure any DeviceID after invalid DeviceID will be erased.
         g_hip_visible_devices.resize(i);
         break;
     }
 }

But I can't even get it to compile:

~/devel/HIP2$ make
./bin/hipcc -I/opt/hcc/include -std=c++11 -I/opt/hsa/include src/hip_hcc.cpp -c -O3 -o src/hip_hcc.o
src/hip_hcc.cpp:52:2: error: #error (USE_AM_TRACKER requries HCC version of 16074 or newer)
#error (USE_AM_TRACKER requries HCC version of 16074 or newer)
^
Died at ./bin/hipcc line 208.
Makefile:20: recipe for target 'src/hip_hcc.o' failed
make: *** [src/hip_hcc.o] Error 1

I made the following change to the Makefile in response to complaints. But it's still not doing anything. And it looks like it's trying to compile the code with nvcc:

mmacy@pandemonium:~/devel/HIP2$ hipcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

bensander commented 8 years ago

Hi Matt, please try setting env var HIP_PLATFORM to "hcc" so hip will recognize the nanos.

On Mar 23, 2016, at 11:36 PM, Matthew Macy <notifications@github.com> wrote:


mattmacy commented 8 years ago

I see - that tells it which compiler to use.

hipcc square.cpp
In file included from /home/mmacy/devel/HIP/src/hip_hcc.cpp:42:
In file included from /home/mmacy/devel/HIP/include/hip_runtime.h:54:
In file included from /home/mmacy/devel/HIP/include/hcc_detail/hip_runtime.h:41:
In file included from /home/mmacy/devel/HIP/include/hip_runtime_api.h:196:
/home/mmacy/devel/HIP/include/hcc_detail/hip_runtime_api.h:35:2: error: ("This version of HIP requires a newer version of HCC.");
#error("This version of HIP requires a newer version of HCC.");
^
/home/mmacy/devel/HIP/src/hip_hcc.cpp:2354:17: warning: unused variable 'stream' [-Wunused-variable]
hipStream_t stream = ihipSyncAndResolveStream(hipStreamNull);
^
1 warning and 1 error generated.
remake-deps failed at /home/mmacy/devel/HIP/bin/hipcc line 179.

That doesn't work so well.

I've installed the most recent .deb from https://bitbucket.org/multicoreware/hcc/downloads. I take it I need to download the hcc sources as well?

I see. Their latest .deb is 16045. Your sources require 16074 or later.

I'm trying the following to see if I get a working hcc: https://bitbucket.org/multicoreware/hcc/wiki/Developer%20Information

mattmacy commented 8 years ago

Progress. I'm running 16124. It looks like you're out of sync with hsa:

mmacy@pandemonium:~/devel/hcc/build$ hcc --version
HCC clang version 3.5.0 (based on HCC 0.10.16124-89bbf6f-7e4cd9e LLVM 3.5.0svn)
Target: x86_64-unknown-linux-gnu
Thread model: posix
mmacy@pandemonium:~/devel/hcc/build$ cd ../..
mmacy@pandemonium:~/devel$ cd HIP/samples/0_Intro/
bit_extract/ square/
mmacy@pandemonium:~/devel$ cd HIP/samples/0_Intro/square/
mmacy@pandemonium:~/devel/HIP/samples/0_Intro/square$ hipcc square.cpp
/home/mmacy/devel/HIP/src/hip_hcc.cpp:2093:22: error: no matching function for call to 'hsa_amd_memory_async_copy'
hsa_status = hsa_amd_memory_async_copy(dstp, _device->_hsa_agent, locked_srcp, _device->_hsa_agent, theseBytes, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
^~~~~~~~~
/opt/hsa/include/hsa_ext_amd.h:452:5: note: candidate function not viable: requires 7 arguments, but 8 were provided
hsa_amd_memory_async_copy(void* dst, const void* src, size_t size,
^
/home/mmacy/devel/HIP/src/hip_hcc.cpp:2155:35: error: no matching function for call to 'hsa_amd_memory_async_copy'
hsa_status_t hsa_status = hsa_amd_memory_async_copy(dstp, _device->_hsa_agent, _pinnedStagingBuffer[bufferIndex], _device->_hsa_agent, theseBytes, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
^~~~~~~~~
/opt/hsa/include/hsa_ext_amd.h:452:5: note: candidate function not viable: requires 7 arguments, but 8 were provided
hsa_amd_memory_async_copy(void* dst, const void* src, size_t size,
^
/home/mmacy/devel/HIP/src/hip_hcc.cpp:2208:39: error: no matching function for call to 'hsa_amd_memory_async_copy'
hsa_status_t hsa_status = hsa_amd_memory_async_copy(_pinnedStagingBuffer[bufferIndex], _device->_hsa_agent, srcp0, _device->_hsa_agent, theseBytes, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
^~~~~~~~~
/opt/hsa/include/hsa_ext_amd.h:452:5: note: candidate function not viable: requires 7 arguments, but 8 were provided
hsa_amd_memory_async_copy(void* dst, const void* src, size_t size,
^
/home/mmacy/devel/HIP/src/hip_hcc.cpp:2333:35: error: no matching function for call to 'hsa_amd_memory_async_copy'
hsa_status_t hsa_status = hsa_amd_memory_async_copy(dst, device->_hsa_agent, src, device->_hsa_agent, sizeBytes, depSignalCnt, depSignalCnt ? &depSignal:0x0, device->_copy_signal);
^~~~~~~~~
/opt/hsa/include/hsa_ext_amd.h:452:5: note: candidate function not viable: requires 7 arguments, but 8 were provided
hsa_amd_memory_async_copy(void* dst, const void* src, size_t size,
^
/home/mmacy/devel/HIP/src/hip_hcc.cpp:2463:39: error: no matching function for call to 'hsa_amd_memory_async_copy'
hsa_status_t hsa_status = hsa_amd_memory_async_copy(dst, device->_hsa_agent, src, device->_hsa_agent, sizeBytes, depSignalCnt, depSignalCnt ? &depSignal:0x0, ihip_signal->_hsa_signal);
^~~~~~~~~
/opt/hsa/include/hsa_ext_amd.h:452:5: note: candidate function not viable: requires 7 arguments, but 8 were provided
hsa_amd_memory_async_copy(void* dst, const void* src, size_t size,
^
5 errors generated.
remake-deps failed at /home/mmacy/devel/HIP/bin/hipcc line 179.

mattmacy commented 8 years ago

I don't know what the situation is with the ROCR_V2 API. The async memcpy in what I assume is the canonical hsa_ext_amd.h: https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/src/inc/hsa_ext_amd.h looks more like the old one.

I made the following changes to hip_hcc.cpp to get my square.cpp to compile using hcc as the HIP_PLATFORM:

index 57d55a1..776e7c6 100644
--- a/src/hip_hcc.cpp
+++ b/src/hip_hcc.cpp
@@ -2090,7 +2090,7 @@ void StagingBuffer::CopyHostToDevicePinInPlace(void* dst, const void* src, size_
         hsa_signal_store_relaxed(_completion_signal[bufferIndex], 1);

 #if USE_ROCR_V2
-        hsa_status = hsa_amd_memory_async_copy(dstp, _device->_hsa_agent, locked_srcp, _device->_hsa_agent, theseBytes, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
+        hsa_status = hsa_amd_memory_async_copy(dstp, locked_srcp, theseBytes, _device->_hsa_agent, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
 #else
         assert(0);
 #endif
@@ -2152,7 +2152,7 @@ void StagingBuffer::CopyHostToDevice(void* dst, const void* src, size_t sizeByte
         hsa_signal_store_relaxed(_completion_signal[bufferIndex], 1);

 #if USE_ROCR_V2
-        hsa_status_t hsa_status = hsa_amd_memory_async_copy(dstp, _device->_hsa_agent, _pinnedStagingBuffer[bufferIndex], _device->_hsa_agent, theseBytes, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
+        hsa_status_t hsa_status = hsa_amd_memory_async_copy(dstp,  _pinnedStagingBuffer[bufferIndex], theseBytes, _device->_hsa_agent, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
 #else
         hsa_status_t hsa_status = hsa_amd_memory_async_copy(dstp, _pinnedStagingBuffer[bufferIndex], theseBytes, _device->_hsa_agent, 0, NULL, _completion_signal[bufferIndex]);
 #endif
@@ -2205,7 +2205,7 @@ void StagingBuffer::CopyDeviceToHost(void* dst, const void* src, size_t sizeByte
             tprintf (TRACE_COPY2, "D2H: bytesRemaining0=%zu  async_copy %zu bytes src:%p to staging:%p\n", bytesRemaining0, theseBytes, srcp0, _pinnedStagingBuffer[bufferIndex]);
             hsa_signal_store_relaxed(_completion_signal[bufferIndex], 1);
 #if USE_ROCR_V2
-            hsa_status_t hsa_status = hsa_amd_memory_async_copy(_pinnedStagingBuffer[bufferIndex], _device->_hsa_agent, srcp0, _device->_hsa_agent, theseBytes, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
+            hsa_status_t hsa_status = hsa_amd_memory_async_copy(_pinnedStagingBuffer[bufferIndex], srcp0, theseBytes, _device->_hsa_agent, waitFor ? 1:0, waitFor, _completion_signal[bufferIndex]);
 #else
             hsa_status_t hsa_status = hsa_amd_memory_async_copy(_pinnedStagingBuffer[bufferIndex], srcp0, theseBytes, _device->_hsa_agent, 0, NULL, _completion_signal[bufferIndex]);
 #endif
@@ -2330,7 +2330,7 @@ void ihipSyncCopy(ihipStream_t *stream, void* dst, const void* src, size_t sizeB

 #if USE_ROCR_V2
-        hsa_status_t hsa_status = hsa_amd_memory_async_copy(dst, device->_hsa_agent, src, device->_hsa_agent, sizeBytes, depSignalCnt, depSignalCnt ? &depSignal:0x0, device->_copy_signal);
+        hsa_status_t hsa_status = hsa_amd_memory_async_copy(dst, src, sizeBytes, device->_hsa_agent, depSignalCnt, depSignalCnt ? &depSignal:0x0, device->_copy_signal);
 #else
         hsa_status_t hsa_status = hsa_amd_memory_async_copy(dst, src, sizeBytes, device->_hsa_agent, 0, NULL, device->_copy_signal);
 #endif
@@ -2460,7 +2460,7 @@ hipError_t hipMemcpyAsync(void* dst, const void* src, size_t sizeBytes, hipMemcp

             tprintf (TRACE_SYNC, " copy-async, waitFor=%lu completion=#%lu(%lu)\n", depSignalCnt? depSignal.handle:0x0, ihip_signal->_sig_id, ihip_signal->_hsa_signal.handle);

-            hsa_status_t hsa_status = hsa_amd_memory_async_copy(dst, device->_hsa_agent, src, device->_hsa_agent, sizeBytes, depSignalCnt, depSignalCnt ? &depSignal:0x0, ihip_signal->_hsa_signal);
+            hsa_status_t hsa_status = hsa_amd_memory_async_copy(dst, src, sizeBytes, device->_hsa_agent, depSignalCnt, depSignalCnt ? &depSignal:0x0, ihip_signal->_hsa_signal);
 #else
             hsa_status_t hsa_status = hsa_amd_memory_async_copy(dst, src, sizeBytes, device->_hsa_agent, 0, NULL, ihip_signal->_hsa_signal);

whchung commented 8 years ago

Hi Matthew, can you try switching to the "dev" branch on both ROCK-Kernel-Driver and ROCR-Runtime? You should be able to find the newer async_copy API, which works with HIP, over there.

mattmacy commented 8 years ago

What do I do to just re-build the driver? Thanks.

mattmacy commented 8 years ago

And for that matter - how do I rebuild the runtime? There's no makefile in the root.

whchung commented 8 years ago

Hi Matthew, you don't need to build them. On the "dev" branch of ROCK-Kernel-Driver you can find a "package" directory which has Ubuntu and Fedora packages inside. You can also find pre-built packages under the "package" directory in ROCR-Runtime. Please remember to switch to the "dev" branch on both repositories, though.

mattmacy commented 8 years ago

OK. Great. Thanks. I'll do that in the morning and let you know how that goes. In the meantime the patched version works for me.

I do notice that AMD kernels are much slower than Nvidia kernels:

mmacy@pandemonium:~/devel/HIP/samples/0_Intro/square$ time !!
time ./a.out
deviceCount: 2
info: running on device Fiji
info: allocate host mem ( 7.63 MB)
info: allocate device mem ( 7.63 MB)
info: copy Host2Device
info: launch 'vector_square' kernel
info: copy Device2Host
info: check result
PASSED!

real 0m1.203s
user 0m0.088s
sys 0m0.184s

mmacy@pandemonium:~/devel/HIP.old/samples/0_Intro/square$ time ./square.hip.out
deviceCount: 1
info: running on device GeForce GTX 980 Ti
info: allocate host mem ( 7.63 MB)
info: allocate device mem ( 7.63 MB)
info: copy Host2Device
info: launch 'vector_square' kernel
info: copy Device2Host
info: check result
PASSED!

real 0m0.273s
user 0m0.028s
sys 0m0.244s

Is that fundamental? Or does your job dispatch interface just need refinement?

Thanks.

whchung commented 8 years ago

Hi Matthew, there is a lot of ongoing work to optimize all aspects of the stack. Please stay tuned. :)

mattmacy commented 8 years ago

OK. I updated both the kernel and the runtime to the 316 build. When running the square.cpp example with HIP_PLATFORM=hcc (nvcc still works fine) I now get a kernel oops:

Mar 24 11:28:34 pandemonium kernel: [ 639.895604] nvidia_uvm: Loaded the UVM driver, major device number 245
Mar 24 11:29:06 pandemonium kernel: [ 671.693636] amdgpu: vram aperture is out of 40bit address base: 0x383fc0000000 limit 0x383fd0000000
Mar 24 11:29:06 pandemonium kernel: [ 671.693749] amdgpu: vram aperture is out of 40bit address base: 0x383fe0000000 limit 0x383ff0000000
Mar 24 11:29:06 pandemonium kernel: [ 671.696239] amdgpu: vram aperture is out of 40bit address base: 0x383fc0000000 limit 0x383fd0000000
Mar 24 11:29:06 pandemonium kernel: [ 671.734321] amdgpu: vram aperture is out of 40bit address base: 0x383fe0000000 limit 0x383ff0000000
Mar 24 11:29:06 pandemonium kernel: [ 671.776858] BUG: unable to handle kernel paging request at ffffc90019ecd000
Mar 24 11:29:06 pandemonium kernel: [ 671.776863] IP: [] set_trap_handler+0x1a/0x30 [amdkfd]
Mar 24 11:29:06 pandemonium kernel: [ 671.776879] PGD ffec8f067 PUD ffeca0067 PMD fcf5f2067 PTE 0
Mar 24 11:29:06 pandemonium kernel: [ 671.776883] Oops: 0002 [#1] SMP
Mar 24 11:29:06 pandemonium kernel: [ 671.776886] Modules linked in: nvidia_uvm(POE) vmw_vsock_vmci_transport vsock vmw_vmci rfcomm bnep binfmt_misc hid_logitech_hidpp btusb btbcm btintel bluetooth b43 mac80211 nls_iso8859_1 cfg80211 ssb intel_rapl iosf_mbi x86_pkg_temp_thermal eeepc_wmi intel_powerclamp coretemp asus_wmi sparse_keymap video mxm_wmi kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel nvidia(POE) aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw sb_edac edac_core snd_hda_codec_realtek snd_usb_audio snd_hda_codec_generic snd_usbmidi_lib snd_seq_midi hid_logitech_dj snd_seq_midi_event snd_hda_codec_hdmi snd_rawmidi snd_hda_intel snd_hda_controller snd_hda_codec snd_seq snd_hda_core snd_hwdep snd_seq_device bcma snd_pcm snd_timer snd mei_me lpc_ich soundcore mei shpchp wmi tpm_infineon mac_hid parport_pc ppdev lp parport autofs4 hid_generic usbhid hid amdkfd amd_iommu_v2 amdgpu psmouse amd_gnb_bus i2c_algo_bit e1000e ttm drm_kms_helper ahci ptp libahci drm pps_core
Mar 24 11:29:06 pandemonium kernel: [ 671.776940] CPU: 0 PID: 4040 Comm: a.out Tainted: P OE 4.1.0-201603162000-kfd-build-obsidian-82-generic #82
Mar 24 11:29:06 pandemonium kernel: [ 671.776943] Hardware name: iXsystems CSE-COR-AIR540/RAMPAGE V EXTREME, BIOS 1902 12/18/2015
Mar 24 11:29:06 pandemonium kernel: [ 671.776944] task: ffff880e71d93250 ti: ffff880ea4020000 task.ti: ffff880ea4020000
Mar 24 11:29:06 pandemonium kernel: [ 671.776946] RIP: 0010:[] [] set_trap_handler+0x1a/0x30 [amdkfd]
Mar 24 11:29:06 pandemonium kernel: [ 671.776955] RSP: 0018:ffff880ea4023d48 EFLAGS: 00010286
Mar 24 11:29:06 pandemonium kernel: [ 671.776956] RAX: ffffc90019ecd000 RBX: ffff880ff0f92e00 RCX: 0000000000000000
Mar 24 11:29:06 pandemonium kernel: [ 671.776957] RDX: 0000000002400000 RSI: ffff880ff3251e20 RDI: ffff880ff0f92a00
Mar 24 11:29:06 pandemonium kernel: [ 671.776959] RBP: ffff880ea4023d48 R08: ffff880ff0f92a00 R09: 0000000000000000
Mar 24 11:29:06 pandemonium kernel: [ 671.776960] R10: ffff880f962af800 R11: 00007ffc39269e80 R12: ffff880ea4023dc0
Mar 24 11:29:06 pandemonium kernel: [ 671.776961] R13: ffff880fb1275018 R14: ffff880fb1275000 R15: ffff880ea4023dc0
Mar 24 11:29:06 pandemonium kernel: [ 671.776963] FS: 00007f8005ebc740(0000) GS:ffff880fff200000(0000) knlGS:0000000000000000
Mar 24 11:29:06 pandemonium kernel: [ 671.776965] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 24 11:29:06 pandemonium kernel: [ 671.776966] CR2: ffffc90019ecd000 CR3: 0000000f3f50b000 CR4: 00000000001407f0
Mar 24 11:29:06 pandemonium kernel: [ 671.776968] Stack:
Mar 24 11:29:06 pandemonium kernel: [ 671.776969] ffff880ea4023d78 ffffffffc03d8883 ffff880ea4023dc0 fffffffffffffff2
Mar 24 11:29:06 pandemonium kernel: [ 671.776972] 000000000000001a 00000000fffffff2 ffff880ea4023e78 ffffffffc03d9ebf
Mar 24 11:29:06 pandemonium kernel: [ 671.776975] ffff880f962af800 ffffffffc03d8810 ffff880fb1275000 00007ffc39269e80
Mar 24 11:29:06 pandemonium kernel: [ 671.776977] Call Trace:
Mar 24 11:29:06 pandemonium kernel: [ 671.776986] [] kfd_ioctl_set_trap_handler+0x73/0xc0 [amdkfd]
Mar 24 11:29:06 pandemonium kernel: [ 671.776994] [] kfd_ioctl+0x2bf/0x4d0 [amdkfd]
Mar 24 11:29:06 pandemonium kernel: [ 671.777001] [] ? kfd_ioctl_get_process_apertures+0x2e0/0x2e0 [amdkfd]
Mar 24 11:29:06 pandemonium kernel: [ 671.777010] [] ? pte_alloc_one+0x30/0x50
Mar 24 11:29:06 pandemonium kernel: [ 671.777015] [] ? __pte_alloc+0xcc/0x180
Mar 24 11:29:06 pandemonium kernel: [ 671.777019] [] do_vfs_ioctl+0x2f8/0x510
Mar 24 11:29:06 pandemonium kernel: [ 671.777023] [] ? __do_page_fault+0x1b6/0x450
Mar 24 11:29:06 pandemonium kernel: [ 671.777026] [] SyS_ioctl+0x81/0xa0
Mar 24 11:29:06 pandemonium kernel: [ 671.777028] [] ? do_page_fault+0x30/0x80
Mar 24 11:29:06 pandemonium kernel: [ 671.777032] [] system_call_fastpath+0x16/0x75
Mar 24 11:29:06 pandemonium kernel: [ 671.777034] Code: 00 0f 1f 44 00 00 55 31 c0 48 89 e5 5d c3 0f 1f 00 0f 1f 44 00 00 55 48 8b 46 f0 48 89 e5 8b 80 f4 01 00 00 48 03 86 e0 00 00 00 <48> 89 10 48 89 48 08 31 c0 5d c3 66 66 2e 0f 1f 84 00 00 00 00
Mar 24 11:29:06 pandemonium kernel: [ 671.777061] RIP [] set_trap_handler+0x1a/0x30 [amdkfd]
Mar 24 11:29:06 pandemonium kernel: [ 671.777068] RSP
Mar 24 11:29:06 pandemonium kernel: [ 671.777069] CR2: ffffc90019ecd000
Mar 24 11:29:06 pandemonium kernel: [ 671.777072] ---[ end trace 53807749a7eb2ed3 ]---

Should I go back to the 1/25 version of driver/runtime with my local patch or is this likely to be fixed? I'm happy to provide more info if need be.

mattmacy commented 8 years ago

I created an issue with ROCK, as that is probably where the current problem belongs.

aditya4d commented 8 years ago

Hi, make sure you install the Debian packages from the ROCK dev branch. For the runtime, make sure to install from the ROCR dev branch. Then test the sample in /opt/hsa/sample.

If the sample does not pass, ROCR or ROCK is not working as it should.

If it passes, get the compiler (HCC and LLVM) by following https://github.com/RadeonOpenCompute/LLVM-AMDGPU-Assembler-Extra. Make sure you run the conformance test given in the wiki for that repo.

Then set HSA_PATH to /opt/hsa and HCC_PATH to /opt/hcc. Likewise, add their bin directories to PATH and their lib directories to LD_LIBRARY_PATH.

Finally, get HIP, set HIP_PATH to its project directory, and add the directory containing hipcc to PATH.

mattmacy commented 8 years ago

See previous comment "OK. I updated both the kernel and the runtime to the 316 build." That's the dev kernel. I also installed the dev runtime so that hip_hcc.cpp will compile with the ROCR_V2 copy interface. And that is what is causing this panic.

mattmacy commented 8 years ago

My sample passed fine until I tried the latest kernel and runtime. So all the other options are correct.

aditya4d commented 8 years ago

Can you try running hsa sample?

mattmacy commented 8 years ago

I'm no longer able to boot the dev kernel. It also complains of not properly detecting my graphics hardware, so it needs to run in low resolution, but instead it never displays a login prompt. I'm not sure what I need to do to recover at this point. The default Ubuntu kernel still works OK.

mattmacy commented 8 years ago

Looking at the logs, it seems I'm seeing further OOPS at boot now:

Mar 24 11:53:42 pandemonium rsyslogd: rsyslogd's userid changed to 104
Mar 24 11:53:43 pandemonium kernel: [ 13.184444] NVRM: Your system is not currently configured to drive a VGA console
Mar 24 11:53:43 pandemonium kernel: [ 13.184446] NVRM: on the primary VGA device. The NVIDIA Linux graphics driver
Mar 24 11:53:43 pandemonium kernel: [ 13.184447] NVRM: requires the use of a text-mode VGA console. Use of other console
Mar 24 11:53:43 pandemonium kernel: [ 13.184448] NVRM: drivers including, but not limited to, vesafb, may result in
Mar 24 11:53:43 pandemonium kernel: [ 13.184449] NVRM: corruption and stability problems, and is not supported.
Mar 24 11:53:43 pandemonium kernel: [ 13.312533] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
Mar 24 11:53:43 pandemonium kernel: [ 13.312537] IP: [] amdgpu_bo_list_set+0x196/0x3d0 [amdgpu]
Mar 24 11:53:43 pandemonium kernel: [ 13.312554] PGD 0
Mar 24 11:53:43 pandemonium kernel: [ 13.312556] Oops: 0000 [#1] SMP
Mar 24 11:53:43 pandemonium kernel: [ 13.312558] Modules linked in: vmw_vsock_vmci_transport vsock vmw_vmci b43 mac80211 intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp eeepc_wmi asus_wmi sparse_keymap cfg80211 video kvm ssb mxm_wmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel snd_hda_codec_realtek snd_hda_codec_generic aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_hda_codec_hdmi serio_raw snd_hda_intel sb_edac snd_hda_controller nvidia(POE) snd_hda_codec edac_core snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd lpc_ich bcma mei_me soundcore mei shpchp bnep wmi bluetooth tpm_infineon mac_hid binfmt_misc parport_pc ppdev lp parport nls_iso8859_1 amdkfd amd_iommu_v2 amdgpu psmouse amd_gnb_bus i2c_algo_bit ttm drm_kms_helper e1000e ahci libahci drm ptp pps_core
Mar 24 11:53:43 pandemonium kernel: [ 13.312584] CPU: 1 PID: 1335 Comm: Xorg Tainted: P OE 4.1.0-201603162000-kfd-build-obsidian-82-generic #82
Mar 24 11:53:43 pandemonium kernel: [ 13.312586] Hardware name: iXsystems CSE-COR-AIR540/RAMPAGE V EXTREME, BIOS 1902 12/18/2015
Mar 24 11:53:43 pandemonium kernel: [ 13.312587] task: ffff880fcf760000 ti: ffff880ff34d4000 task.ti: ffff880ff34d4000
Mar 24 11:53:43 pandemonium kernel: [ 13.312587] RIP: 0010:[] [] amdgpu_bo_list_set+0x196/0x3d0 [amdgpu]
Mar 24 11:53:43 pandemonium kernel: [ 13.312595] RSP: 0018:ffff880ff34d7c68 EFLAGS: 00010286
Mar 24 11:53:43 pandemonium kernel: [ 13.312596] RAX: ffff880ff17a83c0 RBX: ffff880ff0d7d400 RCX: 0000000000000001
Mar 24 11:53:43 pandemonium kernel: [ 13.312597] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff880ff17a8cc0
Mar 24 11:53:43 pandemonium kernel: [ 13.312597] RBP: ffff880ff34d7cd8 R08: 000000000001a920 R09: ffff880ff17a83c0
Mar 24 11:53:43 pandemonium kernel: [ 13.312598] R10: ffffffffc01cc11d R11: 00000000c0186443 R12: ffff880ff4a239c0
Mar 24 11:53:43 pandemonium kernel: [ 13.312599] R13: ffff880ff31ee110 R14: ffff880ff5c16200 R15: ffff880ff06f0000
Mar 24 11:53:43 pandemonium kernel: [ 13.312599] FS: 00007f705b545980(0000) GS:ffff880fff240000(0000) knlGS:0000000000000000
Mar 24 11:53:43 pandemonium kernel: [ 13.312600] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 24 11:53:43 pandemonium kernel: [ 13.312601] CR2: 0000000000000010 CR3: 0000000ff6cd5000 CR4: 00000000001407e0
Mar 24 11:53:43 pandemonium kernel: [ 13.312602] Stack:
Mar 24 11:53:43 pandemonium kernel: [ 13.312602] ffff880ff34d7cd8 ffff880fe0b0c400 ffff880fe0b0c000 ffff880fe0b0c800
Mar 24 11:53:43 pandemonium kernel: [ 13.312604] 0000000200000000 ffff880ff382ee00 ffff880ff17a83c0 0000000000000002
Mar 24 11:53:43 pandemonium kernel: [ 13.312605] 0000000000000000 ffff880ff31ee110 ffff880ff382ee00 ffff880ff4a239c0
Mar 24 11:53:43 pandemonium kernel: [ 13.312606] Call Trace:
Mar 24 11:53:43 pandemonium kernel: [ 13.312615] [] amdgpu_bo_list_ioctl+0x279/0x3f0 [amdgpu]
Mar 24 11:53:43 pandemonium kernel: [ 13.312623] [] drm_ioctl+0x379/0x6a0 [drm]
Mar 24 11:53:43 pandemonium kernel: [ 13.312630] [] ? amdgpu_bo_list_free+0x90/0x90 [amdgpu]
Mar 24 11:53:43 pandemonium kernel: [ 13.312635] [] amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]
Mar 24 11:53:43 pandemonium kernel: [ 13.312638] [] do_vfs_ioctl+0x2f8/0x510
Mar 24 11:53:43 pandemonium kernel: [ 13.312640] [] ? __do_page_fault+0x1b6/0x450
Mar 24 11:53:43 pandemonium kernel: [ 13.312642] [] SyS_ioctl+0x81/0xa0
Mar 24 11:53:43 pandemonium kernel: [ 13.312643] [] ? do_page_fault+0x30/0x80
Mar 24 11:53:43 pandemonium kernel: [ 13.312645] [] system_call_fastpath+0x16/0x75
Mar 24 11:53:43 pandemonium kernel: [ 13.312646] Code: 00 00 48 8b bb 08 01 00 00 e8 57 67 fe ff 84 c0 0f 85 2f ff ff ff 8b 45 b0 8d 48 01 48 8d 04 80 48 c1 e0 04 48 03 45 c0 48 8b 10 <8b> 52 10 83 fa 04 89 50 30 0f 84 2b 01 00 00 89 50 34 48 89 18
Mar 24 11:53:43 pandemonium kernel: [ 13.312659] RIP [] amdgpu_bo_list_set+0x196/0x3d0 [amdgpu]
Mar 24 11:53:43 pandemonium kernel: [ 13.312665] RSP
Mar 24 11:53:43 pandemonium kernel: [ 13.312666] CR2: 0000000000000010
Mar 24 11:53:43 pandemonium kernel: [ 13.312667] ---[ end trace df8106f7c32c327f ]---
Mar 24 11:53:43 pandemonium nvidia-persistenced: The daemon no longer has permission to remove its runtime data directory /var/run/nvidia-persistenced
Mar 24 11:53:43 pandemonium nvidia-persistenced: Shutdown (1372)

Again moments later:

Mar 24 11:53:48 pandemonium nvidia-persistenced: Started (1640)
Mar 24 11:53:49 pandemonium kernel: [ 19.067885] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
Mar 24 11:53:49 pandemonium kernel: [ 19.067889] IP: [] amdgpu_bo_list_set+0x196/0x3d0 [amdgpu]
Mar 24 11:53:49 pandemonium kernel: [ 19.067905] PGD 0
Mar 24 11:53:49 pandemonium kernel: [ 19.067907] Oops: 0000 [#2] SMP
Mar 24 11:53:49 pandemonium kernel: [ 19.067908] Modules linked in: hid_logitech_hidpp snd_usb_audio hid_logitech_dj snd_usbmidi_lib btusb btbcm btintel hid_generic usbhid hid vmw_vsock_vmci_transport vsock vmw_vmci b43 mac80211 intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp eeepc_wmi asus_wmi sparse_keymap cfg80211 video kvm ssb mxm_wmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel snd_hda_codec_realtek snd_hda_codec_generic aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_hda_codec_hdmi serio_raw snd_hda_intel sb_edac snd_hda_controller nvidia(POE) snd_hda_codec edac_core snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd lpc_ich bcma mei_me soundcore mei shpchp bnep wmi bluetooth tpm_infineon mac_hid binfmt_misc parport_pc ppdev lp parport nls_iso8859_1 amdkfd amd_iommu_v2 amdgpu psmouse amd_gnb_bus i2c_algo_bit ttm drm_kms_helper e1000e ahci libahci drm ptp pps_core
Mar 24 11:53:49 pandemonium kernel: [ 19.067935] CPU: 0 PID: 1637 Comm: Xorg Tainted: P D OE 4.1.0-201603162000-kfd-build-obsidian-82-generic #82
Mar 24 11:53:49 pandemonium kernel: [ 19.067937] Hardware name: iXsystems CSE-COR-AIR540/RAMPAGE V EXTREME, BIOS 1902 12/18/2015
Mar 24 11:53:49 pandemonium kernel: [ 19.067938] task: ffff880fcf79c670 ti: ffff880ff6354000 task.ti: ffff880ff6354000
Mar 24 11:53:49 pandemonium kernel: [ 19.067939] RIP: 0010:[] [] amdgpu_bo_list_set+0x196/0x3d0 [amdgpu]
Mar 24 11:53:49 pandemonium kernel: [ 19.067946] RSP: 0018:ffff880ff6357c68 EFLAGS: 00010286
Mar 24 11:53:49 pandemonium kernel: [ 19.067947] RAX: ffff880ff14b9cc0 RBX: ffff880ff61dc800 RCX: 0000000000000001
Mar 24 11:53:49 pandemonium kernel: [ 19.067948] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff880ff14b9d80
Mar 24 11:53:49 pandemonium kernel: [ 19.067948] RBP: ffff880ff6357cd8 R08: 000000000001a920 R09: ffff880ff14b9cc0
Mar 24 11:53:49 pandemonium kernel: [ 19.067949] R10: ffffffffc01cc11d R11: 00000000c0186443 R12: ffff880ff4370780
Mar 24 11:53:49 pandemonium kernel: [ 19.067950] R13: ffff880fefd27d90 R14: ffff880ff0ef0700 R15: ffff880ff06f0000
Mar 24 11:53:49 pandemonium kernel: [ 19.067951] FS: 00007fb06dc05980(0000) GS:ffff880fff200000(0000) knlGS:0000000000000000
Mar 24 11:53:49 pandemonium kernel: [ 19.067952] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 24 11:53:49 pandemonium kernel: [ 19.067952] CR2: 0000000000000010 CR3: 0000000ff61e0000 CR4: 00000000001407f0
Mar 24 11:53:49 pandemonium kernel: [ 19.067953] Stack:
Mar 24 11:53:49 pandemonium kernel: [ 19.067954] ffff880ff6357cd8 ffff880fe0b0c400 ffff880fe0b0c000 ffff880fe0b0c800
Mar 24 11:53:49 pandemonium kernel: [ 19.067955] 0000000200000000 ffff880ff2704e00 ffff880ff14b9cc0 0000000000000002
Mar 24 11:53:49 pandemonium kernel: [ 19.067956] 0000000000000000 ffff880fefd27d90 ffff880ff2704e00 ffff880ff4370780
Mar 24 11:53:49 pandemonium kernel: [ 19.067958] Call Trace:
Mar 24 11:53:49 pandemonium kernel: [ 19.067967] [] amdgpu_bo_list_ioctl+0x279/0x3f0 [amdgpu]
Mar 24 11:53:49 pandemonium kernel: [ 19.067976] [] drm_ioctl+0x379/0x6a0 [drm]
Mar 24 11:53:49 pandemonium kernel: [ 19.067983] [] ? amdgpu_bo_list_free+0x90/0x90 [amdgpu]
Mar 24 11:53:49 pandemonium kernel: [ 19.067988] [] amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]
Mar 24 11:53:49 pandemonium kernel: [ 19.067991] [] do_vfs_ioctl+0x2f8/0x510
Mar 24 11:53:49 pandemonium kernel: [ 19.067993] [] ? __do_page_fault+0x1b6/0x450
Mar 24 11:53:49 pandemonium kernel: [ 19.067995] [] SyS_ioctl+0x81/0xa0
Mar 24 11:53:49 pandemonium kernel: [ 19.067996] [] ? do_page_fault+0x30/0x80
Mar 24 11:53:49 pandemonium kernel: [ 19.067998] [] system_call_fastpath+0x16/0x75
Mar 24 11:53:49 pandemonium kernel: [ 19.067999] Code: 00 00 48 8b bb 08 01 00 00 e8 57 67 fe ff 84 c0 0f 85 2f ff ff ff 8b 45 b0 8d 48 01 48 8d 04 80 48 c1 e0 04 48 03 45 c0 48 8b 10 <8b> 52 10 83 fa 04 89 50 30 0f 84 2b 01 00 00 89 50 34 48 89 18
Mar 24 11:53:49 pandemonium kernel: [ 19.068012] RIP [] amdgpu_bo_list_set+0x196/0x3d0 [amdgpu]
Mar 24 11:53:49 pandemonium kernel: [ 19.068019] RSP
Mar 24 11:53:49 pandemonium kernel: [ 19.068019] CR2: 0000000000000010
Mar 24 11:53:49 pandemonium kernel: [ 19.068031] ---[ end trace df8106f7c32c3280 ]---
Mar 24 11:53:49 pandemonium nvidia-persistenced: The daemon no longer has permission to remove its runtime data directory /var/run/nvidia-persistenced
Mar 24 11:53:49 pandemonium nvidia-persistenced: Shutdown (1640)
Mar 24 11:53:54 pandemonium nvidia-persistenced: Started (1703)
Mar 24 11:53:54 pandemonium nvidia-persistenced: Failed to open PID file: File exists
Mar 24 11:53:54 pandemonium nvidia-persistenced: Shutdown (1710)
Mar 24 11:53:54 pandemonium nvidia-persistenced: The daemon no longer has permission to remove its runtime data directory /var/run/nvidia-persistenced
Mar 24 11:53:54 pandemonium nvidia-persistenced: Shutdown (1703)
Mar 24 11:53:54 pandemonium nvidia-persistenced: Started (1741)
Mar 24 11:53:55 pandemonium kernel: [ 25.105818] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
Mar 24 11:53:55 pandemonium kernel: [ 25.105822] IP: [] amdgpu_bo_list_set+0x196/0x3d0 [amdgpu]
Mar 24 11:53:55 pandemonium kernel: [ 25.105838] PGD 0
Mar 24 11:53:55 pandemonium kernel: [ 25.105839] Oops: 0000 [#3] SMP
Mar 24 11:53:55 pandemonium kernel: [ 25.105841] Modules linked in: hid_logitech_hidpp snd_usb_audio hid_logitech_dj snd_usbmidi_lib btusb btbcm btintel hid_generic usbhid hid vmw_vsock_vmci_transport vsock vmw_vmci b43 mac80211 intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp eeepc_wmi asus_wmi sparse_keymap cfg80211 video kvm ssb mxm_wmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel snd_hda_codec_realtek snd_hda_codec_generic aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_hda_codec_hdmi serio_raw snd_hda_intel sb_edac snd_hda_controller nvidia(POE) snd_hda_codec edac_core snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd lpc_ich bcma mei_me soundcore mei shpchp bnep wmi bluetooth tpm_infineon mac_hid binfmt_misc parport_pc ppdev lp parport nls_iso8859_1 amdkfd amd_iommu_v2 amdgpu psmouse amd_gnb_bus i2c_algo_bit ttm drm_kms_helper e1000e ahci libahci drm ptp pps_core
Mar 24 11:53:55 pandemonium kernel: [ 25.105869] CPU: 0 PID: 1738 Comm: Xorg Tainted: P D OE 4.1.0-201603162000-kfd-build-obsidian-82-generic #82
Mar 24 11:53:55 pandemonium kernel: [ 25.105870] Hardware name: iXsystems CSE-COR-AIR540/RAMPAGE V EXTREME, BIOS 1902 12/18/2015
Mar 24 11:53:55 pandemonium kernel: [ 25.105871] task: ffff880034a11e30 ti: ffff8800a6aec000 task.ti: ffff8800a6aec000
Mar 24 11:53:55 pandemonium kernel: [ 25.105872] RIP: 0010:[] [] amdgpu_bo_list_set+0x196/0x3d0 [amdgpu]
Mar 24 11:53:55 pandemonium kernel: [ 25.105880] RSP: 0018:ffff8800a6aefc68 EFLAGS: 00010286
Mar 24 11:53:55 pandemonium kernel: [ 25.105881] RAX: ffff880fefd90f00 RBX: ffff880ff0e2d800 RCX: 0000000000000001
Mar 24 11:53:55 pandemonium kernel: [ 25.105881] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff880fefd90840
Mar 24 11:53:55 pandemonium kernel: [ 25.105882] RBP: ffff8800a6aefcd8 R08: 000000000001a920 R09: ffff880fefd90f00
Mar 24 11:53:55 pandemonium kernel: [ 25.105883] R10: ffffffffc01cc11d R11: 00000000c0186443 R12: ffff880ff47c9120
Mar 24 11:53:55 pandemonium kernel: [ 25.105883] R13: ffff880fefd27e90 R14: ffff880ff0d68100 R15: ffff880ff06f0000
Mar 24 11:53:55 pandemonium kernel: [ 25.105884] FS: 00007fbc416d6980(0000) GS:ffff880fff200000(0000) knlGS:0000000000000000
Mar 24 11:53:55 pandemonium kernel: [ 25.105885] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 24 11:53:55 pandemonium kernel: [ 25.105886] CR2: 0000000000000010 CR3: 0000000ff6ea4000 CR4: 00000000001407f0
Mar 24 11:53:55 pandemonium kernel: [ 25.105886] Stack:
Mar 24 11:53:55 pandemonium kernel: [ 25.105887] ffff8800a6aefcd8 ffff880fe0b0c400 ffff880fe0b0c000 ffff880fe0b0c800
Mar 24 11:53:55 pandemonium kernel: [ 25.105889] 0000000200000000 ffff880ff363b400 ffff880fefd90f00 0000000000000002
Mar 24 11:53:55 pandemonium kernel: [ 25.105890] 0000000000000000 ffff880fefd27e90 ffff880ff363b400 ffff880ff47c9120
Mar 24 11:53:55 pandemonium kernel: [ 25.105891] Call Trace:
Mar 24 11:53:55 pandemonium kernel: [ 25.105900] [] amdgpu_bo_list_ioctl+0x279/0x3f0 [amdgpu]

And so on for all CPUs.

jedwards-AMD commented 8 years ago

Do you have the GTX 980Ti, the R9 Nano and the R9 Fury all installed in the same system? If so, did you install the drivers for the GTX card before or after you installed the ROCK packages?

mattmacy commented 8 years ago

They're all in the same system. I installed the GTX card a couple of weeks ago. The R9s date back to yesterday. I have made no changes to the Nvidia software/hardware configuration in a couple of weeks - i.e. well before doing anything with AMD.

mattmacy commented 8 years ago

Attaching system profile info.

lspci.txt dmesg.txt lsmod.txt

mattmacy commented 8 years ago

The current status, AFAICT, is that the development driver won't work except in console mode, because Xorg's probing causes it to crash. Can anyone give me an ETA on when that will be fixed on GitHub?

Thanks.

aditya4d commented 8 years ago

Hi, you can revert to a previous release commit.

mattmacy commented 8 years ago

It's not clear to me where the problem was introduced. Can you hazard a guess at which changeset to try? The packages were last updated on Jan 26th, which corresponds to what's in master, so I'll need to build my own kernel. That's fine with me, provided the Kconfig is complete.

aditya4d commented 8 years ago

Hi, you can try the master branch package: obsidian 62 (if you badly want to run hcc).

mangupta commented 8 years ago

Closing this since the original issue should not be occurring anymore.

@mattmacy Please try with a clean setup and reopen the issue if you face any problems.