psychocrypt commented 6 years ago

All results create on a system with the ROCm OpenCL stack are invalid.

After investing a few days into debugging I would say the compiler is creating fully broken code. The code runs well on AMD system with 17.X driver, NVIDIA (OpenCl) and Intel.

I added a lot of debug output and by changing a view code lines I get my example running but in real tests the results are still wrong. It looks like there are many race conditions in the code which can sometimes solve by moving code lines around. It sounds like https://github.com/RadeonOpenCompute/ROCm/issues/500

The used system is a CentOs 7

rpm --query centos-release
centos-release-7-4.1708.el7.centos.x86_64

=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (number of timestamp)
Machine Model:           LARGE                              
System Endianness:       LITTLE  
# ... removed
*******                  
Agent 3                  
*******                  
  Name:                    gfx803                             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128                                
  Queue Min Size:          4096                               
  Queue Max Size:          131072                             
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16KB                               
  Chip ID:                 29440                              
  Cacheline Size:          64                                 
  Max Clock Frequency (MHz):975                                
  BDFID:                   45312                              
  Compute Unit:            64                                 
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64                                 
  Workgroup Max Size:      1024                               
  Workgroup Max Size Per Dimension:
    Dim[0]:                  67109888                           
    Dim[1]:                  2969568256                         
    Dim[2]:                  0                                  
  Grid Max Size:           4294967295                         
  Waves Per CU:            40                                 
  Max Work-item Per CU:    2560                               
  Grid Max Size per Dimension:
    Dim[0]:                  4294967295                         
    Dim[1]:                  4294967295                         
    Dim[2]:                  4294967295                         
  Max number Of fbarriers Per Workgroup:32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    4194304KB                          
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64KB                               
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Acessible by all:        FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx803          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Dimension: 
        Dim[0]:                  67109888                           
        Dim[1]:                  1024                               
        Dim[2]:                  16777217                           
      Workgroup Max Size:      1024                               
      Grid Max Dimension:      
        x                        4294967295                         
        y                        4294967295                         
        z                        4294967295                         
      Grid Max Size:           4294967295                         
      FBarrier Max Size:       32

# some information from clinfo
Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 2.1 AMD-APP.internal (2617.0)
  Platform Name:                 AMD Accelerated Parallel Processing
  Platform Vendor:               Advanced Micro Devices, Inc.
  Platform Extensions:               cl_khr_icd cl_amd_object_metadata cl_amd_event_callback 
# ...
  Platform ID:                   0x2aaaafd048d0
  Name:                      gfx803
  Vendor:                    Advanced Micro Devices, Inc.
  Device OpenCL C version:           OpenCL C 2.0 
  Driver version:                2617.0 (HSA1.1,LC)
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 1.2

$ which clang++
/opt/rocm/hcc/bin/clang++
$ clang-7.0 --version
HCC clang version 7.0.0 (ssh://gerritgit/compute/ec/hcc-tot/clang 86791fc4961dc8ffde77bde20d7dfa5e5cbeff5e) (ssh://gerritgit/compute/ec/hcc-tot/llvm 0ccef158132e1222d549edf2da33d4bc0be6c2d1) (based on HCC 1.2.18184-74f5fa9-86791fc-0ccef15 )
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/hcc/bin

psychocrypt commented 6 years ago

@gstoner could you somehow help with this issue?

psychocrypt commented 6 years ago

I tested https://github.com/RadeonOpenCompute/ROCm/issues/500 on the rocm stack I use and it is also wrong: testA is wrong and testB is correct

psychocrypt commented 6 years ago

I manipulated the rocm hello world https://github.com/RadeonOpenCompute/ROCm/blame/master/README.md#L204 and changed the kernel:

/* Compiler options:
-c -emit-llvm -target amdgcn-amd-amdhsa-amdgizcl -x cl -cl-kernel-arg-info  -cl-std=CL1.2 -O3 -mcpu=fiji -mllvm -amdgpu-early-inline-all -mllvm -amdgpu-prelink -D__OPENCL_VERSION__=120 -D__IMAGE_SUPPORT__=1 -Xclang -cl-ext=+cl_khr_fp64,+cl_khr_global_int32_base_atomics,+cl_khr_global_int32_extended_atomics,+cl_khr_local_int32_base_atomics,+cl_khr_local_int32_extended_atomics,+cl_khr_int64_base_atomics,+cl_khr_int64_extended_atomics,+cl_khr_3d_image_writes,+cl_khr_byte_addressable_store,+cl_khr_fp16,+cl_khr_gl_sharing,+cl_amd_device_attribute_query,+cl_amd_media_ops,+cl_amd_media_ops2,+cl_khr_subgroups,+cl_amd_copy_buffer_p2p,+cl_amd_assembly_program -include opencl-c.h
*/

__kernel void hello_kernel(
    __global  int *a,
    __global  ulong *b,
    __global uint *result
)
{
    int gid = get_global_id(0);

    if (get_global_id(0) == 0) {
        uint n = a[1];
        a[1] = 0;  // broken version result will be zero
        result[0] += n;
        //a[1] = 0;  // correct version
    }
}

broken isa

// Disassembly:
        s_load_dword s2, s[4:5], 0x4                               // 000000001100: C0020082 00000004
        s_load_dwordx2 s[0:1], s[6:7], 0x18                        // 000000001108: C0060003 00000018
        s_waitcnt lgkmcnt(0)                                       // 000000001110: BF8C007F
        s_and_b32 s2, s2, 0xffff                                   // 000000001114: 8602FF02 0000FFFF
        s_mul_i32 s8, s8, s2                                       // 00000000111C: 92080208
        v_add_u32_e32 v0, vcc, s8, v0                              // 000000001120: 32000008
        v_sub_u32_e32 v0, vcc, 0, v0                               // 000000001124: 34000080
        v_subb_u32_e64 v1, s[2:3], 0, 0, vcc                       // 000000001128: D11D0201 01A90080
        v_cmp_eq_u64_e32 vcc, s[0:1], v[0:1]                       // 000000001130: 7DD40000
        s_and_saveexec_b64 s[0:1], vcc                             // 000000001134: BE80206A
        s_cbranch_execz BB0_2                                      // 000000001138: BF880016
BB0_1:
        s_load_dwordx2 s[0:1], s[6:7], 0x0                         // 00000000113C: C0060003 00000000
        s_load_dwordx2 s[2:3], s[6:7], 0x10                        // 000000001144: C0060083 00000010
        v_mov_b32_e32 v2, 0                                        // 00000000114C: 7E040280
        s_waitcnt lgkmcnt(0)                                       // 000000001150: BF8C007F
        s_load_dword s4, s[0:1], 0x4                               // 000000001154: C0020100 00000004
        s_add_u32 s0, s0, 4                                        // 00000000115C: 80008400
        s_addc_u32 s1, s1, 0                                       // 000000001160: 82018001
        v_mov_b32_e32 v0, s0                                       // 000000001164: 7E000200
        v_mov_b32_e32 v1, s1                                       // 000000001168: 7E020201
        flat_store_dword v[0:1], v2                                // 00000000116C: DC700000 00000200
        v_mov_b32_e32 v0, s2                                       // 000000001174: 7E000202
        v_mov_b32_e32 v1, s3                                       // 000000001178: 7E020203
        flat_load_dword v2, v[0:1]                                 // 00000000117C: DC500000 02000000
        s_waitcnt vmcnt(0) lgkmcnt(0)                              // 000000001184: BF8C0070
        v_add_u32_e32 v2, vcc, s4, v2                              // 000000001188: 32040404
        flat_store_dword v[0:1], v2                                // 00000000118C: DC700000 00000200
BB0_2:
        s_endpgm

correct code

// Disassembly:
        s_load_dword s2, s[4:5], 0x4                               // 000000001100: C0020082 00000004
        s_load_dwordx2 s[0:1], s[6:7], 0x18                        // 000000001108: C0060003 00000018
        s_waitcnt lgkmcnt(0)                                       // 000000001110: BF8C007F
        s_and_b32 s2, s2, 0xffff                                   // 000000001114: 8602FF02 0000FFFF
        s_mul_i32 s8, s8, s2                                       // 00000000111C: 92080208
        v_add_u32_e32 v0, vcc, s8, v0                              // 000000001120: 32000008
        v_sub_u32_e32 v0, vcc, 0, v0                               // 000000001124: 34000080
        v_subb_u32_e64 v1, s[2:3], 0, 0, vcc                       // 000000001128: D11D0201 01A90080
        v_cmp_eq_u64_e32 vcc, s[0:1], v[0:1]                       // 000000001130: 7DD40000
        s_and_saveexec_b64 s[0:1], vcc                             // 000000001134: BE80206A
        s_cbranch_execz BB0_2                                      // 000000001138: BF880017
BB0_1:
        s_load_dwordx2 s[0:1], s[6:7], 0x0                         // 00000000113C: C0060003 00000000
        s_load_dwordx2 s[2:3], s[6:7], 0x10                        // 000000001144: C0060083 00000010
        v_mov_b32_e32 v4, 0                                        // 00000000114C: 7E080280
        s_waitcnt lgkmcnt(0)                                       // 000000001150: BF8C007F
        s_load_dword s4, s[2:3], 0x0                               // 000000001154: C0020101 00000000
        s_load_dword s5, s[0:1], 0x4                               // 00000000115C: C0020140 00000004
        s_add_u32 s0, s0, 4                                        // 000000001164: 80008400
        s_addc_u32 s1, s1, 0                                       // 000000001168: 82018001
        v_mov_b32_e32 v3, s1                                       // 00000000116C: 7E060201
        v_mov_b32_e32 v0, s2                                       // 000000001170: 7E000202
        v_mov_b32_e32 v2, s0                                       // 000000001174: 7E040200
        s_waitcnt lgkmcnt(0)                                       // 000000001178: BF8C007F
        s_add_i32 s0, s4, s5                                       // 00000000117C: 81000504
        v_mov_b32_e32 v1, s3                                       // 000000001180: 7E020203
        v_mov_b32_e32 v5, s0                                       // 000000001184: 7E0A0200
        flat_store_dword v[0:1], v5                                // 000000001188: DC700000 00000500
        flat_store_dword v[2:3], v4                                // 000000001190: DC700000 00000402
BB0_2:
        s_endpgm

gstoner commented 6 years ago

Sent a note to compiler team

psychocrypt commented 6 years ago

@gstoner thx hope it get solved soon because I have the same issues on linux with the 18.3 driver. The 18.3 driver is using HSA and I think rocm and the backend used in 18.3 has the some issue.

gstoner commented 6 years ago

@psychocrypt Here an overview of the two OpenCL foundation currently on Linux,

the AMDGPU pro driver is using the LLVM to HSAIL compiler with Shader Compiler Backend, the same SC backed we use with Proprietary OpenGL, Also it is using the same userland backend we use on Windows, which is called PAL stands for Platform Abstraction Layer, Vulkan uses this. It also does Kernel mode dispatch for Queues, via PM4 Packets. SDMA is via the Kernal.

ROCm is OpenCL Runtime API mapping to the ROCr/Thunk which then talk IOCTL interface calls to base kernel driver. ROCr uses User Mode Queues based on AQL and User Mode SDMA which help with smaller packet transfer latency. ROCm stack supports larger Shared Virtual Memory Regions and also can support much larger memory allocation of kernels. AMDGPU LLVM compiler also supports Inline ASM and has Native Asmebler and Disassembler which historical compiler does not.
We also put out some new tools rocProfiler and rocTracer for performance analysis ( trace and Perf Counters) New cool thing announced is Anaconda & Numba is running on ROCm.

This is a good place to get more info on the ROCm compiler https://llvm.org/docs/AMDGPUUsage.html

The community has been great at helping out to make ROCm better. I know sometimes it feels slow but you are going to see the whole host new things coming. We are really working hard with the Linux Driver team to improve build development test flow and improving test infrastructure. The big one is we are finally scaling up the ROCm engineering team finally.

psychocrypt commented 6 years ago

Thank you for the overview about the software stack. Is my basic esumtion correct that the amd-pro driver in 18.X using the same compiler. Because I see with the same OpenCL test code wring code on my AMD RX570 on Linux. For that I will create a full example in the next days and will post the code in an separate issue.

gstoner commented 6 years ago

18.20 and newer AMDGPUpro driver use the using LLVM to HSAIL compiler

gstoner commented 6 years ago

only rocm uses the compiler with reference I sent you

BlackwidowZA commented 6 years ago

Any update on when to expect a fix?

gstoner commented 6 years ago

I working with the team to see if they load this into the next release. They just did patch also for the DKMS KCL 1.8.3. Also, I see you're seeing this on our AMDGPUpro driver as well which using a different compiler.

psychocrypt commented 6 years ago

Yes AMD pro driver is also showing the behaviour. Feel free to test also xmr-stak to see if a provided patch solves it for complex code.

psychocrypt commented 6 years ago

@gstoner Do you know when we can assume a new amd pro driver for linux. I have now the problem that I need to test new algorithms but I am not able to install a AMD driver because only the broken 18.X is available. Maybe you have some information. I would like do avoid to downgrade my linux.

arkdlite commented 6 years ago

@psychocrypt Thanks for everything you do. Can we hope to bug fix? I have installed Linux 4.10 with AMDGPU-PRO 17.40 and xmr-stak works fine on this configuration. On Linux kernels younger than 4.10 it was many errors while DKMS installation of AMDGPU-PRO 17.40. On Linux 4.17/15 with AMDGPU-PRO 18.20/30 it was many compute errors in miner. Good luck in fixing this bug. Thank you!

gstoner commented 6 years ago

AMDGPUpro 17.40, OpenCL was ROCm based with the SC Compiler, 18.20 is when PAL was introduced for OpenCL. for AMDGPUpro driver. We are focusing on getting the ROCm driver and Opensource LLVM compiler clean XMR. I will ping the PAL team they have an issue with either the runtime or shader ( SC) compiler.

arkdlite commented 6 years ago

@gstoner OK, thanks. So, I will use ROCM driver when you fix this problem because I need new Linux kernel 4.17. Good luck!

Macribit commented 6 years ago

just in case someone else is digging for a solution here, you still need to delete the opencl cache when downgrading.

rm -dfr /home/user/.openclcache/

nioroso-x3 commented 5 years ago

I just installed ROCM 1.9 for a vega 56 on my personal desktop, xmr-stak is now working and generating valid shares. I look forward to ditching Windows on my mining rigs.

psychocrypt commented 5 years ago

I s ROCm 1.9 now available? For windows and linux?

calvintam236 commented 5 years ago

@psychocrypt Yes, run apt update && apt upgrade to get it.

psychocrypt commented 5 years ago

thx for the information. I will try it.

Mafaka8 commented 5 years ago

@nioroso-x3 what kinda hashrate are you getting on your 56 with xmr-stak and rocm 1.9 ?

nioroso-x3 commented 5 years ago

@Mafaka8 Still only 1000 hash per card max, more threads just divide the speed. Also tried with separate instances, and the same problem occurs. I hope they fix that bug soon.

Mafaka8 commented 5 years ago

its too bad ROCm 1.9 AMDGPU driver doesnt let us modify pp_mclk_od :(

gstoner commented 5 years ago

We just update our c based SMI library https://github.com/RadeonOpenCompute/rocm_smi_lib , The XMR team may want to look at it allow application to get lower latancy access to the GPU features

IronDusts commented 5 years ago

Do you have update on this topic?

On my Centos 7.5 I spent half day installing all kind of different kernels 3 and 4 including kernels from Centos 7.2-7.5 and with not a single one of them I managed to get the amdgpu-pro 17.40 or 17.50 drivers working. They all ended with dkms errors. The only driver I managed to make work was 18.30 (haven't try other 18s since they don't work) Can somebody advice exact version of rhel 7 kernel that will make 17.40/50 work? Or is there any hope on the PAL driver to work soon?

pacf531 commented 5 years ago

Just reposting some stuff I wrote in the ROCm issue tracker and adding some information in case it helps anyone and performance figures.

I installed rocm 1.9 on Linux kernel 4.18.7(now using 4.18.8 and still working) with without rock-dkms, which means I am using the upstream Linux vanilla interface (not sure if it makes a difference in performance). You can install every single other package included with ROCm 1.9 release except for rock-dkms and I find that using upstream Linux interfaces usually is better for stability purposes. At stock settings (mostly except for a maxed power limit of 220W on each OS) on a Vega Frontier Edition, I did a benchmark run of the compiled latest dev source code of xmr-stak for Windows and Linux and got the following performance figures.

Windows 10 with 18.5.1 drivers [2018-09-14 18:49:58] : Benchmark Thread 0 amd: 923.1 H/S [2018-09-14 18:49:58] : Benchmark Thread 1 amd: 939.6 H/S [2018-09-14 18:49:58] : Benchmark Total: 1862.7 H/S

Ubuntu 18.04 with Linux kernel 4.18.7 and ROCm 1.9 [2018-09-14 18:54:42] : Benchmark Thread 0 amd: 615.2 H/S [2018-09-14 18:54:42] : Benchmark Thread 1 amd: 600.1 H/S [2018-09-14 18:54:42] : Benchmark Total: 1215.3 H/S

Not sure what the difference is considering the OpenCL compiler should be roughly the same between both drivers. Maybe it is something with the backend on Windows vs Linux.

So yeah, all improvements considered (I believe previously with 1.8, the max was around 1000 H/s with custom overclocks?), there is still a 33% difference between both OSes. I haven't done extensive testing yet and what worries me is mainly stability where according to the thread, it works at stock but if I use a custom powerplay table on Linux, would it throw invalid shares? I do also want to see how far it gets with overclocking since on Windows now I can hash somewhere in the 2100-2200 H/s range if I don't care for power at all.

gstoner commented 5 years ago

@pacf531 Could you run AMDGPUpro 18.30 for the same test. It uses the Same OpenCL runtime and Userland as Windows, but a different kernel driver. They both use same base Compiler based HSAIL/SC ( Shader Compiler) as windows.

ROCm is a newer compiler that is based on LLVM. and different base runtime technology.

Mafaka8 commented 5 years ago

On Ubuntu 18.04 with Kernel 4.15.0-34-generic 18.30 results in a good 30% invalid shares when using xmr-stak. The odd thing is that this does not occur when using other mining applications.

nioroso-x3 commented 5 years ago

I left it running for some hours, and 22% of the results are invalids. How can it be that hard for AMD to get Vega working right on Linux? They will never compete with cuda and nvidia this way.

IronDusts commented 5 years ago

Are you mining with CPU too? Is it possible that AMD is giving 0% and these 22% are from your CPU?

Mafaka8 commented 5 years ago

My tests are with no cpu mining. Just GPU. and I end up with 30% invalid shares for all AMD cards that I have tried testing, RX 500 series, RX 400 series, and Vega FEs.

I did manage to get things running nicely for a vega 56 on one of my rigs. Its getting 1675 h/s with zero invalid results.

Still not 2000 BUT its much much better than the 1200 h/s that I was getting prior to 18.30. the problem with 18.30 is that I cannot overclock the mem with pp_mclk_od.

That machine with the vega 56 is mining with CPU as well and no invlid results. The one difference is that I installed amdgpu 18.30 --headless --opencl=pal,legacy and am using the integrated intel card for display purposes.

The machine with the FEs that gets all the invalid results does not have an integrated GPU.

chenhan12344 commented 5 years ago

I managed to make my vegas works well. I have a vega 56 and vega FE, they both got around 2000/s using 18.30 driver with zero invalid results. Here's what I've done to make them work.

First compile the xmr-stak. Then uninstall amdgpu-pro driver if your driver version is newer than 17.50. Also follow the guide in https://github.com/RadeonOpenCompute/ROCm to remove ROCm from your machine if you have installed any version of the ROCm software. Make sure you have removed the openCL binary compiled by the miner previously. My openCL binary is at /root/.openclcache.

Second, download and install the amdgpu-pro 17.50 driver --opencl=rocm. After reboot, use clinfo to check that the driver is correctly installed. Run xmr-stak. At this time, the miner should work and will not generate any invalid result. However, I've only got around 1200h/s for each of my vega even with the mem overclocked to 1100MHz. And here's I've done to improve the hashrate.

Third, uninstall the amdgpu-pro 17.50 driver but do not remove the openCL binary compiled by the miner! Reboot and install the the amdgpu-pro 18.30 driver --opencl=pal --headless. Reboot after the installation.

Now run the miner and check the hashrate. I am able to get around 2000h/s per card at this time with zero invalid result. I also checked from my pool and the pool did show that my vegas do got 2000h/s.

I am not sure this will work or not on your guys' machine, just try it.

psychocrypt commented 5 years ago

This workaround will not work any longer with the next release. we brand the binaries with driver information because an driver update crashes very often older binaries ( not in your case) you workaround the broken opencl compiler shipped with the driver 18.30 What I can see within your post is that the speed issue with vega is solved in the driver 18.30. But since the compiler is broken is is useless. Maybe we can install the driver 18.3 and change the compiler binaries to the old version.

HarlemSquirrel commented 5 years ago

Any ideas on how to get ROCm working on ArchLinux? https://github.com/RadeonOpenCompute/ROCm/issues/294#issuecomment-421534060

Spudz76 commented 5 years ago

@HarlemSquirrel No personal experience however, that error is what I get on other distros (Ubuntu, Debian) when I am not root and/or do not have my device nodes (udev rules) set to be group video and writable by anyone in such group.

But even then sometimes it doesn't want to work. So I just run as root and never looked back. So try sudo clinfo maybe it won't choke.

Other thing that breaks it is having more than just AMD in your platforms, if it hits Intel iGPU OpenCL ICD first it will choke and die (at least it does on Windows, I haven't tried Linux + iGPU enabled). So check /etc/OpenCL/vendors/ for ICD files and if there is anything but amdocl64.icd in there, destroy them

Install strace and see what it does just before exploding (strace shows what libraries and actual full path of files it is opening and everything). Just prepend like strace clinfo

Uninstall anything Mesa or Clover they crash things generally also.

Mafaka8 commented 5 years ago

@psychocrypt All I know is I hope you guys come up with a solution soon :)

@chenhan12344 I got things to work with zero invalid results as well. I followed your instructions except I used 18.10 and then moved to 18.30 without removing the binaries from .openclcache

I am now getting 1880 h/s from my FEs without any issues. 1880 h/s most likely because I underclock the core to #4 to run lower power consumption

1880 h/s at around 125 watts :) :)

plavirudar commented 5 years ago

@Mafaka8 I guess if there is no solution, you can always patch out the line that forces a recompile for each new xmr-stak version and merge the rest of the code.

pacf531 commented 5 years ago

@gstoner I ran the pro drivers and it does seem it is on par with Windows.

[2018-09-22 14:48:10] : Benchmark Thread 0 amd: 933.8 H/S [2018-09-22 14:48:10] : Benchmark Thread 1 amd: 915.1 H/S [2018-09-22 14:48:10] : Benchmark Total: 1848.9 H/S

That being said, I did try to see if I could reproduce the stability issues that everyone has been having here by running for at least 15 minutes to get readings which xmr-stak records. However, instead of invalid results everyone else is getting, I just get a straight up display crash around minute 6-10 with the following similar messages in my journalctl logs:


Sep 22 15:34:45 ubuntu kernel: gmc_v9_0_process_interrupt: 36 callbacks suppressed
Sep 22 15:34:45 ubuntu kernel: amdgpu 0000:21:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:6 pasid:32788, for process xmr-stak pid 3252 thread xmr-stak pid 3254)
Sep 22 15:34:45 ubuntu kernel: amdgpu 0000:21:00.0:   in page starting at address 0x00000004049ff000 from 27
Sep 22 15:34:45 ubuntu kernel: amdgpu 0000:21:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00601030

And gnome then shortly crashes afterwards, hence the display crash. Spent several hours using different kernels and modifying pp_od_voltage to underclock to no avail and I uninstalled the pro drivers after that. I then went to reinstall ROCm but decided to use rocm-dkms this time and not the vanilla kernel interfaces to see if it made a difference, and apparently, it did, actually outperforming both the Linux Pro driver and the Windows one by a tiny amount.

[2018-09-22 22:20:09] : Benchmark Thread 0 amd: 945.1 H/S [2018-09-22 22:20:09] : Benchmark Thread 1 amd: 929.0 H/S [2018-09-22 22:20:09] : Benchmark Total: 1874.2 H/S

Running a longer 15 minute benchmark, I get the following figures with CPU mining included, but since the post getting long, I am only going to include GPU hash results only and results here to keep it short. I do want to mention that automatic fan control is broken on Linux for now, so I am running everything with 100% fanspeed to keep temperatures down so on Linux, you don't get crazy color corruption when the display crashes due to the fan not turning on/ramping up which happened to me when doing nothing to the fans. Everything is running on a Vega Frontier Edition.

Windows 10 Adrenalin 18.5.1 stock clocks with 100% fans

HASHRATE REPORT - AMD | ID | 10s | 60s | 15m | ID | 10s | 60s | 15m | | 0 | 943.3 | 942.8 | 942.4 | 1 | 941.9 | 942.2 | 942.3 | Totals (AMD): 1885.2 1885.0 1884.7 H/s

RESULT REPORT Difficulty : 57180 Good results : 49 / 49 (100.0 %) Avg result time : 19.4 sec

Windows 10 Adrenalin 18.5.1 Custom Powerplay with 100% fans

HASHRATE REPORT - AMD | ID | 10s | 60s | 15m | ID | 10s | 60s | 15m | | 0 | 1079.2 | 1081.7 | 1084.8 | 1 | 1089.9 | 1086.8 | 1084.6 | Totals (AMD): 2169.2 2168.5 2169.4 H/s

RESULT REPORT Difficulty : 71430 Good results : 53 / 53 (100.0 %) Avg result time : 17.8 sec

Linux ROCm 1.9 with amd-staging-drm-next (~Linux 4.19 rc1+) stock with Wayland desktop and custom Powerplay with 100% fans

HASHRATE REPORT - AMD | ID | 10s | 60s | 15m | ID | 10s | 60s | 15m | | 0 | 944.1 | 945.0 | 944.1 | 1 | 947.6 | 945.3 | 944.1 | Totals (AMD): 1891.7 1890.4 1888.2 H/s

RESULT REPORT Difficulty : 43380 Good results : 49 / 57 (86.0 %) Avg result time : 19.9 sec

Error details: | Count | Error text | Last seen | | 8 | AMD Invalid Result GPU ID 0 | 2018-09-23 17:01:39 |

Linux ROCm 1.9 with amd-staging-drm-next (~Linux 4.19 rc1+) with Wayland desktop and custom Powerplay with 100% fans

HASHRATE REPORT - AMD | ID | 10s | 60s | 15m | ID | 10s | 60s | 15m | | 0 | 1089.9 | 1093.1 | 1094.0 | 1 | 1096.0 | 1093.4 | 1094.4 | Totals (AMD): 2185.9 2186.5 2188.4 H/s

Totals (ALL): 2185.9 2186.5 2188.4 H/s Highest: 2195.1 H/s

RESULT REPORT Difficulty : 60960 Good results : 39 / 46 (84.8 %) Avg result time : 24.2 sec

Error details: | Count | Error text | Last seen | | 7 | AMD Invalid Result GPU ID 0 | 2018-09-23 16:31:19 |

So the conclusion is that while ROCm has on par if not better hashing than Windows now, the invalid results being roughly 15% of the total pulls the hashrate down lower than Windows. But I would say the setup in getting ROCm up for mining purposes on Linux is already really good and probably smoother than Windows at this point. Unfortunately, that also means this issue should stay open for the time being while the invalid hashes issue gets resolved. Good work done though overall though so thank you and the team at AMD for getting this far. I am only one person though so I do wonder if anyone else has the same results as I do, if they can replicate my setup.

shimmervoid commented 5 years ago

@pacf531 I can vouch for this on Rocm 1.9 with the 4.18.7 Kernel.

HASHRATE REPORT - AMD | ID | 10s | 60s | 15m | ID | 10s | 60s | 15m | | 0 | 940.0 | 939.8 | (na) | 1 | 938.9 | 940.0 | (na) | | 2 | 939.4 | 939.0 | (na) | 3 | 938.2 | 939.0 | (na) | | 4 | 939.7 | 940.0 | (na) | 5 | 940.0 | 940.0 | (na) | | 6 | 944.2 | 943.4 | (na) | 7 | 942.7 | 943.8 | (na) | Totals (AMD): 7523.2 7525.0 0.0 H/s

Totals (ALL): 8819.1 8821.0 0.0 H/s Highest: 8823.9 H/s

RESULT REPORT Difficulty : 198006 Good results : 17 / 20 (85.0 %) Avg result time : 19.5 sec Pool-side hashes : 3366102

Top 10 best results found: | 0 | 738191530 | 1 | 1559310 | | 2 | 1087749 | 3 | 1061258 | | 4 | 540149 | 5 | 455522 | | 6 | 390028 | 7 | 337138 | | 8 | 316607 | 9 | 311471 |

having a fix for the invalid and some overdrive functions in smi, we would be golden.

Fingers Crossed!

gstoner commented 5 years ago

Did you see we released ROCm SMI LIb https://github.com/RadeonOpenCompute/rocm_smi_lib/blob/master/README.md

C++ Library interface for ROCm-SMI to allow you to monitor/trace GPU system attributes

For developer Familiar with NVML here is our key to understanding our API

NVML API	Type	Sub-type	Rocm-smi equivalent API
Power State
GetEnforcedPowerLimit	Device	Power	rsmi_dev_power_cap_set()
GetPowerManagementLimit	Device	Power	rsmi_dev_power_cap_get()
GetPowerManagementLimitConstraints	Device	Power	rsmi_dev_power_cap_range_get()
GetPowerUsage	Device	Power	rsmi_dev_power_ave_get()
SetPowerManagementLimit	Device	Power	rsmi_dev_power_cap_set()
Performance State
GetApplicationsClock	Device	Clocks	rsmi_dev_get_gpu_sys_freq() rsmi_dev_get_mem_sys_freq()
GetAutoBoostedClocksEnabled	Device	Clocks	rsmi_dev_perf_level_get()
GetClock	Device	Clocks	rsmi_dev_get_gpu_sys_freq(). rsmi_dev_get_mem_sys_freq()
GetClockInfo	Device	Clocks	rsmi_dev_get_gpu_sys_freq(), rsmi_dev_get_mem_sys_freq()
GetMaxClockInfo	Device	Clocks	rsmi_dev_get_gpu_sys_freq(), rsmi_dev_get_mem_sys_freq()
GetSupportedGraphicsClocks	Device	Clocks	rsmi_dev_get_gpu_sys_freq(), rsmi_dev_get_mem_sys_freq()
GetSupportedMemoryClocks	Device	Clocks	rsmi_dev_get_gpu_sys_freq() rsmi_dev_get_mem_sys_freq()
SetAutoBoostedClocksEnabled	Device	Clocks	rsmi_dev_perf_level_set()
GetFanSpeed	Device	Physcial	rsmi_dev_mon_get_fan_speed() rsmi_dev_mon_get_max_fan_speed() rsmi_dev_fan_rpms_get()
GetTemperature	Device	Physcial	rsmi_dev_temp_metric_get()
GetTemperatureThreshold.	Device	Physcial	rsmi_dev_temp_metric_get()
Initialization and Cleanup
Init
InitWithFlags	Mngt	Admin	rsmi_init()
Shutdown	Mngt	Admin	rsmi_shut_down()
Error Reporting
ErrorString	ErrRpt	Err. Rep	rsmi_status_string()
Unit Queries
UnitGetCount	Unit	Admin	rsmi_num_monitor_devices()
UnitGetFanSpeedInfo	Unit	Physcial	rsmi_dev_fan_rpms_get(), rsmi_dev_van_speed_get()
UnitGetTemperature	. Device	Physical	rsmi_dev_temp_metric_get()
UnitGetUnitInfo	Unit	ID.	rsmi_dev_id_get() (device ID)
Unit Commands
DeviceSetApplicationsClocks	Device	Clocks	rsmi_dev_gpu_clk_freq_set()
DeviceSetComputeMode	Device	Performance	rsmi_dev_power_profile_set() rsmi_dev_gpu_clk_freq_set() rsmi_dev_perf_level_set()
DeviceSetGpuOperationMode	Device	Power	rsmi_dev_power_profile_set()
DeviceSetPowerManagementLimit	Device	Power	rsmi_dev_power_cap_set() rsmi_dev_power_profile_set()
ROCm Only Clock Management API
	Device	Clocks	rsmi_dev_perf_level_get()
	Device	Clocks	rsmi_dev_perf_level_set()
	Device.	Clocks	rsmi_dev_overdrive_level_get()
	Device.	Clocks	rsmi_dev_overdrive_level_set()

ddobreff commented 5 years ago

Hi, At least rocm is now working with mainline kernel, but I can't seem to make it work with xmr-stak as expected on Vega64. Installed latest rocm-opencl but hashrate is same as before if not worse...windows has 2000+ while linux 1400 tops. Anyone willing to guide me/us on how to achieve same speeds on linux? Thanks.

enerc commented 5 years ago

As far as the 20% invalid results, that's what I got with a Vega 56:

libamdocl64.so (shipped with Rocm 1.9) -> KO
libamdocl64.so (shipped with 18.30) -> KO
clang/LLVM 4.0 (shipped with Rocm 1.9) -> KO
clang/LLVM 8.0 (compiled from Rocm LLVM/CLANG github) -> KO + takes a couple of minutes to compile the OpenCL source code extracted from a modified xmr-stak 2.4.7 to output the kernel source. clang 8.0 with -O3 has same speed as clang 4.0/libamdocl64.so. I compiled with /opt/rocm/bin/clang-ocl and with 8.0 files in /opt/rocm/opencl/bin/x86_64.
libamdocl64.so (shipped with 18.10 + rocm 1.9) -> OK!

What I did is just to extract all x86_64.rpm files from 18.10 and then move opt/amdgpu and opt/amdgpu-pro folders to /opt (no apt-get, no dpkg - just file copy). My conclusion on invalid results is that it comes from the compiler and not from the kernel stack.

As for speed, I tried about 10 kernels from 4.15.0-kfd (the one from Rocm github) to 4.18.8 to amd-drm-next-gen (4.19.0-rc1+) with/without rocks-dkms/rocm-dkms and the result is I always get around 1200 H/s at stock config.

As for stability:

18.30 pro dkms gives the famous VMC page fault immediately when starting an opencl kernel.
vanillia and 4.15.0-kfd kernels are rock stable.

So yes, someone explaining how to get the hashrate boost is welcomed!

ddobreff commented 5 years ago

I had a quick glimpse of 2120H/s, but then it started dropping until it died in infamous VM faults. Failed even on stock, so I guess the compiler is not the problem here since kernel was compiled with 18.10 already.

gstoner commented 5 years ago

We released low level debugger to see if it is application level or lower in the base driver

Get Outlook for iOShttps://aka.ms/o0ukef

From: Dobromir Dobrev notifications@github.com Sent: Friday, September 28, 2018 6:10 PM To: fireice-uk/xmr-stak Cc: Gregory Stoner; Mention Subject: Re: [fireice-uk/xmr-stak] ROCm OpenCL INVALID results (#1797)

I had a quick glimpse of 2120H/s, but then it started dropping until it died it infamous VM faults. Failed even on stock, so I guess the compiler is not the problem here since kernel was compiled with 18.10 already.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/fireice-uk/xmr-stak/issues/1797#issuecomment-425590573, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AD8DuQFtHk2bFmdJzAf9xfxQBim7TlqCks5ufqxcgaJpZM4WFJHq.

psychocrypt commented 5 years ago

Solved with ROCM 1.9 and #1866

fireice-uk / xmr-stak

ROCm OpenCL INVALID results #1797

broken isa

correct code

Totals (ALL): 8819.1 8821.0 0.0 H/s Highest: 8823.9 H/s