fireice-uk / xmr-stak

Free Monero RandomX Miner and unified CryptoNight miner
GNU General Public License v3.0

Heavy performance drop with ROCM compiled kernels for Vega and Polaris #1973

Open nioroso-x3 opened 6 years ago

nioroso-x3 commented 6 years ago

On my desktop I have a Vega 56 with a hybrid setup: ROCm user space plus the 18.30 dkms driver. The CNv7 kernel hashed fine at 1700+ H/s, but with CNv8 performance dropped to just 800 H/s (400 per thread). On my ASUS GL702ZC (integrated RX 580 4GB), performance dropped from 800 to 600 H/s.

Everything works fine when using kernels compiled with amdgpu 18.10 on top of the 18.30 userspace (1600+ H/s for CNv8 on Vega, 770 for Polaris).

rumatadest commented 6 years ago

Same issue on a Vega 56 with ROCm 1.9.1 and xmr-stak 2.5.1: the hashrate dropped to 400 H/s per thread. I changed the config to one thread, intensity 2496, worksize 16, and the hashrate went up to a maximum of 1000 H/s. It looks like xmr-stak can't see the Vega's full memory size.

Spudz76 commented 6 years ago

@nioroso-x3 Use worksize 16. Don't use anything newer than 18.1, or if you do, use ROCm 1.9+ only.

@rumatadest Split your intensity back into two threads, the worksize is what brings back speed. >2000 intensity in a single thread is insane IMO and I'm surprised you even see a hashrate.
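
To illustrate the split described above, here is a minimal Python sketch that divides one large intensity across two thread entries for the same GPU. The function name `split_intensity` and the `index` key are assumptions made for illustration; only `intensity` and `worksize` are named in this thread, and rounding down to a multiple of the worksize is a tuning convention assumed here, not something the thread confirms.

```python
import json


def split_intensity(total_intensity, threads=2, worksize=16, gpu_index=0):
    """Return one config entry per thread, sharing the total intensity evenly.

    Hypothetical helper: the key names mirror the settings discussed in the
    thread (intensity, worksize); "index" is an assumed GPU selector.
    """
    per_thread = total_intensity // threads
    # Assumption: keep each thread's intensity a multiple of the worksize.
    per_thread -= per_thread % worksize
    return [
        {"index": gpu_index, "intensity": per_thread, "worksize": worksize}
        for _ in range(threads)
    ]


if __name__ == "__main__":
    # 2496 in one thread becomes two threads of 1248 each.
    print(json.dumps(split_intensity(2496), indent=2))
```

The printed entries would still need to be adapted by hand to whatever thread configuration format your miner version expects.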

CN in general is not very memory-size heavy (but it is memory-bandwidth heavy); you only need 2GB regardless of how much horsepower your GPU core has. Small 2MB blocksize. You can double mem utilization by switching to a CN-Heavy (4MB) coin, haha.
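
The scratchpad arithmetic above can be sketched in Python. This assumes each unit of intensity keeps its own scratchpad resident in VRAM (2 MiB for CryptoNight, 4 MiB for CN-Heavy, the sizes quoted above); the function name and algorithm labels are hypothetical, and real usage will be somewhat higher because of other buffers.

```python
MIB = 1024 * 1024

# Scratchpad size per hash, as quoted in the comment above.
SCRATCHPAD_BYTES = {
    "cryptonight": 2 * MIB,
    "cryptonight-heavy": 4 * MIB,
}


def scratchpad_mib(intensity, threads=1, algo="cryptonight"):
    """Estimate VRAM (in MiB) used by scratchpads alone.

    Assumption: one scratchpad per unit of intensity, per thread.
    """
    return intensity * threads * SCRATCHPAD_BYTES[algo] // MIB


if __name__ == "__main__":
    print(scratchpad_mib(2496))                            # 4992
    print(scratchpad_mib(1000, threads=2))                 # 4000
    print(scratchpad_mib(2496, algo="cryptonight-heavy"))  # 9984
```

Under these assumptions, a single thread at intensity 2496 would reserve 4992 MiB of scratchpads, and the same intensity on a 4 MiB algorithm would reserve twice that.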

Unfortunately nobody makes a 6400-bit-wide superhighway to 2GB of VRAM, which is what would rock for us, so you end up having to waste 6GB of silicon to get the newer (usually meaning wider) bandwidth designs. Or as high a mem clock as possible.

nioroso-x3 commented 6 years ago

I'm already on the latest ROCm. Worksize 16 didn't work, though the hashrate increased slightly to 536 per thread. I'm using 1536 intensity.

TheGoddessInari commented 6 years ago

This bug, as with #1964, is weird. On Vega FE, it doesn't matter if you use ROCm 1.9.1 or drop in the 18.10 libamdocl64.so, or indeed, if you use 18.9.1 or 18.5.1 on Windows. So good vs. bad OpenCL compiler/runtime doesn't seem to matter at all.

@nioroso-x3 Is there anything specific you're doing other than replacing libamdocl64.so to make it work faster with v8?

nioroso-x3 commented 6 years ago

I don't replace any libraries. What I do is run xmr-stak with the 18.10 runtime, let it compile the kernel binaries, and then run again with 18.30 runtime, but before that, I rename the 18.10 binaries so xmr-stak loads them, instead of the 18.30 compiled ones.

With rocm there was no need for this, CNv7 worked as is.

TheGoddessInari commented 6 years ago

I see. Prior to the fix for invalid hashes, I had to use the 18.10 OpenCL lib with the rest of the ROCM 1.9.x stack. I'm not willing to install the full amdgpu-pro drivers when the free stack works so well.

I wish I knew why things are so different for different people. I've been seeing a lot of people with Vega FE reporting the same thing: they just can't get better than 50% regardless of which OpenCL compiler/driver version is in use. Some Vega 56/64 cards can apparently work at full speed anyway, but not all. :/

nioroso-x3 commented 6 years ago

There's no need to keep anything installed. After installing the pro drivers, just tar the amdgpu-pro folder. You can then uninstall the driver. To switch between libraries, just set LD_LIBRARY_PATH to the amdgpu-pro/lib/x86_64-linux-gnu directory of the version you want to use.
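
The LD_LIBRARY_PATH switch described above could be scripted roughly as follows. This is a sketch only: `env_with_runtime` is a hypothetical helper, and the directory in the usage comment is a placeholder for wherever you unpacked the tarred amdgpu-pro folder.

```python
import os
import subprocess


def env_with_runtime(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir first in LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend so the chosen runtime is found before any system-wide one.
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + existing if existing else "")
    return env


def run_with_runtime(command, lib_dir):
    """Launch a command (e.g. the miner) against the chosen OpenCL runtime."""
    return subprocess.run(command, env=env_with_runtime(lib_dir))


# Example (placeholder paths, adjust to your own layout):
# run_with_runtime(["./xmr-stak"],
#                  "/path/to/unpacked/amdgpu-pro/lib/x86_64-linux-gnu")
```

Building a fresh environment per launch keeps the shell itself untouched, so switching versions is just a matter of passing a different directory.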

qolii commented 6 years ago

I think I'm seeing exactly this with my Vega FE (i.e. 1700H/s -> 900 since the fork). Any idea what it is?

rumatadest commented 6 years ago

The Windows version is working fine with Adrenalin 18.5.x and 18.6.1. Waiting for a new ROCm or amdgpu-pro release.

Josef3110 commented 6 years ago

I'm using ROCm 1.9.1, and changing unroll to 1 helped a lot with both a Vega 64 and an RX 580. This brought the RX 580 almost back to pre-fork hash rates. The Vega is still slower, but I got an additional 200 H/s. This also points to a memory footprint problem.

psychocrypt commented 6 years ago

For Vega, also try work_size 16. One community member gave me access to a ROCm system with a Vega, and I saw 2x1000 H/s on it. He used the latest ROCm version from the ROCm dev branch.

Josef3110 commented 6 years ago

I'm already using work_size 32, but I don't overclock. Still, 2000 H/s seems a far cry from the 1500 I'm getting now.

psychocrypt commented 6 years ago

ohh the 2k was for monero. I missed that this is a POW heavy issue, sry

qolii commented 6 years ago

Ach, as did I! I misinterpreted "heavy" as "significant" :/

@psychocrypt and @jf3110, I will try your suggestions about work_size and unroll, thanks.

Spudz76 commented 6 years ago

Yes, always use CN-Heavy not just the word Heavy

minzak commented 5 years ago

Does anyone have a working config for the Vega FE? (For now I get 1200 H/s.) I previously worked with an RX 580, but how do I configure the FE, and how can I speed it up?

zviratko commented 5 years ago

I'm seeing ~1600 H/s with cryptonight-heavy on a Vega 64 (no tuning at all), but only ~1100 H/s for Monero. Same config and everything. I don't think that makes sense?

Gentoo, kernel 5.1.9, rocm 2.5.0 with in-kernel amdgpu driver.

psychocrypt commented 5 years ago

You cannot compare the heavy and cryptonight-r (Monero) PoW. They are different; it is a comparison between apples and oranges. Use dual threads for Monero. The auto config should be OK, but you must tune the interleave parameter for ROCm.

zviratko commented 5 years ago

Ah, I thought it only differed in the scratchpad size, so I thought that, if anything, it should be slower.

with single thread I see 1200H/s (actually higher or at least more consistent than with dual threads), with dual thread it drops to half. I previously tried on Ubuntu 18.04 with amdgpu-pro drivers and was able to get over 2000H/s (after tuning the dpm a bit, probably not worth it as the power usage skyrocketed), but I think with single thread I saw the same 1200H/s here. So to me it looks like dual thread is broken somehow.

How should tuning the interleave help? Isn't the idea only that they overlap? (which they do?)

Thanks for any tips

psychocrypt commented 5 years ago

If you get better performance with the native driver, use it. As far as I know, you should not use dkms with ROCm if you want full performance.

zviratko commented 5 years ago

I should have been more specific: I got 2000 H/s on Ubuntu with the dkms driver and the whole amdgpu-pro stack, and that was with the xmrig miner. Now on Gentoo the best performance I get is ~1200 H/s with xmr-stak, but all miners perform poorly. This is with the 5.1.9 in-kernel amdgpu driver and rocm-opencl stack, so it's possible the fault is in here. I was not able to get amdgpu-pro-opencl to work on Gentoo...

psychocrypt commented 5 years ago

If you like to mine, you are best off using an OS that the driver supports. I also used Gentoo in the past, but I think it is not the best for mining. If it is your main system and you only mine from time to time, then you may need to solve the driver issues.

minzak commented 5 years ago

This is with the 5.1.9 in-kernel amdgpu driver and rocm-opencl stack

Hm, I think you must not use them at the same time: only rocm-opencl or amdgpu, and either of them without dkms. I use Debian and ROCm and everything works; amdgpu is not needed, because the newest version works only with Ubuntu. The latest version can no longer be used on Debian: too much is hard-coded for Ubuntu, even in the ELF files :(

Gentoo is very cool, but I still can't get Gentoo + systemd + KDE to boot on my Dell Latitude E7470 :( Too many dependencies and no manual for that config.