ethereum-mining / ethminer

Ethereum miner with OpenCL, CUDA and stratum support
GNU General Public License v3.0

C++ error? Building on Power 8 Processor/Ubuntu 16.04/2x NV Tesla P100/ #68

Closed zackoch closed 7 years ago

zackoch commented 7 years ago

Hello!

EDIT: I made a stupid. I've solved my own problem - I'm seeing 140 MH/s. F*&K Yeah!

jimmykl commented 7 years ago

@manpowre Any chance this could also improve performance on 1070s?

theobolo commented 7 years ago

@manpowre I tried to find some docs about __dp4a():

https://devtalk.nvidia.com/default/topic/979016/unable-to-reach-full-throughput-on-titan-x-pascal-for-__dp4a/

https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/

http://docs.nvidia.com/cuda/pascal-tuning-guide/index.html#axzz4lqmyrzML

https://gist.github.com/ysh86/5d964cc6b5c6964a8853f14d31cde0bb

zackoch commented 7 years ago

Ugh I wish I was near a computer. What's the dp4a you guys are talking about? Some sort of function within the ethminer code? I'll run the tests you wanted @manpowre, but it won't be until late tomorrow unfortunately, with it being Independence Day in the States.

manpowre commented 7 years ago

Yeah, I read all that, and tried different examples.. I searched the whole night, then got a few hours of sleep. Basically, multiplying unsigned 32 bit integers (uint32_t * uint32_t) can be optimized using dp4a(), which splits each 32 bit unsigned integer into 4 chars and multiplies them pairwise. Doing that, the dp4a() call in CUDA uses the full speed of GDDR5X or HBM memory!!!!

Nvidia did this so that only 8 bit char integers get the full speed, to preserve accuracy.

And I will never overclock while using __dp4a(), damn it's getting hot and drawing watts.

But I need to figure out how to get it to work properly, as I'm not finding any blocks atm.. prob just something I missed.. then we'll see the real performance once it can find blocks.

theobolo commented 7 years ago

@manpowre Alright, so basically the problem today is that ethminer is not fully using all the parallel potential of those GPUs? And indeed not the total memory throughput?

manpowre commented 7 years ago

For normal 32 bit integers, Nvidia is holding back performance on GDDR5X memory and only allows 8 bit chars to get the full speed. So converting a 32 bit unsigned integer to 4x chars, like __dp4a() does, and multiplying them optimizes it. And this is exactly what happens in every single CUDA core millions of times a second.
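
For reference, a minimal sketch of what the signed int __dp4a(a, b, c) overload computes (illustration only; dp4a_reference is just a name I made up):

    // Illustration of the byte-wise dot product __dp4a performs:
    // each 32 bit input is treated as four packed signed 8 bit lanes.
    __device__ int dp4a_reference(int a, int b, int c)
    {
        const char4 a4 = *reinterpret_cast<const char4*>(&a);
        const char4 b4 = *reinterpret_cast<const char4*>(&b);
        return c + a4.x * b4.x + a4.y * b4.y + a4.z * b4.z + a4.w * b4.w;
    }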

theobolo commented 7 years ago

@manpowre Ok, after reading the docs it seems logical ...

manpowre commented 7 years ago

Nvidia actually mentions this in only one place: in the dp4a documentation for parallel execution, it says something about "hi" speed with 8 bit chars, and only on the sm_60 and sm_61 platforms. No wonder we all missed this.

theobolo commented 7 years ago

@manpowre That's exactly my thought, that's really amazing ... but given that K80s run compute 3.7 on CUDA (I'm probably wrong, since I don't fully understand this), can you reach "hi" speed with 8 bit chars even on those "old" cards?

manpowre commented 7 years ago

According to Nvidia's docs for parallel execution, they specifically mention this is only supported on compute 60 and compute 61, probably due to the way the HBM memory of the Tesla P100 and the GDDR5X of the 1080/Ti work.
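
If you want to check what compute capability your card reports, a quick sketch using the CUDA runtime API:

    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // device 0
        // per the docs above, dp4a is mentioned for compute 6.0 / 6.1
        printf("compute %d.%d\n", prop.major, prop.minor);
        return 0;
    }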

theobolo commented 7 years ago

@manpowre Alright ;)

manpowre commented 7 years ago

When I tried to compile this the first time, all the other platforms were on, from compute 20 to 61. I had to turn off all of them except the one I'm using hehe. Once I get this to work, I will turn the platforms back on one by one, going backwards, and see which ones work and which don't.

theobolo commented 7 years ago

@manpowre Alright, looking forward to this then :)

ghost commented 7 years ago

Hi @manpowre, can you please provide us with your code change? I want to investigate it on my 1080 cards.

jimmykl commented 7 years ago

I think the concept has been proven and just has to be fully integrated into the miner. If it nets anything close to those numbers above I'll be selling my 1070s and moving to 1080Tis!

ghost commented 7 years ago

If @manpowre can provide us his code change I can also check the performance on 1060s, 1080s and RX480 cards to have a clear comparison.

theobolo commented 7 years ago

@JHGitty I think he will share it when he gets it working, we are all excited about that, don't worry ;)

jimmykl commented 7 years ago

I'm sure he will when it's working. I'd say a pull request for this would get top priority. I can also test on 1070s. AFAIK this won't help RX cards since it is a CUDA improvement.

manpowre commented 7 years ago

Still working on the implementation. The __dp4a() call is working and compiling, I'm just not getting the data the way I want it.. very little documentation on the function call.. so it's trial and error atm. I do follow some example code I found on GitHub, but that's for a memory test and # of operations per second. If I can get this to hash the right way in the ethash_search() CUDA kernel first, then I can move on to the compute_hash() kernel and optimize that one too.

BTW.. Nvidia documents this as only working on the compute_60 and compute_61 platforms. Not sure which platform the 1070 and 1060 cards are on while writing this, as my head is deep in the code atm.

It could very well be that the same method can be applied on the AMD platform, as this uint32_t to 4x char trick is not new; it's just that Nvidia recently implemented it in their CUDA platform.

Read more here: http://lutgw1.lunet.edu/cuda/pdf/ptx_isa_5.0.pdf

See section: 1.3. PTX ISA Version 5.0

It says: PTX ISA version 5.0 introduces the following new features:

- Support for sm_60, sm_61 target architecture.
- Extends atomic and reduction instructions to perform fp64 add operation.
- Extends atomic and reduction instructions to specify scope modifier.
- A new .common directive to permit linking multiple object files containing declarations of the same symbol with different size.
- A new dp4a instruction which allows 4-way dot product with accumulate operation.
- A new dp2a instruction which allows 2-way dot product with accumulate operation.
- Support for special register %clock_hi

theobolo commented 7 years ago

[screenshots]

jimmykl commented 7 years ago

SM60 – GP100/Pascal P100 – DGX-1 (Generic Pascal)
SM61 – GTX 1080, 1070, 1060, Titan Xp, Tesla P40, Tesla P4

(from http://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/)

jimmykl commented 7 years ago

Oh snap! :-)

DaemonHell commented 7 years ago

The GP102 (Tesla P40 and NVIDIA Titan X), GP104 (Tesla P4), and GP106 GPUs all support instructions that can perform integer dot products on 2- and 4-element 8-bit vectors, with accumulation into a 32-bit integer.

(https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/) (where GP104 = 1070/1080, GP102 = 1080 Ti, GP106 = 1060)

ghost commented 7 years ago

Hi @manpowre, thank you for your fast answer.

You mentioned that you found a way to run the code with the improvement but you are not able to find a block with it. Can you push a PR here to show us your code? Maybe then somebody will find the issue. Together we can work faster to find a solution! :+1:

manpowre commented 7 years ago

Hang on, let's see.. it's only a few lines of code. I'm no CUDA expert.. what I'm not sure about is whether __dp4a() returns a new uint32_t, or whether the first argument is a reference that the new uint32_t is written to. It takes 3 arguments.

So my experiment is in this file: https://github.com/ethereum-mining/ethminer/blob/master/libethash-cuda/ethash_cuda_miner_kernel.cu

Replace line 32, uint32_t const gid = blockIdx.x * blockDim.x + threadIdx.x; with this:

    uint32_t gid2 = 0;
    uint32_t in1 = blockIdx.x;
    uint32_t in2 = blockDim.x;
    uint32_t tx = threadIdx.x;

    //__dp4a(gid2, in1, in2);
    //uint32_t const gid = gid2 + tx;
    uint32_t const gid = __dp4a(gid2, in1, in2) + tx;

and then you need to remove all the old compute platforms in this file: https://github.com/ethereum-mining/ethminer/blob/master/libethash-cuda/CMakeLists.txt

Remove these compute platforms:

    "-gencode arch=compute_30,code=sm_30"
    "-gencode arch=compute_35,code=sm_35"
    "-gencode arch=compute_50,code=sm_50"
    "-gencode arch=compute_52,code=sm_52"
theobolo commented 7 years ago

OK, so it's not working on the K80, since compute 3.5 is not present.

manpowre commented 7 years ago

I guess you can't compile it for 3.5, right?

theobolo commented 7 years ago

@manpowre Yep, you're right, and when I'm launching ethminer on a K80 without compute 30 compiled in, I get an error and ethminer crashes...

I'm looking further into the code to try to understand if we could make it more async ...

manpowre commented 7 years ago

According to this example, it uses assembly optimization on compute 61: https://github.com/GPU-correlators/xGPU/blob/feature/dp4a/src/shared_transfer_4_dp4a.cuh

while on 6.0 it does it by char, adding the multiplications into the first argument c.

So I put this function into my code:

inline __device__ void dp4a(int &c, const int &a, const int &b) {
#if __CUDA_ARCH__ >= 610
  // sm_61+: single dp4a PTX instruction, accumulating into c
  asm("dp4a.s32.s32 %0, %1, %2, %3;" : "+r"(c) : "r"(a), "r"(b), "r"(c));
#else
  // fallback: treat each 32 bit int as four signed 8 bit lanes
  const char4 &a4 = *((const char4*)&a);
  const char4 &b4 = *((const char4*)&b);
  c += a4.x*b4.x;
  c += a4.y*b4.y;
  c += a4.z*b4.z;
  c += a4.w*b4.w;
#endif
}

and then using the dp4a() call instead of __dp4a(), which calls the CUDA intrinsic.

theobolo commented 7 years ago

@manpowre Can't test it :/ sry

jimmykl commented 7 years ago

Does that function mean you can replace the line with:

uint32_t const gid = dp4a(blockIdx.x, blockDim.x, threadIdx.x)

?

manpowre commented 7 years ago

Actually you can; remove the #if statement, the assembly line, and the #else + #endif, so it just does the char4 calculation, and then call dp4a() instead of CUDA's __dp4a().
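
So the wrapper would end up as just this (untested sketch; dp4a here is the helper above, not the intrinsic):

    inline __device__ void dp4a(int &c, const int &a, const int &b) {
        // char4-only version: no __CUDA_ARCH__ branch, compiles for any compute level
        const char4 &a4 = *((const char4*)&a);
        const char4 &b4 = *((const char4*)&b);
        c += a4.x * b4.x;
        c += a4.y * b4.y;
        c += a4.z * b4.z;
        c += a4.w * b4.w;
    }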

Don't forget, I can't find any blocks with this atm.. so it's all very, very experimental.. need to get a decent CUDA expert on this to figure out what I'm doing wrong. Prob something with the variables being referenced.

theobolo commented 7 years ago

@manpowre Oh alright !

ghost commented 7 years ago

@manpowre Maybe we can also use the other new functions of CUDA 8?

To enable you to write your own code using these data types, CUDA provides built-in data types (e.g. half and half2) and intrinsics for FP16 arithmetic (e.g. __hadd(), __hmul(), __hfma2()) and new vector dot products that operate on INT8 and INT16 values (__dp4a(), __dp2a()).

Source: https://devblogs.nvidia.com/parallelforall/cuda-8-features-revealed/

manpowre commented 7 years ago

Well, good point, it might be why I can't find blocks with this. Or it means they only made the __dp4a call for char objects to ensure precision.

theobolo commented 7 years ago

@manpowre Not compiling for the moment:

[screenshot]

And the code:

[screenshot]

jimmykl commented 7 years ago

FWIW I compiled your initial change and for 6x 1070s:

0.11.0rc2: ~186 MH/s
new version: ~450 MH/s

But obviously no solutions found…

ghost commented 7 years ago

@manpowre You use "uint32_t" but the examples I can find only use "int" or "char". Could this be the issue?

jimmykl commented 7 years ago

I found this https://devtalk.nvidia.com/default/topic/979016/b/t/post/5027625/ which seems to reference using __dp4a with unsigned int, which I guess is an unsigned integer?

__global__ void DoDP4A(int *in1d, int *in2d, int *in3d, int *outd) {
    int tx = threadIdx.x;
    int in1 = in1d[tx];
    int in2 = in2d[tx];
    int in3 = in3d[tx];
    int out = 0;  // accumulator must start at zero
    for (int i = 0; i < ITER; i++) {
      out += __dp4a(in1, in2, in3);
    }
    outd[tx] = out;
}
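
Presumably you'd drive it with something like this (my guess at a harness; ITER, N and the memset patterns are made up, not from the devtalk post):

    #define ITER 1024  // must be defined before the kernel above

    int main()
    {
        const int N = 256;                    // one block of 256 threads
        const size_t bytes = N * sizeof(int);
        int *in1d, *in2d, *in3d, *outd;
        cudaMalloc(&in1d, bytes);
        cudaMalloc(&in2d, bytes);
        cudaMalloc(&in3d, bytes);
        cudaMalloc(&outd, bytes);
        cudaMemset(in1d, 1, bytes);           // arbitrary input byte patterns
        cudaMemset(in2d, 2, bytes);
        cudaMemset(in3d, 0, bytes);
        DoDP4A<<<1, N>>>(in1d, in2d, in3d, outd);
        cudaDeviceSynchronize();
        cudaFree(in1d); cudaFree(in2d); cudaFree(in3d); cudaFree(outd);
        return 0;
    }
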
ghost commented 7 years ago

@jimmykl Your code says "int" and not "uint" - or what am I misunderstanding?

manpowre commented 7 years ago

@theobolo

    int in1 = blockIdx.x;
    int in2 = blockDim.x;
    int gid = threadIdx.x;
    dp4a(gid, in1, in2);

Use this instead of the line with the dp4a call. You can see that c is a reference, and the multiplications are added into that reference, so gid is left holding the result.

jimmykl commented 7 years ago

@JHGitty I think you're correct but this is all way above my level of knowledge. I'm just trying to find possible similar code in the hope it might help :-)

manpowre commented 7 years ago

@jimmykl yepp, tried that too.

ghost commented 7 years ago

According to this page, any usage of unsigned int should be no problem for __dp4a: https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/

For convenience, there are both int and char4 versions of the DP4A intrinsics, in both signed and unsigned flavors:

__device__ int __dp4a(int srcA, int srcB, int c);
__device__ int __dp4a(char4 srcA, char4 srcB, int c);
__device__ unsigned int __dp4a(unsigned int srcA, unsigned int srcB, unsigned int c);
__device__ unsigned int __dp4a(uchar4 srcA, uchar4 srcB, unsigned int c);

And by the way - here again:

Keep in mind that DP2A and DP4A are available on Tesla, GeForce, and Quadro accelerators based on GP102, GP104, and GP106 GPUs, but not on the Tesla P100 (based on the GP100 GPU).
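
For instance, with the unsigned overload it would look something like this (quick sketch; the kernel name dp4a_u32 is made up, and it needs -gencode arch=compute_61,code=sm_61):

    __global__ void dp4a_u32(const unsigned int *a, const unsigned int *b,
                             unsigned int *out)
    {
        // launch with enough threads to cover the arrays
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        // byte-wise dot product of a[i] and b[i], accumulated into 0u
        out[i] = __dp4a(a[i], b[i], 0u);
    }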

manpowre commented 7 years ago

@JHGitty Nice find, where did you find that?

Sorry for moving from uint32_t to int. I've just been trying to get this to a level where I can control the code, since I haven't found great documentation from Nvidia on the __dp4a() function call.

Ohh.. damn, it's further down in the document.. ahh.. great. I missed the bottom part.. gotta read it.

ghost commented 7 years ago

@manpowre My post contains a link to the Nvidia devblogs post - just search for the code on that page. I found it with Google.

theobolo commented 7 years ago

@manpowre This code compiles:

[screenshot]

But there is no difference in the hash rate, and I'm not able to find solutions either 👍

[screenshot]

ghost commented 7 years ago

Hi @theobolo, which card did you use? The DP4A improvement will never work on K80 cards.

It will only be supported on these cards:

SM60 – GP100/Pascal P100 – DGX-1 (Generic Pascal)
SM61 – GTX 1080, 1070, 1060, Titan Xp, Tesla P40, Tesla P4

theobolo commented 7 years ago

@JHGitty So that's why xD :)

manpowre commented 7 years ago

@JHGitty It's because the 8 bit char path isn't optimized for the K80's type of memory. I'm running exactly the same code as you now on my 1080 Tis, and I dropped just a bit in performance, by 10 MH/s (still not able to find a block).

I don't think we'll see 240 MH/s in the final code, but maybe 1.5x the performance for 1080 Ti and 1080 cards, since they have GDDR5X memory.

At least with that dp4a() function outside of the CUDA intrinsic, any C++/CUDA developer can help out here.