isaacleeai commented 6 years ago

Hello,

I am trying to optimize nv_wavenet_persistence.cu to be able to run on GTX 1080.

I am using the following arguments: `-l 20 -r 64 -s 256 -a 256 -b 1 -d 2 -m 3`

First, in order to run persistence on the GTX 1080 at all with the above parameters, I needed to decrease "maxrregcount" from 128 to 64. This caused a very large performance drop, because much of the data that used to be held in registers is now spilled to local memory.

Second, in order to reduce register spilling, I want to eliminate the register variables (i.e. variables with "_reg" at the end). As I was looking through the code for ways to eliminate them, I realized that most of them are used for the negative-zero check.

So I tried a very simple and blunt approach: I replaced this code in the nv_wavenet_persistent_GEMM_MxK function

```
bool valid = false;
while (!valid) {
    valid = true;
    #pragma unroll
    for (int b=0; b<N_UNROLL; b++) {
        act_in_reg[b] = loadVolatile(act_in,(batch_offset+b)*ldb + row);
    }
    #pragma unroll
    for (int b=0; b<N_UNROLL; b++) {
        valid &= !isNegativeZero(act_in_reg[b]);
    }
}
#pragma unroll
// fill [batch_size x R] shared vector
for (int b=0; b<N_UNROLL; b++) {
    act_in_sh[b][row] = act_in_reg[b];
}
```

with

`act_in_sh[0][row] = loadVolatile(act_in,(batch_offset)*ldb + row);`

As I expected, this does not work, but I am not sure why.

My questions are:

1. What does "isNegativeZero" do? Because it is written in assembly, I am having a hard time reading it.
2. Is there a way to minimize the use of registers so that 64 registers per thread would still give sufficient performance?
3. What is the purpose of the check performed in the larger piece of code above?
4. Why initialize the activations with -0.f instead of 0.f?

I would really appreciate your help.

Thanks!

BrianPharris commented 6 years ago

isNegativeZero is used for synchronization between thread blocks -- the data is initialized to negative zero as an invalid value -- the consumer block polls on that data to become valid.
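To make the "negative zero as an invalid value" idea concrete, here is a minimal host-side C++ sketch of such a check. It is not the kernel's actual helper (which is written in inline assembly); `is_negative_zero` and `check_negative_zero` are hypothetical names used only for illustration. The point it shows is that `-0.0f == 0.0f` is true under IEEE 754, so the marker can only be detected by inspecting the bit pattern. Presumably that is also why -0.f rather than 0.f is used as the marker: +0.0 remains available as an ordinary data value.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical host-side stand-in for the kernel's isNegativeZero helper.
// IEEE 754 single-precision negative zero is exactly the bit pattern 0x80000000.
inline bool is_negative_zero(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));  // well-defined way to read the raw bits
    return bits == 0x80000000u;            // sign bit set, exponent and mantissa zero
}

// Small self-check showing why a bitwise test is needed.
inline void check_negative_zero() {
    assert(-0.0f == 0.0f);            // ordinary comparison cannot tell them apart
    assert(is_negative_zero(-0.0f));  // marker meaning "not written yet"
    assert(!is_negative_zero(0.0f));  // positive zero is treated as real data
}
```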

So by changing the code to no longer poll but to read the data unconditionally, you're no longer waiting for the producer blocks to have provided the necessary data.
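The difference between polling and reading unconditionally can be replayed on a single host thread. The sketch below is a simplified analogue under stated assumptions, not the kernel code: `try_load` and `replay` are hypothetical names, the volatile read stands in for `loadVolatile`, and the "producer" is just an assignment at a fixed point in the sequence rather than another thread block.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr std::uint32_t kNegZeroBits = 0x80000000u;  // bit pattern of -0.0f

inline std::uint32_t to_bits(float x) {
    std::uint32_t b;
    std::memcpy(&b, &x, sizeof(b));
    return b;
}

// One poll attempt, mirroring a single pass of the while (!valid) loop:
// read the slot through a volatile pointer and report whether it is valid yet.
inline bool try_load(const volatile float* slot, float* out) {
    float v = *slot;
    if (to_bits(v) == kNegZeroBits) return false;  // producer has not written yet
    *out = v;
    return true;
}

// Replays both read strategies before and after the producer writes.
inline void replay() {
    float slot = -0.0f;   // buffer initialized to the "invalid" marker
    float got = 123.0f;   // arbitrary initial value to show it is left untouched

    // Unconditional read before the producer runs: picks up the marker itself.
    float early = slot;
    assert(to_bits(early) == kNegZeroBits);

    // Polling read before the producer runs: reports "not ready".
    assert(!try_load(&slot, &got));
    assert(got == 123.0f);

    slot = 0.75f;         // producer publishes its result

    // Polling read afterwards succeeds and returns the real value.
    assert(try_load(&slot, &got));
    assert(got == 0.75f);
}
```

In the real kernel the consumer keeps retrying until every element it needs is valid. The single-line replacement from the question corresponds to the `early` read: it takes whatever is in the buffer at that moment, whether or not the producer has written it.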

BrianPharris commented 5 years ago

manyblock mode should now allow you to achieve near-persistent performance without the requirement to fit on chip. Closing this issue.

NVIDIA / nv-wavenet

Attempting to optimize persistence for GTX 1080 #59
