albertozeni / LOGAN

LOGAN: High-Performance Multi-GPU X-Drop Long-Read Alignment.

results not the same on different GPUs #2

Closed zjin-lcf closed 4 years ago

zjin-lcf commented 4 years ago

Though I am not familiar with your algorithm or application, I have a few questions for you. Thanks for your answers.

I ran the demo on a K80 GPU (using one GPU) and on a P100 GPU, with OMP_NUM_THREADS set to 1, and printed the results. The results differ between the two GPUs, and it is not clear whether that is expected.

k80: 212 32 678 1442 188 81 126 26 3513 1365 19 17 19 15 17 17 17 15 19 17 17 15 19 17 17 19 80 206 1387

p100: 192 122 812 5669 552 179 2205 943 4107 3710 3103 1458 334 202 2353 253 53 176 560 931 406 1339 5603 41 549 1413 147 3467 4887
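To see how far the two runs diverge, the printed scores can be compared position by position. The following is a minimal sketch; the helper name `mismatched_indices` is hypothetical and not part of LOGAN, and the two vectors are copied from the output quoted above.

```cpp
#include <cstddef>
#include <vector>

// Scores printed by the demo on each GPU, copied from the report above.
const std::vector<int> k80 = {
    212, 32, 678, 1442, 188, 81, 126, 26, 3513, 1365,
    19, 17, 19, 15, 17, 17, 17, 15, 19, 17,
    17, 15, 19, 17, 17, 19, 80, 206, 1387};
const std::vector<int> p100 = {
    192, 122, 812, 5669, 552, 179, 2205, 943, 4107, 3710,
    3103, 1458, 334, 202, 2353, 253, 53, 176, 560, 931,
    406, 1339, 5603, 41, 549, 1413, 147, 3467, 4887};

// Hypothetical helper: returns the indices at which the two result
// vectors disagree. Only the overlapping prefix is compared.
std::vector<std::size_t> mismatched_indices(const std::vector<int>& a,
                                            const std::vector<int>& b) {
    std::vector<std::size_t> out;
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) out.push_back(i);
    }
    return out;
}
```

For these two runs, `mismatched_indices(k80, p100)` reports all 29 positions, so the divergence is not limited to a few alignments.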

Could you also comment on the code changes required to run correctly on a GPU whose warp size is not 32, for example 64 or 16?

albertozeni commented 4 years ago

I tested the same code on a K80 on the NIMBIX platform and the results do match (they are supposed to match regardless of the GPU). Did you change something in the code? To support a different warp size, the code that computes the maximum after the alignment of a single antidiagonal has been computed needs to be changed accordingly. That said, as far as I know the warp size is 32 for all compute capabilities, at least on NVIDIA GPUs.

zjin-lcf commented 4 years ago

For the K80, the host is an IBM POWER9 machine. I compiled the demo right after cloning the repository, with no changes. Did you get the same result as below?

nvcc -O3 -maxrregcount=32 -std=c++11 -Xcompiler -fopenmp demo.cu -o demo -D ADAPTABLE

$ ./demo inputs_demo/example.txt 17 21 1
///////////////////////////////////////////////1
Input setup time: 0.00528446
Input transfer and malloc time: 0.00364679
Compute time: 0.409596
212 32 678 1442 188 81 126 26 3513 1365 19 17 19 15 17 17 17 15 19 17 17 15 19 17 17 19 80 206 1387

albertozeni commented 4 years ago

The host machine that I used did not have an IBM CPU, although I did use IBM CPUs for some tests with the V100 and it always worked. Regardless of the CPU/GPU, I got the following results: 192 122 812 5669 552 179 2205 943 4107 3710 3103 1458 334 202 2353 253 53 176 560 931 406 1339 5603 41 549 1413 147 3467 4887. Also, setting the number of threads to 1 does not affect the GPU computation; it only slows down the part that prepares the input for the GPU. Where are you printing those results? Which version of CUDA are you using?

zjin-lcf commented 4 years ago

So this matches the P100 result. It is difficult for you to debug without access to the target machine. Anyway, I will try another IBM CPU. Thanks.

The nvcc version is 10.1.243.

In demo.cu (lines 134-139):

extendSeedL(seeds_b, EXTEND_BOTHL, target_b, query_b, penalties, xdrop, ksize, res, numAlignmentsLocal, ngpus, maxt);
for (int j = 0; j < numAlignmentsLocal; j++)
    printf("%d ", res[j]);
printf("\n");

free(res);

zjin-lcf commented 4 years ago

I changed the warp size in the following way, with WARP_DIM set to 64. Do you think this is all I need to change to run on an AMD GPU with a warp size of 64? Thanks.

__inline__ __device__ void warpReduce(volatile short *input, int myTId){
    input[myTId] = (input[myTId] > input[myTId + 64]) ? input[myTId] : input[myTId + 64];
    input[myTId] = (input[myTId] > input[myTId + 32]) ? input[myTId] : input[myTId + 32];
    input[myTId] = (input[myTId] > input[myTId + 16]) ? input[myTId] : input[myTId + 16];
    input[myTId] = (input[myTId] > input[myTId + 8]) ? input[myTId] : input[myTId + 8];
    input[myTId] = (input[myTId] > input[myTId + 4]) ? input[myTId] : input[myTId + 4];
    input[myTId] = (input[myTId] > input[myTId + 2]) ? input[myTId] : input[myTId + 2];
    input[myTId] = (input[myTId] > input[myTId + 1]) ? input[myTId] : input[myTId + 1];
}

__inline__ __device__ short reduce_max(short *input, int dim, int n_threads){
    unsigned int myTId = threadIdx.x;
    if(dim>WARP_DIM){
        for(int i = n_threads/2; i >WARP_DIM; i>>=1){
            if(myTId < i){
                input[myTId] = (input[myTId] > input[myTId + i]) ? input[myTId] : input[myTId + i];
            }__syncthreads();
        }//__syncthreads();
    }
    if(myTId<WARP_DIM)
        warpReduce(input, myTId);
    __syncthreads();
    return input[0];
}
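
One way to check the index arithmetic of this change without a GPU is a host-side model of the same two-phase reduction, parameterised on the warp width. The sketch below is an assumption-laden illustration: `reduce_max_model` and its `warp_dim` parameter are hypothetical names, the thread count is assumed to be a power of two, and the buffer is padded so that reads at `tid + offset` stay in range (something the kernel's shared-memory sizing would have to guarantee). It models only the order of reads and writes, not synchronization on real hardware.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Host-side model of the block-wide max reduction shown above.
// `warp_dim` stands in for WARP_DIM; both sizes must be powers of two.
short reduce_max_model(std::vector<short> buf, int n_threads, int warp_dim) {
    assert(n_threads > 0 && (n_threads & (n_threads - 1)) == 0);
    assert(warp_dim > 0 && (warp_dim & (warp_dim - 1)) == 0);

    // Only the first n_threads entries take part; pad with the smallest
    // short so that the second phase can read up to index 2 * warp_dim - 1.
    const short lowest = std::numeric_limits<short>::min();
    buf.resize(static_cast<std::size_t>(n_threads), lowest);
    if (buf.size() < static_cast<std::size_t>(2 * warp_dim)) {
        buf.resize(static_cast<std::size_t>(2 * warp_dim), lowest);
    }

    // Phase 1: block-level steps, separated by __syncthreads() in the kernel.
    for (int i = n_threads / 2; i > warp_dim; i >>= 1) {
        for (int tid = 0; tid < i; ++tid) {
            buf[tid] = std::max(buf[tid], buf[tid + i]);
        }
    }

    // Phase 2: the unrolled warpReduce chain, offsets warp_dim down to 1.
    // Ascending tid order reads each source before it is overwritten,
    // which matches lockstep execution of a single warp.
    for (int offset = warp_dim; offset >= 1; offset >>= 1) {
        for (int tid = 0; tid < warp_dim; ++tid) {
            buf[tid] = std::max(buf[tid], buf[tid + offset]);
        }
    }
    return buf[0];
}
```

Calling `reduce_max_model(values, 256, 32)` and `reduce_max_model(values, 256, 64)` on the same input should both return the maximum of `values`. If they agree, the loop bound `i > WARP_DIM` and the first `warpReduce` offset are consistent with each other for that warp width.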

albertozeni commented 4 years ago

The version of CUDA that you are using should be fine. Again, I don't have that exact combination of board and CPU, but I tried the same board and the same CPU in other combinations and it worked for me (K80 with Intel, V100 with Intel, V100 with IBM). I cannot guarantee that the code will work on an AMD board just by changing the warp size, as I don't have access to an AMD board. By the way, if you want to continue the discussion and explain in more detail what you are trying to achieve, I suggest contacting me via email: alberto.zeni@mail.polimi.it