N3PDF / mcgpu

Proof of concept of GPU integration

Vegas: a different approach #19

Closed scarlehoff closed 4 years ago

scarlehoff commented 5 years ago

The only way of taking advantage of the parallelization of the FPGA is to generate the random numbers outside the kernel, and to do the accumulation outside as well, since those are the only read-write arrays in the algorithm.

It seems to compile; it is obviously very slow on both GPU and CPU, as it would do everything in one core.

One interesting aspect is that using dataflow forces you to use different memory banks.

One problem remains in this code: the dataflow is not perfect, because a few arrays are laid out BUFFER-DIMS and others DIMS-BUFFER, so the code has to be changed a bit to ensure that all loops iterate over the same variable (which should be the buffer, imo).

scarrazza commented 5 years ago

Is it normal that GPU does not work?

scarlehoff commented 5 years ago

Yes, it only works for CPU and FPGA. For the GPU I don't remember exactly what broke it: either the out-of-order execution or the enqueue task instead of the NDRange.

scarlehoff commented 5 years ago

With this commit the generation of random points is only a bit slower than the kernel itself, which is a good milestone for a kernel as simple as this one: 10^6 events take 4.15 seconds.

scarrazza commented 5 years ago

But is this faster than CPU?

scarlehoff commented 5 years ago

It should be*. At this point it is an unfair comparison, because if the generation of random points takes that long, it means most of the FPGA's time is spent copying the arrays.

*"It should be" means: it is not, but without a more complicated kernel the comparison does not make sense.

I'm going to do a few more tests adding extra compute units (CUs) and using HBM/DRM memories, and then I'll add some more complicated integrands.

scarlehoff commented 5 years ago

This last version does 40000 events in 0.4 s on the FPGA and more than 90 s on the CPU.

I must say, however, that if the previous version was unfair to the FPGA, this one is unfair to the CPU, as we are letting it run a kernel highly optimized for a different device. At the end of the day the comparison will need to be done between a CPU-Vegas, a GPU-Vegas and an FPGA-Vegas.

stdout

CPU

 make run-cpu EVENTS=40000
./cpp-opencl 40000 4 kernel.cl 2
Found 3 platforms:
[0] NVIDIA CUDA
[1] Xilinx
[2] Intel(R) CPU Runtime for OpenCL(TM) Applications
Selected: Intel(R) CPU Runtime for OpenCL(TM) Applications
Reading kernel.cl
For iteration 1, result: 1.01371 +- 0.00000
For iteration 2, result: 1.01296 +- 0.00189
For iteration 3, result: 1.01798 +- 0.00434
For iteration 4, result: 1.01956 +- 0.00666
For iteration 5, result: 1.01595 +- 0.00908
Final result: 1.01371 +- 0.00000
It took: 93.4663620 seconds

FPGA

make run TARGET=hw EVENTS=40000
./driver 40000 4 ./xclbin/bitstream.hw.xilinx_u280_xdma_201910_1.xclbin 1
Found 3 platforms:
[0] NVIDIA CUDA
[1] Xilinx
[2] Intel(R) CPU Runtime for OpenCL(TM) Applications
Selected: Xilinx
Loading ./xclbin/bitstream.hw.xilinx_u280_xdma_201910_1.xclbin
For iteration 1, result: 0.88713 +- 0.06086
For iteration 2, result: 1.02310 +- 0.07007
For iteration 3, result: 0.92468 +- 0.05810
For iteration 4, result: 1.03526 +- 0.05480
For iteration 5, result: 1.00662 +- 0.04703
Final result: 0.97856 +- 0.02536
It took: 0.3769200 seconds

scarlehoff commented 5 years ago

In this commit (i.e., with a non-trivial loop) the FPGA is definitely slower. I am a bit sad; I am not sure in which dimension (buffer size, size of the integrand) the FPGA would ever win as an accelerator. It seems to be good only for very simple things that can be 100% compiled to hardware...

"FPGAs as accelerators of complicated calculations are expensive crap, buy GPUs" is a result as well, but it is not the result we would like to have...

scarrazza commented 5 years ago

Hmm, could be, but then I am curious about what they claim as "acceleration", for example here:

https://github.com/Xilinx/SDAccel_Examples/tree/master/acceleration/kmeans

scarlehoff commented 5 years ago

The biggest unrolling they do in that code (as well as the maximum reduction) is over a loop of size 2; everything else is completely parallel.

scarrazza commented 5 years ago

Ok, but then a MC code without loops should take some advantage, no?

scarlehoff commented 5 years ago

> Ok, but then a MC code without loops should take some advantage, no?

I am not sure.

scarlehoff commented 4 years ago

Ok, let's go back to this.

I will implement some of the things I learned in The Hague. First I'll switch to the C++ kernels, then I'll move to customized precision, and only then will we do benchmarks.