N3PDF / mcgpu

Proof of concept of GPU integration

Vector add learning #18

scarlehoff closed this issue 4 years ago

scarlehoff commented 5 years ago

I've modified our host.cc to be similar to the one from the Xilinx repository, and I'm using our own makefile, where we can run different kernels by doing

make run TARGET=sw_emu KERNEL=vector_add_xilinx_repo

For now there are two kernels: the simple one, which is like the Intel one, and the one from the Xilinx repo.

I'm also using a DOUBLE macro, defined at the moment as double, so that we can check whether the datatype actually makes a difference.
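For illustration, a minimal sketch of what that macro switch can look like in an OpenCL kernel (the kernel name and the fp64 pragma handling are placeholders, not necessarily the actual repo code):

// Some devices need the fp64 extension enabled to use double.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

#ifndef DOUBLE
#define DOUBLE double // swap for float to check whether the datatype matters
#endif

__kernel void vector_add(__global const DOUBLE *a, __global const DOUBLE *b,
                         __global DOUBLE *c, const int n) {
    const int i = get_global_id(0);
    if (i < n)
        c[i] = a[i] + b[i];
}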

(if you already started feel free to force push)

scarlehoff commented 5 years ago

First benchmark, GPU for 10^8 numbers: [image]

CPU for 10^8 numbers: [image]

scarrazza commented 5 years ago

Maybe worth reading this: https://iris.polito.it/retrieve/handle/11583/2669854/193982/07859319.pdf

scarrazza commented 5 years ago

In a few words: if they managed to perform MC with financial data, for sure we can do something similar for integrals.

scarrazza commented 5 years ago

And there is code: https://github.com/HLSpolito !!

scarrazza commented 5 years ago

Maybe we can understand something by comparing: https://github.com/HLSpolito/KNN/blob/master/knn_fpga/nn_fpga.cl with https://github.com/HLSpolito/KNN/blob/master/knn_cpu/nn_cpu.cl

scarlehoff commented 5 years ago

I'll have a look at it tomorrow.

In the meantime I have results for the vector addition (with the simple kernel). All of them with 10^7 events (the FPGA fails at 10^8 because of memory issues???).

FPGA: 0.8s, CPU: 0.5s, GPU: 0.3s

But, interestingly enough, if I just do 10^6 events: FPGA: 0.16s, CPU: 0.46s, GPU: 0.27s

All of this for the simple kernel.

scarrazza commented 5 years ago

umm, strange that reducing size makes it faster... but interesting.

scarlehoff commented 5 years ago

> umm, strange that reducing size makes it faster... but interesting.

There is a baseline of 0.13s even if you run it with just 1 event, so going from 1 to 10^6 events basically introduces an overhead of only 0.03s (probably because of memory transfer). It is similar for the GPU, but the baseline is higher (0.3s).

Might it be that the trick is just running the loop many times with different data, while making sure we don't go above the threshold? It sounds cheap to me...

The same happens with the integrator btw, but the threshold is at 10^4.
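As a hedged sketch of that chunking idea (the sizes and the stand-in compute function are assumptions, just to show the host-side loop structure):

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for "transfer one chunk, launch the kernel, read back".
void run_kernel_on_chunk(const double *a, const double *b, double *c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

int main() {
    const std::size_t N = 10000000;    // total number of events
    const std::size_t CHUNK = 1000000; // keep each launch below the threshold
    std::vector<double> a(N, 1.0), b(N, 2.0), c(N);
    for (std::size_t off = 0; off < N; off += CHUNK) {
        const std::size_t n = std::min(CHUNK, N - off);
        run_kernel_on_chunk(&a[off], &b[off], &c[off], n);
    }
}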

scarlehoff commented 5 years ago

It would be interesting to see the profile at that number of events to see where the problem is. Maybe we are creating a bottleneck somewhere that is making the FPGA useless for our purposes?

scarlehoff commented 5 years ago

Ok, both implementations of the kernel have the same speed up to 10^6 but at 10^7:

Simple kernel: 0.9s, complicated kernel: 0.25s

(beyond that you get a bad_alloc; the arrays are too big, I imagine)

scarlehoff commented 5 years ago

This last commit (in theory) uses more than one memory bank. In order to use the RAM we should just need to tell it to use DDR instead of HBM.

In OpenCL you are not supposed to use the pragmas of the example, but instead use max_memory_ports: https://www.xilinx.com/html_docs/xilinx2019_1/sdaccel_doc/wrj1504034328013.html?hl=max_memory_ports
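For reference, a hedged sketch of what the xocc link line could look like with that option (the kernel instance, port and bank names are assumptions, and the exact --sp syntax depends on the SDAccel version):

xocc -l -t hw --platform $(DEVICE) \
    --max_memory_ports vector_add_xilinx_repo \
    --sp vector_add_xilinx_repo_1.m_axi_gmem0:bank0 \
    --sp vector_add_xilinx_repo_1.m_axi_gmem1:bank1 \
    -o vector_add.hw.xclbin vector_add.xo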

It runs on the emulator; I am compiling the hw now.

scarlehoff commented 5 years ago

Indeed, now it is using 3 banks of memory and we can run with 3*10^7 as we thought.

scarlehoff commented 5 years ago

Using the RAM, and only the RAM, we don't lose anything and we don't hit the limit. Maybe it is just because this example is very simple, but it looks to me like using the HBM is useless.

---- 10^6: RAM -> 0.1362820s, HBM -> 0.1445910s, GPU -> 0.2663280s

---- 3*10^7 (near the 256MB limit): RAM -> 0.52s, HBM -> 0.52s, GPU -> 0.4s

---- 10^8: RAM -> 1.5s, HBM -> can't run because of the 256MB-per-array limit, GPU -> 0.6s

The GPU uses the other kernel btw.

In short, my guess is that we are still doing something not 100% right, because we get a big penalty as we approach the 256MB limit, but this is orders of magnitude better than what we had on Thursday, so I am very happy.

scarrazza commented 5 years ago

Cool, this example is pretty good: the FPGA is close to the GPU performance, so for sure this will work very well for Vegas.

scarlehoff commented 5 years ago

[Figure_1]

The results plotted: notice that both FPGA and GPU suffer around 256MB, but it is much, much worse for the FPGA (the x axis is the number of elements; note that the kernels are different, though).

scarrazza commented 5 years ago

Strange, I have tried to run the code with the GPU idle at 50°C and I get this: [image]

scarlehoff commented 5 years ago

On the GPU I am running the vector_add_simple kernel, and I've waited 30s between runs. Not sure whether it makes a difference.

scarrazza commented 5 years ago

Umm, just tried after 10 minutes of idle:

12:10 $ make run-gpu EVENTS=100000000
./cpp-opencl vector_add_simple kernel.cl 100000000 0
Found 3 platforms:
[0] NVIDIA CUDA
[1] Xilinx
[2] Intel(R) CPU Runtime for OpenCL(TM) Applications
Selected: NVIDIA CUDA
Reading kernel.cl
Finished running OCL kernel, took: 1.2904000 seconds
Result checker: passed!

scarrazza commented 5 years ago

BTW, getting better results for the CPU is perfectly fine; vector add is a memory-bound problem.

scarlehoff commented 5 years ago
$ make run-gpu EVENTS=100000000
./cpp-opencl vector_add_simple kernel.cl 100000000 0
Found 3 platforms:
[0] NVIDIA CUDA
[1] Xilinx
[2] Intel(R) CPU Runtime for OpenCL(TM) Applications
Selected: NVIDIA CUDA
Reading kernel.cl
Finished running OCL kernel, took: 0.7178870 seconds
Result checker: passed!

And if I run again


$ make run-gpu EVENTS=100000000
./cpp-opencl vector_add_simple kernel.cl 100000000 0
Found 3 platforms:
[0] NVIDIA CUDA
[1] Xilinx
[2] Intel(R) CPU Runtime for OpenCL(TM) Applications
Selected: NVIDIA CUDA
Reading kernel.cl
Finished running OCL kernel, took: 1.3406170 seconds
Result checker: passed!

It might depend on the random numbers being added, or on the computer doing other things with the RAM at the same time.

scarrazza commented 5 years ago

You are running with double, right? The FPGA then seems really good in comparison to the GPU if the data size is not too big. But anyway, this example is already becoming misleading; it is quite good for understanding the different memory management options and work items. In particular, this post https://forums.xilinx.com/t5/SDAccel/opencl-kernel/td-p/913785 confirms that the {1,1,1} approach is the way to go.

Two things remain obscure to me:

- if the local copies are really necessary to get speed improvements with both HBM and DRAM
- what are the advantages and disadvantages of using a particular local size, see https://www.xilinx.com/html_docs/xilinx2017_4/sdaccel_doc/ece1504034297316.html

scarlehoff commented 5 years ago

> Two things remain obscure to me:
>
> if the local copies are really necessary to get speed improvements with both HBM and DRAM

We know it is with the HBM, because we found it to be so in the case of the integrator. We need to test with the DRAM.

> what are the advantages and disadvantages of using a particular local size, see https://www.xilinx.com/html_docs/xilinx2017_4/sdaccel_doc/ece1504034297316.html

To me, the forum link and this link contradict each other in every sense. I think the bottom line is that the compiler is able to do very well in very simple scenarios (like the vector add), but the moment the computation is more complicated you need to set (1,1,1) and have everything written explicitly, or the compiler will go crazy.
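To make the (1,1,1) point concrete, a minimal sketch of a single-work-item kernel (the kernel name is a placeholder):

__kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void vector_add_single(__global const double *a, __global const double *b,
                       __global double *c, const int n) {
    // One work item does everything: the explicit loop replaces NDRange
    // parallelism and gives the compiler a single, well-defined pipeline.
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}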

scarrazza commented 5 years ago

Ok, here are my trials. Not bad at all; for this example/board the local copy seems useless:

[image]

scarlehoff commented 5 years ago

But I believe the trick in this case is that the compiler is able to do the best thing because it is so simple.

In the other case (the integrator) there is also the problem of the same data being used at several stages of the loop.

I've been writing the integrator kernel following the spirit of this example, but it didn't work well (the result was wrong). On Monday I'll try again.

scarlehoff commented 5 years ago

The good news after reading the guide is that I now know a few things we can do to make the MC (and also the vector add) much better. I'll play with these ideas in the vector add first, as it is easier.

The bad news is that the part on "loops with dependencies" (which apparently is a very important topic; I am not surprised) directs you to a 592-page guide I am scared to open.

scarrazza commented 5 years ago

Great, thanks. What are the document numbers?

scarlehoff commented 5 years ago

It is this one: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_1/ug902-vivado-high-level-synthesis.pdf (we only have to care about chapter 3 actually, but saying "a 592-page guide" sounds more impressive).

scarrazza commented 5 years ago

Ok, so the guide you have been looking at so far is https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_1/ug1207-sdaccel-optimization-guide.pdf?

scarlehoff commented 5 years ago

Btw, the most important thing, I feel, will be Appendix A of UG1277, where they introduce the Streaming Platform. This is a fairly new API (added just this year), but in the best-case scenario, if there are no unforeseen problems, it would allow us to constantly pass the results of the integration to the CPU (in appropriately-sized chunks) for it to perform the accumulation, which in turn would allow the full integration to be parallelized.
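A minimal sketch of that host-side accumulation scheme (receive_chunk is a hypothetical stand-in for whatever the streaming API would provide):

#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical producer: in the real setup this would pull the next chunk
// of partial results from the FPGA stream; here it just fakes three chunks.
bool receive_chunk(std::vector<double> &buf) {
    static int calls = 0;
    if (++calls > 3) return false;
    std::fill(buf.begin(), buf.end(), 0.5);
    return true;
}

int main() {
    std::vector<double> chunk(4096); // an "appropriately-sized" chunk
    double total = 0.0;
    while (receive_chunk(chunk))     // the device keeps integrating meanwhile
        total = std::accumulate(chunk.begin(), chunk.end(), total);
}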

scarlehoff commented 5 years ago

At the moment I am using the HBM memory banks. The RAM seems slightly better for this example, but filling the 256MB is satisfying.

The command I am always running is:

./driver vector_add_xilinx_repo <xclin_file> 30000000 1

Anyway, I have several kernels compiled to test. Before this commit we get an average over 10 runs of: 0.66 +- 0.05s

With a67760922ccea6e9 we get instead: 0.541 +- 0.045s

This commit in principle allows the kernel to overlap operations. In this kernel the computation consists of three functions:

1. copy_data(arrayA, &a[i], BUFFER_SIZE);
2. copy_data(arrayB, &b[i], BUFFER_SIZE);
3. computation(&c[i], arrayA, arrayB, BUFFER_SIZE);

After the first iteration of (1) and (2) has happened, part of the FPGA can start computing (3) while the rest continues with the next iteration of (1) and (2). Beyond that, there is another level of parallelism, in that iterations of (1), (2) and (3) happen in parallel to the best of the FPGA's capabilities. In a cartoonish way:

cycle 1: (1) does the first iteration
cycle 2: (2) does the first iteration and (1) does the second iteration
cycle 3: (3) does the first iteration, (2) does the second iteration, (1) does the third iteration

Note that this is different from unrolling, which would be more like: cycle 1: (1) does all its iterations.

In principle this example would benefit from unrolling, but in real life it can hardly be used... (a reconstruction of the kernel structure is sketched below).
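For future reference, a hedged reconstruction of that kernel structure (the kernel name, BUFFER_SIZE and the single-work-item attribute are assumptions; the overlap itself is something the tool infers from this shape, not something the code spells out):

#define BUFFER_SIZE 1024

void copy_data(double *dst, __global const double *src, const int size) {
    for (int j = 0; j < size; ++j)  // burst-read one chunk into on-chip memory
        dst[j] = src[j];
}

void computation(__global double *out, const double *a, const double *b,
                 const int size) {
    for (int j = 0; j < size; ++j)  // add and burst-write the chunk back
        out[j] = a[j] + b[j];
}

__kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void vector_add_buffered(__global const double *a, __global const double *b,
                         __global double *c, const int n) {
    double arrayA[BUFFER_SIZE];     // private on-chip buffers
    double arrayB[BUFFER_SIZE];
    for (int i = 0; i < n; i += BUFFER_SIZE) {
        const int size = (n - i) < BUFFER_SIZE ? (n - i) : BUFFER_SIZE;
        copy_data(arrayA, &a[i], size);           // (1)
        copy_data(arrayB, &b[i], size);           // (2)
        computation(&c[i], arrayA, arrayB, size); // (3)
    }
}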

(I'm documenting what I do for my own future benefit.)

I'll keep adding new features and benchmarking them. Of course, some of them will likely not change the numbers much, because this is such an easy example, but anything that doesn't make the results worse can be considered an improvement.

scarlehoff commented 5 years ago

The streaming API is not documented for OpenCL and there are no examples for it. I wonder whether we should move to C++ kernels (which are mostly C anyway), as those are the ones that are fully documented...

scarrazza commented 5 years ago

Probably yes; at this point OpenCL is just a marketing keyword and a host implementation.

scarlehoff commented 5 years ago

Intel, for instance, gives you an example of creating a pipe to the host: https://www.intel.com/content/www/us/en/programmable/support/support-resources/design-examples/design-software/opencl/host-pipe.html

scarlehoff commented 5 years ago

The only way I've found of doing something similar to the streaming API is by using two separate CUs (compute units). If this works (it is compiling now), we can try implementing the C++ version and use this as the OpenCL benchmark point against which to check the speed.

The nice thing about using 2 CUs is that it maps very well to the RAM, as each CU deals with one and only one RAM bank. I am not sure whether the hw_emu is trustworthy in this case, but it showed a 50% reduction in runtime (which suggests that the idea of having the kernel do more than one calculation at a time was actually well-founded).
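For the record, a hedged sketch of how the two compute units can be requested at link time with xocc's --nk option (the kernel and file names are assumptions):

xocc -l -t hw --platform $(DEVICE) \
    --nk vector_add_buffered:2 \
    -o vector_add_2cu.hw.xclbin vector_add.xo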

scarlehoff commented 5 years ago

Well, this closes the question of host streaming in OpenCL: https://github.com/Xilinx/SDAccel_Examples/issues/56#issuecomment-544079527. There are some features that are not supported for OpenCL kernels, so ¯\_(ツ)_/¯

scarlehoff commented 5 years ago

As per the email, the examples for the streaming API are not supported on our FPGA, so all in all I think we might as well remain in OpenCL.

It is necessary to rethink how to do the MC integration. The only bottleneck I've found is the random numbers, so we might just feed those in from the host.