First benchmark, GPU for 10^8 numbers:
CPU for 10^8 numbers:
Maybe it's worth reading this: https://iris.polito.it/retrieve/handle/11583/2669854/193982/07859319.pdf
In a few words: if they managed to perform MC with financial data, for sure we can do something similar for integrals.
And there is code: https://github.com/HLSpolito !!
Maybe we can understand something by comparing: https://github.com/HLSpolito/KNN/blob/master/knn_fpga/nn_fpga.cl with https://github.com/HLSpolito/KNN/blob/master/knn_cpu/nn_cpu.cl
I'll have a look at it tomorrow.
In the meantime I have results for the vector addition (with the simple kernel). All of them with 10^7 events (the FPGA fails at 10^8 because of memory issues???).
FPGA: 0.8s CPU: 0.5s GPU: 0.3s
But, interestingly enough, if I just do 10^6 events: FPGA: 0.16s CPU: 0.46s GPU: 0.27s
All of this for the simple kernel.
umm, strange that reducing size makes it faster... but interesting.
There is a baseline of 0.13s even if you run it with just 1 event, so going from 1 to 10^6 events basically introduces an overhead of only ~0.03s (probably because of memory transfer). It is similar for the GPU, but the baseline is higher (0.3s).
Might it be that the trick is just running the loop many times with different data, making sure we don't go above the threshold? It sounds cheap to me...
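Something like the following host-side loop is what I have in mind (just a sketch; launch_vector_add is a hypothetical stand-in for whatever enqueues the kernel on a sub-range, and the 10^6 threshold comes from the numbers above):

#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for the real OpenCL enqueue of the kernel on the
// events [offset, offset + n); the real version would set the kernel args
// and enqueue it on the command queue.
static void launch_vector_add(std::size_t offset, std::size_t n)
{
    (void)offset;
    (void)n;
}

int main()
{
    const std::size_t total_events = 10000000; // 10^7 events overall
    const std::size_t chunk_size = 1000000;    // stay below the ~10^6 threshold

    // Run the same kernel many times on different chunks of the data so that
    // every single launch stays below the size where the FPGA slows down.
    for (std::size_t offset = 0; offset < total_events; offset += chunk_size) {
        const std::size_t n = std::min(chunk_size, total_events - offset);
        launch_vector_add(offset, n);
    }
    return 0;
}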
The same happens with the integrator btw, but the threshold is at 10^4.
It would be interesting to see the profile with such a number of events to see where the problem is. Maybe we are creating a bottleneck somewhere which is making the FPGA useless for our purposes?
Ok, both implementations of the kernel have the same speed up to 10^6 but at 10^7:
Simple kernel: 0.9s Complicated kernel: 0.25s
(with more than that you get a bad alloc; the arrays are too big, I imagine)
This last commit (in theory) uses more than 1 memory bank. In theory in order to use the RAM we just need to tell it to use DDR instead of HBM.
In OpenCL you are not supposed to use the pragmas from the example but instead use max_memory_ports: https://www.xilinx.com/html_docs/xilinx2019_1/sdaccel_doc/wrj1504034328013.html?hl=max_memory_ports
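For reference, at link time that looks roughly like this (only an illustration: the kernel/argument names and the exact flag spelling depend on the SDAccel version and on our kernel):

# Give the kernel one memory interface per argument, then pin each argument
# to a memory bank (DDR here; HBM[...] would work the same way).
xocc --link --max_memory_ports vector_add \
     --sp vector_add_1.a:DDR[0] \
     --sp vector_add_1.b:DDR[1] \
     --sp vector_add_1.c:DDR[2] \
     vector_add.xo -o vector_add.xclbin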
It runs on the emulator, I am compiling the hw now.
Indeed, now it is using 3 banks of memory and we can run with 3*10^7 as we thought.
Using the RAM and only the RAM we don't lose anything and we don't hit the limit. Maybe it is just because this example is very simple but it looks to me like using the HBM is useless.
---- 10^6: RAM -> 0.1362820s, HBM -> 0.1445910s, GPU -> 0.2663280s
---- 3*10^7 (near the 256MB limit): RAM -> 0.52s, HBM -> 0.52s, GPU -> 0.4s
---- 10^8: RAM -> 1.5s, HBM -> can't run because of the 256MB-per-array limit, GPU -> 0.6s
The GPU uses the other kernel btw.
In short, my guess is we are still doing something not 100% ok, because we get a big penalty as we get near the 256MB limit, but this is orders of magnitude better than what we had on Thursday so I am very happy.
Cool, this example is pretty good, the FPGA design is close to the GPU performance; for sure this will work very well for Vegas.
Here are the results plotted; notice that both FPGA and GPU suffer around 256MB, but it is much worse for the FPGA (the x-axis is the number of elements; note that the kernels are different, though).
Strange, I have tried to run the code with the GPU idle at 50C and I get this:
On the GPU I am running the vector_add_simple kernel and I've waited 30s between runs. Not sure whether it makes a difference.
Umm, just tried after 10 minutes of idle:
12:10 $ make run-gpu EVENTS=100000000
./cpp-opencl vector_add_simple kernel.cl 100000000 0
Found 3 platforms:
[0] NVIDIA CUDA
[1] Xilinx
[2] Intel(R) CPU Runtime for OpenCL(TM) Applications
Selected: NVIDIA CUDA
Reading kernel.cl
Finished running OCL kernel, took: 1.2904000 seconds
Result checker: passed!
BTW, getting better results for the CPU is perfectly fine, vector add is a memory-bound problem.
$ make run-gpu EVENTS=100000000
./cpp-opencl vector_add_simple kernel.cl 100000000 0
Found 3 platforms:
[0] NVIDIA CUDA
[1] Xilinx
[2] Intel(R) CPU Runtime for OpenCL(TM) Applications
Selected: NVIDIA CUDA
Reading kernel.cl
Finished running OCL kernel, took: 0.7178870 seconds
Result checker: passed!
And if I run again
$ make run-gpu EVENTS=100000000
./cpp-opencl vector_add_simple kernel.cl 100000000 0
Found 3 platforms:
[0] NVIDIA CUDA
[1] Xilinx
[2] Intel(R) CPU Runtime for OpenCL(TM) Applications
Selected: NVIDIA CUDA
Reading kernel.cl
Finished running OCL kernel, took: 1.3406170 seconds
Result checker: passed!
It might depend on the random numbers being added. Or on the computer doing other things with the RAM at the same time.
You are running with double, right? The FPGA then seems really good in comparison to the GPU if the data size is not too big. But anyway, this example is already becoming misleading; it is quite good for understanding the different memory managements and work items, and in particular this post https://forums.xilinx.com/t5/SDAccel/opencl-kernel/td-p/913785 confirms that the {1,1,1} approach is the way to go.
Two things remain obscure to me:
if the local copies are really necessary to get speed improvements with both HBM and DRAM
We know it is with the HBM, because that is what we found in the case of the integrator. We need to test with the DRAM.
what are the advantages and disadvantages of using a particular local size, see https://www.xilinx.com/html_docs/xilinx2017_4/sdaccel_doc/ece1504034297316.html
To me, the forum link and this link contradict each other in every sense. I think the bottom line is that the compiler does very well in very simple scenarios (like the vector add), but the moment the computation is more complicated you need to set (1,1,1) and have everything written explicitly, or the compiler will go crazy.
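To make the (1,1,1) idea concrete, this is the kind of single-work-item kernel I mean (a minimal sketch of the plain vector add, not the exact kernel in the repo):

// A single work item: the whole computation is one explicit loop, so the
// compiler sees the full trip count and is free to pipeline it.
__kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void vector_add_single(__global const double *a,
                       __global const double *b,
                       __global double *c,
                       const int n_events)
{
    for (int i = 0; i < n_events; i++) {
        c[i] = a[i] + b[i];
    }
}

The NDRange alternative would instead leave the loop implicit in the work items and let the local size decide how they are grouped.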
Ok, here are my trials, not bad at all; for this example/board the local copy seems useless:
But I believe the trick in this case is that the compiler is able to do the best thing because it is so simple.
In the other case (the integrator) there is also the problem of the same stuff being used at several stages of the loop.
I've been writing the integrator kernel following the spirit of this example but it didn't work well (the result was wrong). On Monday I'll try again.
The good news after reading the guide is that I now know of a few things that will make the MC (and also the vector add) much better. I'll play with these ideas in the vector add first as it is easier.
The bad news is that the part on "loops with dependencies" (which apparently is a very important topic, I am not surprised) directs you to a 592-page guide I am scared to open.
Great thanks. What are the document numbers?
It is this one: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_1/ug902-vivado-high-level-synthesis.pdf (we only have to care about chapter 3 actually, but saying "a 592-page guide" sounds more impressive).
Ok, the guide you have been looking at so far is https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_1/ug1207-sdaccel-optimization-guide.pdf ?
Btw, the thing I feel will be most important is Appendix A of UG1277, where they introduce the Streaming Platform. This is a fairly new API (added just this year), but in the best-case scenario, if there are no unforeseen problems, it would allow us to constantly pass the results of the integration to the CPU (in appropriately-sized chunks) for it to perform the accumulation, which in turn would allow the full integration to be parallelized.
At the moment I am using the HBM memory banks. The RAM seems slightly better for this example, but filling the 256MB is satisfying.
The command I am always running is:
./driver vector_add_xilinx_repo <xclbin_file> 30000000 1
Anyway, I have several kernels compiled to test. Before this commit we get an average after 10 runs of: 0.66 +- 0.05s
With a67760922ccea6e9 we get instead: 0.541 +- 0.045s
This commit in principle allows the kernel to overlap operations. In this kernel the computation consists of three functions:
1. copy_data(arrayA, &a[i], BUFFER_SIZE);
2. copy_data(arrayB, &b[i], BUFFER_SIZE);
3. computation(&c[i], arrayA, arrayB, BUFFER_SIZE);
After the first iteration of (1) and (2) has happened, part of the FPGA can start computing (3) while the rest continues with the next iteration of (1) and (2). Beyond that, there is another level of parallelism in that iterations of (1), (2) and (3) are happening in parallel to the best of the FPGA's capabilities. In a cartoonish way:
cycle 1: (1) does the first iteration
cycle 2: (2) does the first iteration and (1) does the second iteration
cycle 3: (3) does the first iteration, (2) does the second iteration, (1) does the third iteration
Note that this is different from unrolling, which would be more like:
cycle 1: (1) does all iterations
In principle this example would benefit from unrolling, but in real life it can hardly be used...
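For reference, the shape of that kernel is roughly the following (a sketch: BUFFER_SIZE, the helper names and the split into (1)/(2)/(3) follow the description above, the rest is illustrative and assumes the number of events is a multiple of BUFFER_SIZE):

#define BUFFER_SIZE 1024

// (1)/(2): burst-copy a chunk of a global array into on-chip memory.
void copy_data(double *dst, __global const double *src, const int size)
{
    for (int j = 0; j < size; j++)
        dst[j] = src[j];
}

// (3): compute on the local copies and write the chunk back to global memory.
void computation(__global double *out, const double *arrayA,
                 const double *arrayB, const int size)
{
    for (int j = 0; j < size; j++)
        out[j] = arrayA[j] + arrayB[j];
}

__kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void vector_add_buffered(__global const double *a,
                         __global const double *b,
                         __global double *c,
                         const int n_events)
{
    double arrayA[BUFFER_SIZE];
    double arrayB[BUFFER_SIZE];

    // The outer loop walks the arrays in chunks of BUFFER_SIZE; the tools can
    // then overlap the copies (1)/(2) of one chunk with the computation (3)
    // of the previous one instead of running everything sequentially.
    for (int i = 0; i < n_events; i += BUFFER_SIZE) {
        copy_data(arrayA, &a[i], BUFFER_SIZE);
        copy_data(arrayB, &b[i], BUFFER_SIZE);
        computation(&c[i], arrayA, arrayB, BUFFER_SIZE);
    }
}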
(I'm documenting what I do for my own future benefit.)
I'll keep adding new features and benchmarking them. Of course, some of them will likely not change the numbers much because this is such an easy example, but anything that doesn't make the results worse can be considered an improvement.
The streaming API is not documented for OpenCL and there are no examples for it. I wonder whether we should move to C++-kernels-which-are-mostly-C-anyway as those are the ones which are fully documented...
Probably yes; at this point OpenCL is just a marketing keyword and a host implementation.
Intel for instance gives you an example of creating a pipe-to-host https://www.intel.com/content/www/us/en/programmable/support/support-resources/design-examples/design-software/opencl/host-pipe.html
The only way I've found of doing something similar to the streaming API is by using two separate CUs. If this works (it is compiling now) we can try implementing the C++ version and use this one as the OpenCL benchmarking point to check the speed.
The nice thing about using 2 CUs is that it maps very well to the RAM, as each CU deals with one and only one RAM bank. I am not sure whether the hw_emu is trustworthy in this case, but it showed a 50% reduction in runtime (which points to the idea that having the kernel do more than one calculation at a time was actually well-founded).
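For the record, the link step for the 2-CU version looks roughly like this (again just an illustration, the exact flag spelling and names depend on the tool version):

# Build two compute units of the kernel and pin each one to its own DDR bank.
xocc --link --nk vector_add:2:vector_add_1.vector_add_2 \
     --sp vector_add_1.a:DDR[0] --sp vector_add_1.b:DDR[0] --sp vector_add_1.c:DDR[0] \
     --sp vector_add_2.a:DDR[1] --sp vector_add_2.b:DDR[1] --sp vector_add_2.c:DDR[1] \
     vector_add.xo -o vector_add_2cu.xclbin

On the host side the same kernel is then enqueued twice, once per half of the data, and the runtime schedules the two launches onto the two compute units.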
Well, this closes the issue on host-streaming in OpenCL https://github.com/Xilinx/SDAccel_Examples/issues/56#issuecomment-544079527 There are some features not supported for OpenCL kernels, so ¯\_(ツ)_/¯
As per the email, the examples for the streaming API are not supported on our FPGA, so all in all I think we might as well remain in OpenCL.
It is necessary to rethink how to do the ML integration. The only bottleneck I've found is the random numbers, so we might just feed those in from the host.
I've modified our host.cc to be similar to the one from the xilinx repository, and I'm using our own makefile, where we can run different kernels by doing make run TARGET=sw_emu KERNEL=vector_add_xilinx_repo. For now there are two kernels: the simple one, which is like the Intel one, and the one from the xilinx repo.
I'm also using DOUBLE, defined at the moment as double, so that we can check whether the datatype actually makes a difference. (If you already started, feel free to force push.)
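(For completeness, the DOUBLE switch is just a preprocessor define in the kernel source, along these lines; the kernel below is only an illustration, not the exact one in the repo:)

// Fall back to double precision if the build does not define DOUBLE
// (it could also be set at compile time, e.g. with -DDOUBLE=float).
#ifndef DOUBLE
#define DOUBLE double
#endif

// Every use of the datatype goes through DOUBLE, so switching precision
// is a one-line (or one-flag) change.
__kernel void vector_add_example(__global const DOUBLE *a,
                                 __global const DOUBLE *b,
                                 __global DOUBLE *c)
{
    const int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}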