One (opencl) kernel to run them all

scarlehoff commented 5 years ago

There is a mccl.h which is inspired by the header from Xilinx, with a few extra goodies. I am doing all of this in a different folder from FPGA since I won't know how destructive I am being until I port the makefile...

scarrazza commented 5 years ago

Looks good.

scarlehoff commented 5 years ago

I think the problems with the TARGET=hw are not due to the code because I am not able to run the old version either.

scarrazza commented 5 years ago

Yeah, even xbutil validate is failing.

scarrazza commented 5 years ago

OK, after a reboot it seems to work for me, could you please retry?

scarlehoff commented 5 years ago

Ok, now the fpga/ folder works. I am compiling the other one.

scarlehoff commented 5 years ago

Now hw_emu works with EVENTS=10, more than that seems very slow (I've gotten results with up to 100 though). Interestingly enough, with the printf I was able to go up to EVENTS=100, I still don't know what the printf was doing.

Note that the printf was done with the globals inside. It would be interesting to know what changed, maybe it is time to learn VHDL or whatever language the OpenCL compiles to (if it is anything human-readable).

I changed make check to make run.

scarlehoff commented 5 years ago

Ok, this last commit works, but there are several points I want to highlight:

1 - I tried doing ctrl + C and re-sending the kernel and it worked well. So there was something actually wrong in the previous one, beyond just taking a long time.

2 - Speed

make run TARGET=hw EVENTS=1000000
It took: 10.3109890 seconds

make run-gpu EVENTS=1000000
It took: 0.4811010 seconds
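
For scale, the two timings above can be turned into a ratio and a per-event cost. This is only a quick sketch: the variable names are made up for the illustration, the numbers are the ones reported above, and the times include whatever overhead the timer in `make run` happens to cover.

```python
# Timings reported above for EVENTS=1000000 (seconds).
events = 1_000_000
t_fpga = 10.3109890   # make run TARGET=hw
t_gpu = 0.4811010     # make run-gpu

# How many times longer the FPGA run takes compared to the GPU run.
slowdown = t_fpga / t_gpu

# Average wall-clock cost per event, in microseconds.
us_per_event_fpga = t_fpga / events * 1e6
us_per_event_gpu = t_gpu / events * 1e6

print(f"FPGA/GPU time ratio: {slowdown:.1f}")
print(f"FPGA: {us_per_event_fpga:.2f} us/event, GPU: {us_per_event_gpu:.2f} us/event")
```

That is a bit more than a factor of 21, or roughly 10 us per event on the FPGA against roughly 0.5 us on the GPU.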

At first I thought that it was much slower than the GPU, which is sad... but is it? I have the feeling it is running with just one thread-kernel and it is doing something clever with the loop upon compilation. This means there is a lot of room for improvement because Xilinx didn't know the size of the kernel at compile time.

Tomorrow we can talk more in detail about this. I think the long story short is that the knowledge about threads/workers from the GPU cannot be ported to the FPGA directly and there are more things to take care of.
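
To make the thread-versus-loop point concrete, here is a toy sketch in plain Python (not OpenCL, and not the code in this branch) of two ways of splitting the same Monte Carlo sum: one work-item per event, as it would be launched on a GPU, and a single work-item that loops over all events, which is what the comment above suspects the FPGA build is effectively running. The integrand, the function names and the seed are all hypothetical.

```python
import math
import random


def integrand(x):
    # Hypothetical stand-in for the per-event weight.
    return x * x


def per_work_item(points):
    # GPU-style: each "work-item" gid handles exactly one event,
    # and the partial results are reduced afterwards.
    partial = [integrand(points[gid]) for gid in range(len(points))]
    return math.fsum(partial) / len(points)


def single_work_item(points):
    # FPGA-style guess: one work-item walks over all events in a loop,
    # which the compiler can then try to pipeline.
    acc = 0.0
    for x in points:
        acc += integrand(x)
    return acc / len(points)


rng = random.Random(1234)
points = [rng.random() for _ in range(10_000)]

gpu_like = per_work_item(points)
fpga_like = single_work_item(points)
print(gpu_like, fpga_like)
```

Both give the same estimate of the integral up to rounding. What differs is where the parallelism has to come from: many independent work-items in the first case, a pipelined loop in the second.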

3 - There are a few problems which need to be investigated:

[ ] What is the effect of the global keyword and how to fix it without allocating spurious memory
[ ] What is the unaligned pointer error
[ ] Why the FPGA is so much slower, can it be improved or is it "a feature"?
[ ] What was wrong with the previous version that "breaks" the FPGA?

scarrazza commented 5 years ago

OK, the profiling guidance seems to point out that we have several memory issues: http://scarraza.web.cern.ch/scarraza/profile_summary.html

scarlehoff commented 5 years ago

And also the compute units problem that we knew from before.

How many events did you use? If the number of events was low enough maybe some of the low memory usage warnings are just due to not having that many things to pass through memory anyway.

scarlehoff commented 5 years ago

Benchmark with commit bd8bd6d, running always

make clean && make run-gpu EVENTS=100000000

i.e., 10^8 events:

Tesla V 100: 1.8476680 seconds
Titan V: 2.5045410 seconds
Tesla P 100: 3.2197380 seconds
Tesla C2070 (all doubles had to be converted to float): 10.9822969 seconds
2080: 7.2298960 seconds
Dom's CPU: 94.2260180 seconds
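
For easier comparison, the same numbers can be expressed as speed-ups over the CPU run. A minimal sketch using only the timings listed above (the dictionary layout is just for illustration; note that the C2070 entry ran in float, so it is not a like-for-like comparison):

```python
# Wall-clock times reported above for 10^8 events (seconds).
timings = {
    "Tesla V 100": 1.8476680,
    "Titan V": 2.5045410,
    "Tesla P 100": 3.2197380,
    "2080": 7.2298960,
    "Tesla C2070 (float)": 10.9822969,
    "Dom's CPU": 94.2260180,
}

cpu_time = timings["Dom's CPU"]

# Speed-up of each device relative to the CPU run.
speedup = {name: cpu_time / t for name, t in timings.items()}

# Print the devices from fastest to slowest.
for name, factor in sorted(speedup.items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} {factor:6.1f}x")
```

At 10^8 events the Tesla V 100 comes out about 51 times faster than the CPU.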

None of this even gets to 1GB on the GPU. Also, on the Tesla V 100:

10^9 events: 10s
10^10 events (some changes in the code because it was above MAXINT): 140s
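
The MAXINT remark can be checked directly: 10^10 does not fit into a signed 32-bit integer while 10^9 still does, so the event counter needs a 64-bit type (long in OpenCL C) for the largest run. The sketch below also turns the Tesla V 100 timings into events per second. The helper name fits_in_int32 is made up for the illustration.

```python
# Largest value of a signed 32-bit int (INT_MAX in C and OpenCL C).
INT32_MAX = 2**31 - 1


def fits_in_int32(n_events):
    # True if an event counter of this size can live in a 32-bit int.
    return n_events <= INT32_MAX


# (events, seconds) on the Tesla V 100, as reported above.
runs = [(10**8, 1.8476680), (10**9, 10.0), (10**10, 140.0)]

for n_events, seconds in runs:
    rate = n_events / seconds  # events per second
    print(f"{n_events:.0e} events: {rate:.2e} ev/s, "
          f"fits in int32: {fits_in_int32(n_events)}")
```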

10^9 is the typical number of events that would be used in fixed-order calculations. We already have a factor of 100 there and I am guessing it will be even bigger as the number of events grows. I've tried a bit more complicated functions (still nothing derailed) and the performance was not impacted. I am looking forward to understanding what the problem with the FPGA is, because the whole thing seems quite robust (for GPUs).

scarrazza commented 5 years ago

Interesting. Are you passing -O3/4 to the fpga binary compilation?

scarlehoff commented 5 years ago

No, I am keeping the same flags you had in the other makefile. Do you think it might be that?

Because I really hope the 2-hour compilation of the FPGA is doing some kind of optimization...

scarrazza commented 5 years ago

I am playing with the SDx GUI and I see a big box where we select the optimization flag from -O0 to -Oquick, but it is true that in 2h it is applying something clever.

scarlehoff commented 5 years ago

Maybe we should try with that -Oquick flag. It might be that without it it does something very stupid but very good for testing?

scarlehoff commented 5 years ago

After all the changes, with the last commit,

make run TARGET=hw EVENTS=1000000

still takes 9 seconds (before it took 10s)

it is clear neither the memory nor the pragmas were the problem

scarlehoff commented 4 years ago

We can sort of consider the opencl branch closed tbh...

N3PDF / mcgpu

One (opencl) kernel to run them all #14