scarlehoff opened this issue 5 years ago
Looks good.
I think the problems with `TARGET=hw` are not due to the code, because I am not able to run the old version either.
Yeah, even `xbutil validate` is failing.
OK, after a reboot it seems to work for me, could you please retry?
Ok, now the `fpga/` folder works. I am compiling the other one.
Now `hw_emu` works with `EVENTS=10`; more than that seems very slow (I've gotten results with up to 100 though).
Interestingly enough, with the `printf` I was able to go up to `EVENTS=100`. I still don't know what the `printf` was doing.
Note that the `printf` was done with the globals inside. It would be interesting to know what changed; maybe it is time to learn VHDL or whatever language the OpenCL compiles to (if it is anything human-readable).
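For reference, a kernel-side `printf` of the sort being discussed looks roughly like this (a minimal sketch assuming an OpenCL 1.2 kernel; the kernel name `mc_toy` and its arguments are made up for illustration, not the actual code):

```c
/* Illustrative only: a debug printf inside a kernel that writes
   its results to a __global buffer. */
__kernel void mc_toy(__global const float *xrand,
                     __global float *res,
                     const int n_events)
{
    int i = get_global_id(0);
    if (i < n_events) {
        res[i] = xrand[i] * xrand[i];          /* stand-in integrand */
        /* OpenCL 1.2 printf; in hw_emu this shows up on the host console */
        printf("event %d -> %f\n", i, res[i]);
    }
}
```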
I changed `make check` to `make run`.
Ok, this last commit works, but there are several points I want to highlight:

1 - I tried doing Ctrl+C and re-sending the kernel and it worked well, so there was something wrong in the previous one beyond just taking a long time.
2 - Speed:

`make run TARGET=hw EVENTS=1000000`
It took: 10.3109890 seconds

`make run-gpu EVENTS=1000000`
It took: 0.4811010 seconds

At first I thought that it was much slower than the GPU, which is sad... but is it? I have the feeling it is running with just one thread-kernel and doing something clever with the loop upon compilation (see the kernel sketch below). This means there is a lot of room for improvement, because Xilinx didn't know the size of the kernel at compile time.
Tomorrow we can talk about this in more detail. I think the long-story-short is that the knowledge about threads/workers from the GPU cannot be ported to the FPGA directly, and there are more things to take care of.
3 - There are a few problems which need to be investigated:
- the `global` keyword and how to fix it without allocating spurious memory
- the `unaligned pointer` error

OK, the profiling guidance seems to point out that we have several memory issues: http://scarraza.web.cern.ch/scarraza/profile_summary.html
And also the compute units problem that we knew from before.
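To make the single-thread-kernel hypothesis from the speed discussion concrete: the same computation can be written NDRange-style (one work-item per event, natural on a GPU) or as a single work-item whose loop the Xilinx compiler pipelines into one datapath. A hedged sketch, with made-up kernel names and assuming the SDx-era `xcl_pipeline_loop` attribute:

```c
/* NDRange style: one work-item per event, as on the GPU. */
__kernel void mc_ndrange(__global const float *xrand,
                         __global float *res)
{
    int i = get_global_id(0);
    res[i] = xrand[i] * xrand[i];
}

/* Single work-item style: one kernel instance loops over all events
   and the compiler pipelines the loop; this is likely what we are
   effectively running. */
__kernel void mc_single(__global const float *xrand,
                        __global float *res,
                        const int n_events)
{
    __attribute__((xcl_pipeline_loop))
    for (int i = 0; i < n_events; i++)
        res[i] = xrand[i] * xrand[i];
}
```

In the second style, throughput comes from pipeline depth and from replicating compute units, not from the work-group size knobs we tune on the GPU.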
How many events did you use? If the number of events was low enough maybe some of the low memory usage warnings are just due to not having that many things to pass through memory anyway.
Benchmark with commit bd8bd6d, always running
`make clean && make run-gpu EVENTS=100000000`
i.e., 10^8 events:
None of this even gets to 1 GB in the GPU. Also, on the Tesla V100:
- 10^9 events: 10 s
- 10^10 events (some changes in the code because it was above MAXINT): 140 s
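For the record, the MAXINT remark is just 32-bit overflow: 10^10 does not fit in an `int`, so the event counter has to be widened. A minimal sketch of the kind of fix involved (illustrative, not the actual diff):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 10^10 > INT32_MAX (~2.1e9), so a 32-bit counter would wrap
       around; int64_t holds it comfortably. */
    const int64_t n_events = 10000000000LL;
    double acc = 0.0;
    for (int64_t i = 0; i < n_events; i++)
        acc += 1.0;                   /* placeholder per-event work */
    printf("%lld events, acc = %.0f\n", (long long)n_events, acc);
    return 0;
}
```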
10^9 is the typical number of events that would be used in fixed-order calculations. We already have a factor of 100 there, and I am guessing it will get even bigger as the number of events grows. I've tried somewhat more complicated functions (still nothing too wild) and the performance was not impacted. I am looking forward to understanding what the problem with the FPGA is, because the whole thing seems quite robust (for GPUs).
Interesting. Are you passing `-O3/4` to the FPGA binary compilation?
No, I am keeping the same flags you had in the other makefile. Do you think it might be that? Because I really hope the 2-hour compilation of the FPGA is doing some kind of optimization...
I am playing with the SDx GUI and I see a big box where we select the optimization flag, from `-O0` to `-Oquick`, so it does seem that in those 2 h it is applying something clever.
Maybe we should try that `-Oquick` flag. It might be that it produces something very stupid but very good for testing?
After all the changes, with the last commit,
`make run TARGET=hw EVENTS=1000000`
still takes 9 seconds (before it took 10 s). It is clear neither the memory nor the pragmas were the problem.
We can sort of consider the opencl branch closed tbh...
There is a `mccl.h` which is inspired by the header from Xilinx, with a few extra goodies. I am doing all of this in a different folder from `FPGA`, since I don't know how destructive I am being until I port the makefile...
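To give an idea of the "extra goodies", here is a hypothetical sketch of the kind of helper such a header might carry (this is not the actual `mccl.h`):

```c
#ifndef MCCL_H
#define MCCL_H

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Abort with file/line context on any OpenCL error code. */
#define MCCL_CHECK(call)                                          \
    do {                                                          \
        cl_int mccl_err_ = (call);                                \
        if (mccl_err_ != CL_SUCCESS) {                            \
            fprintf(stderr, "OpenCL error %d at %s:%d\n",         \
                    (int)mccl_err_, __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

#endif /* MCCL_H */
```

Used as, e.g., `MCCL_CHECK(clFinish(queue));`, so every host-side OpenCL call gets checked without cluttering the code.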