Checked in the intermediate version on the master-refactor branch and verified it. (The CUDA version now runs on the new properties classes; commit 6763b8511ef661d4eb2dde87a32e774298e45cfe.) Ran the CUDA version of the simulator with two configuration files (100x100 neurons, 100s x 2 epochs growth model; and 1,000 neurons, 1s IZH neuron model), varying the number of clusters (1, 2, and 4). Then ran the CPU version of the simulator with one configuration file (10x10 neurons, 100s x 2 epochs growth model) and compared against the results of the master version. All results are identical. However, the CPU refactor version of the simulator is slower than the master version (about 1.5 times slower; the GPU version is about the same).
Interesting. We can do a code review to see what might cause such a dramatic slowdown from simply reorganizing the code (especially since the change seems to be more on the GPU side). In principle, it seems like CUDA_CALLABLE should have no impact on the CPU side...
The simulator crashes (cudaErrorIllegalAddress at the cudaDeviceSynchronize() call after advanceSynapses()) after moving the device synapse functions into C++ class member functions. Fermi and its descendants support C++ virtual functions, function pointers, and the 'new' and 'delete' operators for dynamic object allocation and de-allocation, so this should work (https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf). I am blocked by this problem.
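For context, the classic way to get cudaErrorIllegalAddress from device-side virtual functions is constructing a polymorphic object on the host and then calling one of its virtual methods from a kernel: the object's vtable pointer refers to host code, so the dispatch reads an invalid address. Below is a minimal, self-contained sketch (hypothetical class and variable names, not our actual code) of the pattern that does work on Fermi and later, constructing the object on the device:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical polymorphic class whose members live entirely on the device.
class Synapse {
public:
    __device__ virtual void changePSR() { printf("Synapse::changePSR\n"); }
    __device__ virtual ~Synapse() {}
};

__device__ Synapse* d_syn;

// Construct on the device so the vtable points at device code.
// (Constructing on the host and cudaMemcpy'ing the object over would copy
// a host vtable pointer, and the virtual call below would be an illegal
// address -- the same symptom we are seeing.)
__global__ void createSynapse()  { d_syn = new Synapse(); }
__global__ void advanceSynapse() { d_syn->changePSR(); }
__global__ void destroySynapse() { delete d_syn; }

int main() {
    createSynapse<<<1, 1>>>();
    advanceSynapse<<<1, 1>>>();
    destroySynapse<<<1, 1>>>();
    return cudaDeviceSynchronize() != cudaSuccess;
}
```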
The simulator failed to function after "Moved synapses device func to CUDA_CALLABLE" (commit 04b123a5e57d0d294ffa139315e359f23139be9a).
Currently at commit ce3b61eb2fb6b9edca80a324ceb2a0cf41ab6e2b.
So, shall we conclude that this is another feature of CUDA that NVIDIA has oversold — something that, if it works, only works in simple cases? I think that, while this is a nice cleanup of our architecture, there are likely more pressing functional matters to consider.
The commit (04224e4520c7dbaa0524c2d3cf2b00d61afc1204) fixed problem no. 5 above. We still have problem no. 4 above. I ran "./growth_cuda -t validation/test-large-conected_new.xml" under cuda-memcheck and found an illegal memory access in the AllDSSynapses::changePSR() function.
```
========= CUDA-MEMCHECK
========= Invalid global read of size 4
=========     at 0x00000478 in AllDSSynapses::changePSR(unsigned int, float, unsigned long, AllSpikingSynapsesProps*)
=========     by thread (47,0,0) in block (27,0,0)
=========     Address 0x15c8469a00 is out of bounds
```
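For what it's worth, one common way to get exactly this signature (an out-of-bounds global read at a high thread/block index) is launching a kernel with the grid size rounded up to whole blocks and omitting the bounds guard, so the threads in the partial last block index past the end of the synapse arrays. A self-contained sketch of the guard pattern (hypothetical kernel and array names, not the actual simulator code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a per-synapse update such as changePSR().
__global__ void changePSRKernel(float* psr, const float* w, unsigned int count)
{
    unsigned int iSyn = blockIdx.x * blockDim.x + threadIdx.x;
    if (iSyn >= count)   // last block is partial; without this guard, its
        return;          // extra threads read past the end of psr/w
    psr[iSyn] += w[iSyn];
}

int main()
{
    const unsigned int count = 1000;   // deliberately not a multiple of 256
    float *psr, *w;
    cudaMalloc(&psr, count * sizeof(float));
    cudaMalloc(&w,   count * sizeof(float));
    cudaMemset(psr, 0, count * sizeof(float));
    cudaMemset(w,   0, count * sizeof(float));

    const unsigned int block = 256;
    const unsigned int grid  = (count + block - 1) / block;  // 4 blocks; last has 232 live threads
    changePSRKernel<<<grid, block>>>(psr, w, count);
    printf("kernel status: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));

    cudaFree(psr);
    cudaFree(w);
    return 0;
}
```

Compiling with nvcc -lineinfo also makes cuda-memcheck report the offending source line rather than just the instruction offset.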
The refactor version of growth_cuda started working on otachi, where it was compiled with CUDA 9. It looks like CUDA 9 solved the issue. To confirm this, we will install CUDA 9 on raiju and verify that it runs there. The performance of the refactor growth_cuda was slower: running master and refactor growth_cuda on 100x100 neurons with a 100s x 2 epochs simulation, the master growth_cuda took 256s and the refactor growth_cuda took 333s. Also, the simulation output was not identical. We need to investigate this.
Confirmed that the refactor version of growth_cuda works on raiju with CUDA 9.
Ran growth_cuda (GPU) with 100x100 LIF neurons and the DS synapses dynamic connections model, 100s x 2 epochs.
Master version on raiju ----- 590s (1 GPU), 477s (2 GPUs)
Refactor version on raiju ----- 867s (1 GPU), 636s (2 GPUs)
Master version on otachi ----- 162s (1 GPU), 184s (2 GPUs)
Refactor version on otachi ----- 226s (1 GPU), 229s (2 GPUs)
Ran growth_cuda (GPU) with 1,000 IZH neurons and the spiking synapses static connections model, 1s x 1 epoch.
Master version on raiju ----- 35s (1 GPU)
Refactor version on raiju ----- 40s (1 GPU)
Master version on otachi ----- 13s (1 GPU)
Refactor version on otachi ----- 9s (1 GPU)
Ran growth (CPU) with 10x10 LIF neurons and the DS synapses dynamic connections model, 100s x 2 epochs.
Master version on raiju ----- 219s
Refactor version on raiju ----- 224s
Master version on otachi ----- 220s
Refactor version on otachi ----- 234s
Difference between LIF/DSS and IZH/SS: To me, this suggests that the performance hit might just be in the neurons, not the synapses.
Difference between single and multi-GPU on otachi: Seems like our simulation just isn't big enough to merit multiple V100s.
Speedup between raiju and otachi: slightly less than a 75% reduction in runtime for both master and refactor (e.g., for single-GPU growth_cuda, 590s to 162s is a 72.5% reduction and 867s to 226s is a 73.9% reduction).
The master-refactor branch is now merged into the master branch.
What kind of issue is this?
Right now, we have C++ class hierarchies and, in the .cpp files, both the CPU-side generic code and the CPU-side simulator code. Then, in a separate _d.cu file, we have all of the GPU simulator code. The goal of this issue is to use the CUDA_CALLABLE macro to produce integrated function source code that contains both the CPU and GPU simulation code. So the .cu files will go away, and the GPU simulation code will all live in the .cpp files, alongside the class of object being simulated.
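For reference, the usual definition of such a macro (a sketch of the standard idiom, assuming ours follows it; the changePSR signature is taken from the cuda-memcheck output above):

```cpp
// Under nvcc (__CUDACC__ defined), functions marked CUDA_CALLABLE are
// compiled for both host and device; under a plain C++ compiler the
// macro expands to nothing, so the same .cpp builds as ordinary CPU code.
#ifdef __CUDACC__
#define CUDA_CALLABLE __host__ __device__
#else
#define CUDA_CALLABLE
#endif

class AllSpikingSynapsesProps;   // forward declaration

class AllDSSynapses {
public:
    // One function body serves both the CPU and GPU simulation paths,
    // replacing the duplicate in the _d.cu file.
    CUDA_CALLABLE void changePSR(unsigned int iSyn, float deltaT,
                                 unsigned long simulationStep,
                                 AllSpikingSynapsesProps* props);
};
```

This is also why the macro should be a no-op for the CPU build: g++ never sees the __host__/__device__ qualifiers at all.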