Checked in the intermediate version on the master-refactor branch and verified it. (The CUDA version now runs on the new properties classes; commit 6763b8511ef661d4eb2dde87a32e774298e45cfe.) Ran the CUDA version of the simulator with two configuration files (100x100 neurons, 100s x 2 epochs growth model; and 1,000 neurons, 1s IZH neuron model), varying the number of clusters (1, 2, and 4). Then ran the CPU version of the simulator with one configuration file (10x10 neurons, 100s x 2 epochs growth model) and compared against the results of the master version. All results are identical. However, the CPU refactor version of the simulator is slower than the master version (about 1.5 times slower; the GPU version is about the same).
Interesting. We can do a code review to see what might cause such a dramatic slowdown from simply reorganizing the code (especially since the change seems to be more on the GPU side). In principle, it seems like CUDA_CALLABLE should have no impact on the CPU side...
The simulator crashes (cudaErrorIllegalAddress at the cudaDeviceSynchronize() call after advanceSynapses()) after moving the device synapse functions into C++ class member functions. Fermi and its descendants support C++ virtual functions, function pointers, and the 'new' and 'delete' operators for dynamic object allocation and de-allocation, so this should work (https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf). I am blocked by this problem.
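For context, the classic way to get cudaErrorIllegalAddress from device-side virtual functions is constructing a polymorphic object on the host and then calling one of its virtual methods from a kernel: the object's vtable pointer refers to host code, so the dispatch reads an invalid address. Below is a minimal, self-contained sketch (hypothetical class and variable names, not our actual code) of the pattern that does work on Fermi and later, constructing the object on the device:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical polymorphic class whose members live entirely on the device.
class Synapse {
public:
    __device__ virtual void changePSR() { printf("Synapse::changePSR\n"); }
    __device__ virtual ~Synapse() {}
};

__device__ Synapse* d_syn;

// Construct on the device so the vtable points at device code.
// (Constructing on the host and cudaMemcpy'ing the object over would copy
// a host vtable pointer, and the virtual call below would be an illegal
// address -- the same symptom we are seeing.)
__global__ void createSynapse()  { d_syn = new Synapse(); }
__global__ void advanceSynapse() { d_syn->changePSR(); }
__global__ void destroySynapse() { delete d_syn; }

int main() {
    createSynapse<<<1, 1>>>();
    advanceSynapse<<<1, 1>>>();
    destroySynapse<<<1, 1>>>();
    return cudaDeviceSynchronize() != cudaSuccess;
}
```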
The simulator failed to function after "Moved synapses device func to CUDA_CALLABLE" (commit 04b123a5e57d0d294ffa139315e359f23139be9a).
Currently at commit ce3b61eb2fb6b9edca80a324ceb2a0cf41ab6e2b.
So, shall we conclude that this is another feature of CUDA that NVIDIA has oversold — something that, if it works, only works in simple cases? I think that, while this is a nice cleanup of our architecture, there are likely more pressing functional matters to consider.
The commit (04224e4520c7dbaa0524c2d3cf2b00d61afc1204) fixed problem no. 5 above. We still have problem no. 4 above. I ran "./growth_cuda -t validation/test-large-conected_new.xml" under cuda-memcheck and found an illegal memory access in the AllDSSynapses::changePSR() function.
```
========= CUDA-MEMCHECK
========= Invalid global read of size 4
=========     at 0x00000478 in AllDSSynapses::changePSR(unsigned int, float, unsigned long, AllSpikingSynapsesProps*)
=========     by thread (47,0,0) in block (27,0,0)
=========     Address 0x15c8469a00 is out of bounds
```
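For what it's worth, one common way to get exactly this signature (an out-of-bounds global read at a high thread/block index) is launching a kernel with the grid size rounded up to whole blocks and omitting the bounds guard, so the threads in the partial last block index past the end of the synapse arrays. A self-contained sketch of the guard pattern (hypothetical kernel and array names, not the actual simulator code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a per-synapse update such as changePSR().
__global__ void changePSRKernel(float* psr, const float* w, unsigned int count)
{
    unsigned int iSyn = blockIdx.x * blockDim.x + threadIdx.x;
    if (iSyn >= count)   // last block is partial; without this guard, its
        return;          // extra threads read past the end of psr/w
    psr[iSyn] += w[iSyn];
}

int main()
{
    const unsigned int count = 1000;   // deliberately not a multiple of 256
    float *psr, *w;
    cudaMalloc(&psr, count * sizeof(float));
    cudaMalloc(&w,   count * sizeof(float));
    cudaMemset(psr, 0, count * sizeof(float));
    cudaMemset(w,   0, count * sizeof(float));

    const unsigned int block = 256;
    const unsigned int grid  = (count + block - 1) / block;  // 4 blocks; last has 232 live threads
    changePSRKernel<<<grid, block>>>(psr, w, count);
    printf("kernel status: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));

    cudaFree(psr);
    cudaFree(w);
    return 0;
}
```

Compiling with nvcc -lineinfo also makes cuda-memcheck report the offending source line rather than just the instruction offset.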
The refactor version of growth_cuda started working on otachi, where it was compiled with CUDA 9. It looks like CUDA 9 solved the issue. To confirm this, we will install CUDA 9 on raiju and verify that it runs there. The performance of the refactor growth_cuda was slower: running master and refactor growth_cuda on 100x100 neurons with a 100s x 2 epochs simulation, the master growth_cuda took 256s and the refactor growth_cuda took 333s. Also, the simulation output was not identical. We need to investigate this.
Confirmed that the refactor version of growth_cuda works on raiju with CUDA 9.
Ran growth_cuda (GPU) with 100x100 LIF neurons and the DS synapses dynamic connections model, 100s x 2 epochs.
Master version on raiju ----- 590s (1 GPU), 477s (2 GPUs)
Refactor version on raiju ----- 867s (1 GPU), 636s (2 GPUs)
Master version on otachi ----- 162s (1 GPU), 184s (2 GPUs)
Refactor version on otachi ----- 226s (1 GPU), 229s (2 GPUs)
Ran growth_cuda (GPU) with 1,000 IZH neurons and the spiking synapses static connections model, 1s x 1 epoch.
Master version on raiju ----- 35s (1 GPU)
Refactor version on raiju ----- 40s (1 GPU)
Master version on otachi ----- 13s (1 GPU)
Refactor version on otachi ----- 9s (1 GPU)
Ran growth (CPU) with 10x10 LIF neurons and the DS synapses dynamic connections model, 100s x 2 epochs.
Master version on raiju ----- 219s
Refactor version on raiju ----- 224s
Master version on otachi ----- 220s
Refactor version on otachi ----- 234s
Difference between LIF/DSS and IZH/SS: To me, this suggests that the performance hit might just be in the neurons, not the synapses.
Difference between single and multi-GPU on otachi: Seems like our simulation just isn't big enough to merit multiple V100s.
Speedup between raiju and otachi: slightly less than a 75% reduction in runtime for both master and refactor (e.g., for single-GPU growth_cuda, 590s to 162s is a 72.5% reduction and 867s to 226s is a 73.9% reduction).
The master-refactor branch is now merged into the master branch.
What kind of issue is this?
Right now, we have C++ class hierarchies and, in the .cpp files, both the CPU-side generic code and the CPU-side simulator code. Then, in a separate _d.cu file, we have all of the GPU simulator code. The goal of this issue is to use the CUDA_CALLABLE macro to produce integrated function source code that contains both the CPU and GPU simulation code. So the .cu files will go away, and the GPU simulation code will all live in the .cpp files, alongside the class of object being simulated.
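For reference, the usual definition of such a macro (a sketch of the standard idiom, assuming ours follows it; the changePSR signature is taken from the cuda-memcheck output above):

```cpp
// Under nvcc (__CUDACC__ defined), functions marked CUDA_CALLABLE are
// compiled for both host and device; under a plain C++ compiler the
// macro expands to nothing, so the same .cpp builds as ordinary CPU code.
#ifdef __CUDACC__
#define CUDA_CALLABLE __host__ __device__
#else
#define CUDA_CALLABLE
#endif

class AllSpikingSynapsesProps;   // forward declaration

class AllDSSynapses {
public:
    // One function body serves both the CPU and GPU simulation paths,
    // replacing the duplicate in the _d.cu file.
    CUDA_CALLABLE void changePSR(unsigned int iSyn, float deltaT,
                                 unsigned long simulationStep,
                                 AllSpikingSynapsesProps* props);
};
```

This is also why the macro should be a no-op for the CPU build: g++ never sees the __host__/__device__ qualifiers at all.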