UWB-Biocomputing / BrainGrid

A project to facilitate construction of high-performance neural simulations.
https://uwb-biocomputing.github.io/BrainGrid/
Apache License 2.0

growth_cuda crashes on raiju #137

Closed stiber closed 8 years ago

stiber commented 8 years ago

I wanted to do a quick comparison of BG runtime on raiju, to compare to our historical experiences with hydra. I used the following command line:

time ../growth_cuda -t tR_1.0--fE_0.90_10000.xml 

It appears that the crash happens at the point where it is doing a CUDA memcpy:

Done with simulation cycle, beginning growth update 12
an illegal memory access was encountered in ./cuda/AllIFNeurons_d.cu at line 249

Do we have a version of BG that runs on raiju? Would be good to determine if this is the case before 5/18. FWIW, it appears that raiju may be 5x as fast as hydra, which would give us around 100x speedup.

apw6 commented 8 years ago

Would the medium-100 test in validation be sufficient input?

stiber commented 8 years ago

That's a good question. Ideally, I'd like to do a "full blown" simulation. We know that the 10,000 neuron simulations took 1-2 weeks, depending on the data we captured. It seems like raiju should take maybe 12-24 hours, so this would be impressive. But it appears that the simulation crashes on raiju when the first synapses are created.

I just tried a 100-neuron simulation:

time ../growth_cuda -t tR_1.0--fE_0.90.xml

And got the same crash after 12 100-second epochs.

stiber commented 8 years ago

Note that this is a clone of the latest from refactor-stable-cuda.

apw6 commented 8 years ago

Given that it crashes on a memcpy, I tried running cuda-memcheck on our executable, and it seems quite upset. It repeatedly says that cudaLaunch() is returning error 9, which is cudaErrorInvalidConfiguration. The API description of said error is "This indicates that a kernel launch is requesting resources that can never be satisfied by the current device. Requesting more shared memory per block than the device supports will trigger this error, as will requesting too many threads or blocks. See cudaDeviceProp for more device limitations." It also lists addresses of host frames, presumably where these supposed errors are being generated, but I'm currently at a loss as to how to convert those to lines of source code.

apw6 commented 8 years ago

The source information should actually be readable from the memcheck if the proper debugging symbols are included during compilation. Section 2.4 in the cuda-memcheck api (http://docs.nvidia.com/cuda/cuda-memcheck/index.html#compilation-options) talks about these options. However, my attempts to incorporate them into the Makefile have been unsuccessful. I may require instruction.

fumik commented 8 years ago

I found the cause of the problem and am looking into it now. I will report back when I'm done.

Thanks,


fumik commented 8 years ago

Correction to my previous comment: I'm fixing it now, not just looking into it. I will report when I'm done.

fumik commented 8 years ago

An illegal (out-of-bounds) memory access in the device function changeDSSynapsePSR causes the crashes. It seems that iSyn (the synapse index) values may exceed the limit. I added an assertion to check the index value, but it doesn't hit the assertion; oddly, adding the assertion also magically fixes the problem. A similar thing happened when I added a printf statement in the device function. I ran growth_cuda with test-small-connected.xml on cssgpu01 and cssgpu02p and got identical results. This is a workaround, not a fix, and we still need to figure out the cause of the problem.
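
For reference, the kind of bounds assertion described above would look roughly like the sketch below. SynapseDeviceData and its fields are simplified stand-ins for BrainGrid's AllDSSynapses structure, and the typedefs are assumptions, not the project's actual definitions.

    // Hypothetical sketch of the index check described above; the struct and its
    // fields are simplified stand-ins for BrainGrid's synapse data on the device.
    #include <assert.h>
    #include <stdint.h>

    typedef float BGFLOAT;         // BrainGrid's single-precision default (assumed)
    typedef unsigned int BGSIZE;   // assumed width of the synapse index type

    struct SynapseDeviceData {
        BGSIZE  total_synapse_counts;   // number of synapses allocated on the device
        BGFLOAT *psr;                   // post-synaptic response, indexed by iSyn
    };

    __device__ void changeDSSynapsePSR(SynapseDeviceData* synapses, const BGSIZE iSyn,
                                       const uint64_t simulationStep, const BGFLOAT deltaT)
    {
        // The check that, oddly, made the crash disappear: trap any out-of-range
        // synapse index before it is used to read or write device memory.
        assert(iSyn < synapses->total_synapse_counts);

        synapses->psr[iSyn] += deltaT;  // placeholder update, not the real PSR calculation
    }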

fumik commented 8 years ago

cuda-memcheck repeatedly reported the following for every thread:

========= CUDA-MEMCHECK
========= Invalid global read of size 4
=========     at 0x000004b8 in changeDSSynapsePSR(AllDSSynapses*, unsigned int, unsigned long, float)
=========     by thread (193,0,0) in block (0,0,0)
=========     Address 0x13c13910d8 is out of bounds
=========     Device Frame:advanceSpikingSynapsesDevice(int, SynapseIndexMap*, unsigned long, float, AllSpikingSynapses*, void (*)(AllSpikingSynapses*, unsigned int, unsigned long, float)) (advanceSpikingSynapsesDevice(int, SynapseIndexMap*, unsigned long, float, AllSpikingSynapses*, void (*)(AllSpikingSynapses*, unsigned int, unsigned long, float)) : 0x1f0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x15859d]
=========     Host Frame:/usr/local/cuda-7.5/targets/x86_64-linux/lib/libcudart.so.7.5 [0x146ad]
=========     Host Frame:/usr/local/cuda-7.5/targets/x86_64-linux/lib/libcudart.so.7.5 (cudaLaunch + 0x143) [0x2ece3]
=========     Host Frame:./growth_cuda [0x23d30]
=========     Host Frame:./growth_cuda [0x23ab7]
=========     Host Frame:./growth_cuda [0x23afc]
=========     Host Frame:./growth_cuda [0x23746]
=========     Host Frame:./growth_cuda [0x16081]
=========     Host Frame:./growth_cuda [0x7499]
=========     Host Frame:./growth_cuda [0x7672]
=========     Host Frame:./growth_cuda [0x645a]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
=========     Host Frame:./growth_cuda [0x6bf1]

It tried to read an out-of-bounds memory region. Most likely we have an invalid synapse index and are referring to an illegal memory address. However, as I mentioned before, the assertion didn't catch the error. I also compiled and ran with the -G nvcc option (device code debug option); this time it worked OK. Therefore I suspected that there might be some timing issue involved. One scenario is that SynapseIndexMap was corrupted because of potential concurrency between device and host code. So, to synchronize device and host code, I added a cudaDeviceSynchronize() call after every kernel launch.

However, this didn't fix the problem. Needs more investigation.
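
For concreteness, the synchronization pattern described above is sketched here; the kernel name, its body, and the launch configuration are illustrative placeholders rather than BrainGrid's actual call site.

    // Sketch (assumed names): synchronize after each kernel launch so that any
    // asynchronous error surfaces immediately instead of at a later cudaMemcpy.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void advanceSynapsesKernel(float* psr, int totalSynapseCounts)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < totalSynapseCounts)
            psr[idx] *= 0.5f;   // placeholder work
    }

    void advanceSynapses(float* d_psr, int totalSynapseCounts)
    {
        const int threadsPerBlock = 256;
        const int blocksPerGrid = (totalSynapseCounts + threadsPerBlock - 1) / threadsPerBlock;

        advanceSynapsesKernel<<<blocksPerGrid, threadsPerBlock>>>(d_psr, totalSynapseCounts);

        // Block the host until the kernel finishes and report any error here.
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess)
            fprintf(stderr, "advanceSynapsesKernel: %s\n", cudaGetErrorString(err));
    }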

stiber commented 8 years ago

Verification activities today:

These indicate no problem, but device threads are still crashing in changeDSSynapsePSR().

Next steps:

Right now, it really seems like the value of iSyn in changeDSSynapsePSR() is somehow not getting loaded into internal processor registers. Doing something like assert() or printf() seems to force this; merely accessing iSyn, such as using it as an array index, doesn't. Using the -G option to nvcc seems to turn off whatever optimization is causing this. Assuming there's no problem with the index map being copied to the device, maybe we need to look at the PTX code for changeDSSynapsePSR() to see what changes.

fumik commented 8 years ago

Done: The results were identical and still caused the crash.

fumik commented 8 years ago

I suspected that there is an issue with calling a device function in a different module through a function pointer (see https://devtalk.nvidia.com/default/topic/543152/consistency-of-functions-pointer/?offset=6). So I modified the code and checked the following (a sketch of the function-pointer pattern in question appears after this list):

  1. Call the changeDSSynapsePSR device function directly from the advanceSpikingSynapsesDevice kernel function. This also crashes in changeDSSynapsePSR.
  2. Move all global and device functions for synapses into one single module. This also crashes.
  3. Do both: also call changeDSSynapsePSR directly. Now there is no crash. If the problem were caused by the cross-module device function call through a function pointer, case 1 above should have worked, so we need more investigation.
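
For context, here is a minimal sketch of the device-function-pointer pattern under suspicion. The names and PSR math are simplified placeholders, and unlike BrainGrid the device function and the kernel are shown in a single file for brevity; in the project they live in different .cu modules.

    // Sketch of the suspect pattern (assumed names): a __device__ function is
    // exposed through a __device__ function pointer, copied back to the host with
    // cudaMemcpyFromSymbol, and then passed into a kernel as an argument.
    #include <cuda_runtime.h>

    typedef void (*fpChangePSR_t)(float* psr, unsigned int iSyn, float deltaT);

    __device__ void changePSR(float* psr, unsigned int iSyn, float deltaT)
    {
        psr[iSyn] += deltaT;    // placeholder PSR update
    }

    // Device-side pointer to the device function, initialized at module load.
    __device__ fpChangePSR_t d_fpChangePSR = changePSR;

    __global__ void advanceSynapses(float* psr, int n, float deltaT, fpChangePSR_t fpChangePSR)
    {
        int iSyn = blockIdx.x * blockDim.x + threadIdx.x;
        if (iSyn >= n) return;

        fpChangePSR(psr, iSyn, deltaT);     // indirect call through the pointer (the crashing path)
        // changePSR(psr, iSyn, deltaT);    // direct call in the same module (the path that worked)
    }

    void launchAdvanceSynapses(float* d_psr, int n, float deltaT)
    {
        // Fetch the device function's address so it can be passed as a kernel argument.
        fpChangePSR_t h_fp;
        cudaMemcpyFromSymbol(&h_fp, d_fpChangePSR, sizeof(fpChangePSR_t));

        const int threads = 256;
        advanceSynapses<<<(n + threads - 1) / threads, threads>>>(d_psr, n, deltaT, h_fp);
    }
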
fumik commented 8 years ago

I set BGFLOAT to double and ran growth_cuda on raiju. It does not crash. Adding the -G NVCC option also did not cause the crash.

fumik commented 8 years ago

Compiling growth_cuda with NVCC release 7.5, V7.5.17 on cssgpu01 caused the crash. Is it worth trying the new CUDA 8 toolkit?

stiber commented 8 years ago

Since CUDA 8 is not release code, and almost assuredly would require installing the 8.0 device driver, I think it's unwise to go to it. Also, it seems that replacing references with local variables "fixes" the problem. So, let's go with that: take the references out of all GPU-side code (maybe testing a file or so at a time, to make sure that this doesn't introduce any new problems).
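
To make the requested change concrete, here is a minimal before/after sketch with assumed type and member names; the actual rewritten kernel appears in the next comment.

    // Sketch of the reference-to-local rewrite being requested (assumed names).
    typedef float BGFLOAT;

    struct SynapseDeviceData { BGFLOAT* psr; BGFLOAT* decay; };

    // Before: operate through a reference into device global memory.
    __device__ void decayPSRByReference(SynapseDeviceData* s, unsigned int iSyn)
    {
        BGFLOAT &psr = s->psr[iSyn];
        psr *= s->decay[iSyn];          // updates global memory through the reference
    }

    // After: copy to a local, work on the local, write the result back explicitly.
    __device__ void decayPSRByLocal(SynapseDeviceData* s, unsigned int iSyn)
    {
        BGFLOAT psr = s->psr[iSyn];
        BGFLOAT decay = s->decay[iSyn];
        psr *= decay;
        s->psr[iSyn] = psr;             // explicit write-back of the local value
    }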

fumik commented 8 years ago

I changed the changeDSSynapsePSR() device function in AllDSSynapses_d.cu, where the invalid memory read happened, so that it does not use references. Then it worked. Next I replaced references with local variables in the advanceSpikingSynapsesDevice() kernel function in AllSpikingSynapses_d.cu, which calls the changeDSSynapsePSR() device function, as below.

395 __global__ void advanceSpikingSynapsesDevice ( int total_synapse_counts, SynapseIndexMap* synapseIndexMapDevice, uint64_t simulationStep, const BGFLOAT deltaT, AllSpikingSynapses* allSynapsesDevice, void (*fpChangePSR)(AllSpikingSynapses*, const BGSIZE, const uint64_t, const BGFLOAT) ) {
396     int idx = blockIdx.x * blockDim.x + threadIdx.x;
397     if ( idx >= total_synapse_counts )
398         return;
399
400     BGSIZE iSyn = synapseIndexMapDevice->activeSynapseIndex[idx];
401
402     BGFLOAT psr = allSynapsesDevice->psr[iSyn];
403     BGFLOAT decay = allSynapsesDevice->decay[iSyn];
404
405     // Checks if there is an input spike in the queue.
406     bool isFired = isSpikingSynapsesSpikeQueueDevice(allSynapsesDevice, iSyn);
407
408     // is an input in the queue?
409     if (isFired) {
410         fpChangePSR(allSynapsesDevice, iSyn, simulationStep, deltaT);
411     }
412     // decay the post spike response
413     psr *= decay;
414
415     // write back all l-values in local variables
416     allSynapsesDevice->psr[iSyn] = psr;
417 }

Then I got the following error.

========= CUDA-MEMCHECK
========= Invalid global write of size 4
=========     at 0x00000238 in /home/NETID/fumik/BrainGrid/BrainGrid/./Synapses/AllSpikingSynapses_d.cu:416:advanceSpikingSynapsesDevice(int, SynapseIndexMap*, unsigned long, float, AllSpikingSynapses*, void (*)(AllSpikingSynapses*, unsigned int, unsigned long, float))
=========     by thread (193,0,0) in block (0,0,0)
=========     Address 0x131554a000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x15859d]
=========     Host Frame:/usr/local/cuda-7.5/targets/x86_64-linux/lib/libcudart.so.7.5 [0x146ad]
=========     Host Frame:/usr/local/cuda-7.5/targets/x86_64-linux/lib/libcudart.so.7.5 (cudaLaunch + 0x143) [0x2ece3]
=========     Host Frame:./growth_cuda [0x23c30]
=========     Host Frame:./growth_cuda [0x239b7]
=========     Host Frame:./growth_cuda [0x239fc]
=========     Host Frame:./growth_cuda [0x23646]
=========     Host Frame:./growth_cuda [0x1600b]
=========     Host Frame:./growth_cuda [0x7439]
=========     Host Frame:./growth_cuda [0x7612]
=========     Host Frame:./growth_cuda [0x63fa]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
=========     Host Frame:./growth_cuda [0x6b91]

stiber commented 8 years ago

Interesting that this happens with the write on line 416. Just because I'm anal, I would suggest changing the local variable "psr", declared and initialized on line 402, to something like "localPSR". Might as well do the same with decay. It shouldn't matter (though it may be an improvement in human readability), but since this doesn't make sense anyway, we should try it.

We may need to consider downgrading to the version 6 SDK. It would be interesting to try the v6 tools with the v7 drivers first, but worst case, we could downgrade both. Something to talk about at our next meeting.

stiber commented 8 years ago

Here's a suggestion from the NVIDIA discussions; something easy to try:

From the totality of the symptoms described, it sounds like a compiler bug may be in play here. You may also want to check for undefined, or implementation-defined, C/C++ behavior in the code, as that can be the cause of latent bugs that may then be exposed by compiler changes.

For a quick experiment, and as a potential workaround while you wait for resolution of your bug report with NVIDIA, I would suggest reducing the PTXAS optimization level. The default is -O3. Try reducing it to a less aggressive setting with -Xptxas -O2, then -Xptxas -O1 if that does not help, and finally -Xptxas -O0. If that makes the issue disappear, it usually does so with only a modest loss of performance, as all the high-level optimizations are still applied by NVVM.

fumik commented 8 years ago

Adding the -Xptxas -O0 flags did not fix the issue.

fumik commented 8 years ago

I tried the following:

  1. Make a separate data structure for each synapse class to store synapse properties in device memory. (Currently we use synapse classes where both methods and properties are defined.)
  2. Avoid inheriting properties from the parent class.
  3. Eliminate any polymorphism in kernel and device functions (that is, make an individual function for each synapse class, even if most of the code is duplicated).

Result: only no. 3 above fixes the issue.

So, as I mentioned before, the only safe way to fix the issue is to call the device function directly (not through a function pointer) in the same module.
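
A rough sketch of approach no. 3, with assumed and simplified names: each synapse class gets its own kernel in the same .cu module as its PSR device function, so every call is resolved at compile time even though the kernel bodies are duplicated.

    // Per-class kernels (assumed names): the device function and the kernel that
    // calls it live in the same module, so no function pointer is needed.
    typedef float BGFLOAT;
    typedef unsigned int BGSIZE;

    struct DSSynapseData   { BGFLOAT* psr; };
    struct STDPSynapseData { BGFLOAT* psr; };

    __device__ void changeDSSynapsePSR(DSSynapseData* s, BGSIZE iSyn, BGFLOAT deltaT)
    {
        s->psr[iSyn] += deltaT;                 // placeholder DS update
    }

    __device__ void changeSTDPSynapsePSR(STDPSynapseData* s, BGSIZE iSyn, BGFLOAT deltaT)
    {
        s->psr[iSyn] += 0.5f * deltaT;          // placeholder STDP update
    }

    // One kernel per synapse class; the bodies are duplicated, but each call is direct.
    __global__ void advanceDSSynapsesDevice(DSSynapseData* s, int n, BGFLOAT deltaT)
    {
        int iSyn = blockIdx.x * blockDim.x + threadIdx.x;
        if (iSyn < n) changeDSSynapsePSR(s, iSyn, deltaT);
    }

    __global__ void advanceSTDPSynapsesDevice(STDPSynapseData* s, int n, BGFLOAT deltaT)
    {
        int iSyn = blockIdx.x * blockDim.x + threadIdx.x;
        if (iSyn < n) changeSTDPSynapsePSR(s, iSyn, deltaT);
    }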

fumik commented 8 years ago

The latest commit (3762b1fae5a5f4f1aba63f079d6f2f8cff12904a) on the issue137 branch is stable for benchmarking, where:

  1. Made separate data structures for neurons and synapses to store properties in device memory. (Previously we stored these properties in the class structure, where data as well as member function pointers would be stored, and these function pointers were not used in kernel and device functions.)
  2. Added advanceSynapses() class member functions to the AllDSSynapses and AllDynamicSTDPSynapses classes, along with their corresponding kernel functions. The kernel functions now call the changeXXXPSR() device function in the same module directly. (This eliminates the function call through function pointers.)
  3. I confirmed that the latest build generates the same results as the main branch.

Some notes:

  1. For a small network such as 10x10, the simulation times on cssgpu01 and cssgpu02p are almost the same.
  2. For a large network such as 100x100, cssgpu02p is faster than cssgpu01 (almost twice as fast).
  3. We still use function pointers for creating synapses and for the preSpikeHit and postSpikeHit function calls. We cannot call these functions directly because they are called from the neuron and/or connection classes (not the synapse classes), so those classes don't know which class's kernel functions should be called.
  4. Currently these functions seem to work, but they have the same potential problem, so we need to consider another way to implement this.

To summarize, the current open question is:

  1. How to implement a device function call from a kernel function when the device function cannot be identified at link time. Currently this is done using function pointers (one possible alternative is sketched below).
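
One possible alternative, not something discussed in this thread: pass a plain enum identifying the synapse class into the kernel and switch on it in device code, so every call is direct and resolvable at compile time. A minimal sketch with assumed names:

    // Enum-based dispatch (assumed names, not BrainGrid code): the kernel receives
    // a class tag instead of a function pointer and branches to the right device
    // function, so all calls are direct.
    typedef float BGFLOAT;
    typedef unsigned int BGSIZE;

    enum SynapseClass { SYNAPSE_DS, SYNAPSE_DYNAMIC_STDP };

    __device__ void preSpikeHitDS(BGFLOAT* psr, BGSIZE iSyn)   { psr[iSyn] += 1.0f; }  // placeholder
    __device__ void preSpikeHitSTDP(BGFLOAT* psr, BGSIZE iSyn) { psr[iSyn] += 0.5f; }  // placeholder

    __device__ void preSpikeHit(SynapseClass cls, BGFLOAT* psr, BGSIZE iSyn)
    {
        switch (cls) {
            case SYNAPSE_DS:           preSpikeHitDS(psr, iSyn);   break;
            case SYNAPSE_DYNAMIC_STDP: preSpikeHitSTDP(psr, iSyn); break;
        }
    }

    // A neuron- or connection-side kernel only needs the tag, not a pointer to a
    // synapse-class device function.
    __global__ void advanceNeuronsDevice(SynapseClass cls, BGFLOAT* psr, int n)
    {
        int iSyn = blockIdx.x * blockDim.x + threadIdx.x;
        if (iSyn < n) preSpikeHit(cls, psr, iSyn);
    }
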
stiber commented 8 years ago

OK, I'm marking this as resolved; probably need to capture some of this discussion in the documentation.