Would the medium-100 test in validation be sufficient input?
That's a good question. Ideally, I'd like to do a "full blown" simulation. We know that the 10,000 neuron simulations took 1-2 weeks, depending on the data we captured. It seems like raiju should take maybe 12-24 hours, so this would be impressive. But it appears that the simulation crashes on raiju when the first synapses are created.
I just tried a 100-neuron simulation:
time ../growth_cuda -t tR_1.0--fE_0.90.xml
And got the same crash after 12 100-second epochs.
Note that this is a clone of the latest from refactor-stable-cuda.
Given that it crashes on a memcpy, I tried running cuda-memcheck on our executable, and it seems quite upset. It repeatedly says that cudaLaunch() is returning error 9, which is cudaErrorInvalidConfiguration. The API description of that error is "This indicates that a kernel launch is requesting resources that can never be satisfied by the current device. Requesting more shared memory per block than the device supports will trigger this error, as will requesting too many threads or blocks. See cudaDeviceProp for more device limitations." It also lists addresses of host frames, I presume where these supposed errors are being generated, but I'm currently at a loss as to how to convert those to lines of source code.
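As a quick sanity check on that hypothesis, a small standalone program along these lines (illustrative only; the requested block/grid sizes are placeholders, not the values growth_cuda actually computes) would show whether our launch configuration is anywhere near the device limits:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Placeholder launch parameters; substitute whatever growth_cuda actually requests.
    int threadsPerBlock = 256;
    int blocks = 1024;

    printf("device %s: maxThreadsPerBlock=%d (requested %d), maxGridSize[0]=%d (requested %d), sharedMemPerBlock=%zu bytes\n",
           prop.name, prop.maxThreadsPerBlock, threadsPerBlock,
           prop.maxGridSize[0], blocks, prop.sharedMemPerBlock);
    return 0;
}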
The source information should actually be readable from the memcheck if the proper debugging symbols are included during compilation. Section 2.4 in the cuda-memcheck api (http://docs.nvidia.com/cuda/cuda-memcheck/index.html#compilation-options) talks about these options. However, my attempts to incorporate them into the Makefile have been unsuccessful. I may require instruction.
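For reference, the options that section describes are, as far as I can tell, -G (full device-side debug info) or -lineinfo (line information only, keeps optimizations) so the device frames carry file/line data, plus -Xcompiler -rdynamic so host backtraces show function names. In the nvcc invocation that would look roughly like the following, where the remaining flags and sources are just whatever the Makefile already passes:

nvcc -g -lineinfo -Xcompiler -rdynamic <existing flags and sources> -o growth_cuda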
I found the cause of the problem and am fixing it now. I will report to you when I'm done.
Thanks,
An illegal memory access (out of bounds) in the device function changeDSSynapsePSR() causes the crashes. It seems that iSyn (the synapse index) values may exceed the limit. I added an assertion to check the index value, but it never fires; oddly, adding the assertion magically fixes the problem. The same thing happened when I added a printf statement in the device function. I ran growth_cuda with test-small-connected.xml on cssgpu01 and cssgpu02p and got identical results. This is only a workaround, and we still need to figure out the cause of the problem.
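The kind of device-side check involved is roughly the following sketch (not the actual BrainGrid code; BGSIZE is assumed here to be unsigned int, matching the demangled signatures below, and maxIndex is a placeholder for whatever bound the real synapse structure provides):

#include <assert.h>

typedef unsigned int BGSIZE;   // assumption: matches the demangled signatures below

// Sketch of a device-side range check; a failed assert() in device code
// (compute capability >= 2.0) aborts the kernel launch with an error.
__device__ void checkSynapseIndex(BGSIZE iSyn, BGSIZE maxIndex)
{
    assert(iSyn < maxIndex);
}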
cuda-memcheck repeatedly reported the following for every thread:
========= CUDA-MEMCHECK
========= Invalid global read of size 4
=========     at 0x000004b8 in changeDSSynapsePSR(AllDSSynapses*, unsigned int, unsigned long, float)
=========     by thread (193,0,0) in block (0,0,0)
=========     Address 0x13c13910d8 is out of bounds
=========     Device Frame:advanceSpikingSynapsesDevice(int, SynapseIndexMap*, unsigned long, float, AllSpikingSynapses*, void (*)(AllSpikingSynapses*, unsigned int, unsigned long, float)) (advanceSpikingSynapsesDevice(int, SynapseIndexMap*, unsigned long, float, AllSpikingSynapses*, void (*)(AllSpikingSynapses*, unsigned int, unsigned long, float)) : 0x1f0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x15859d]
=========     Host Frame:/usr/local/cuda-7.5/targets/x86_64-linux/lib/libcudart.so.7.5 [0x146ad]
=========     Host Frame:/usr/local/cuda-7.5/targets/x86_64-linux/lib/libcudart.so.7.5 (cudaLaunch + 0x143) [0x2ece3]
=========     Host Frame:./growth_cuda [0x23d30]
=========     Host Frame:./growth_cuda [0x23ab7]
=========     Host Frame:./growth_cuda [0x23afc]
=========     Host Frame:./growth_cuda [0x23746]
=========     Host Frame:./growth_cuda [0x16081]
=========     Host Frame:./growth_cuda [0x7499]
=========     Host Frame:./growth_cuda [0x7672]
=========     Host Frame:./growth_cuda [0x645a]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
It tried to read an out-of-bounds memory region. Most likely we have an invalid synapse index and are referencing an illegal memory address. However, as I mentioned before, the assertion didn't catch the error. I also added the -G nvcc option (device code debug option), recompiled, and ran; this time it worked OK. Therefore I suspect there may be a timing issue involved. One scenario is that SynapseIndexMap was corrupted because of potential concurrency between device and host code. So, to synchronize device and host code, I added the CUDA API call cudaDeviceSynchronize() after every kernel function call.
However, this didn't fix the problem. Needs more investigation.
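(For reference, the per-launch synchronization tried above follows the pattern sketched here, with a trivial stand-in kernel rather than BrainGrid code. One useful side effect of checking the return value at this point is that, without it, an asynchronous launch or execution error only surfaces at the next blocking call, such as a cudaMemcpy, which matches where the crash was originally observed.)

#include <cstdio>
#include <cuda_runtime.h>

// Trivial stand-in kernel; the point is the synchronize-and-check after each launch.
__global__ void dummyKernel() { }

int main()
{
    dummyKernel<<<1, 32>>>();

    // Block until the kernel finishes and report any asynchronous error here,
    // rather than letting it surface at a later, unrelated API call.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}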
Verification activities today:
These indicate no problem, but device threads are still crashing in changeDSSynapsePSR().
Next steps:
Right now, it really seems like the value of iSyn in changeDSSynapsePSR() is somehow not getting loaded into internal processor registers. Doing something like assert() or printf() seems to force this; merely accessing iSyn, such as using it as an array index, doesn't. Using the -G option to nvcc seems to turn off some optimization that's causing this. Assuming there's no problem with the index map being copied to the device, maybe we need to look at the PTX code for changeDSSynapsePSR() to see what changes.
Done: The results were identical and still caused the crash.
I suspected that there is an issue with calling a device function in a different module through a function pointer (see https://devtalk.nvidia.com/default/topic/543152/consistency-of-functions-pointer/?offset=6). So I modified the code and checked:
I set BGFLOAT to double and ran growth_cuda on raiju. It does not crash. Building with the -G nvcc option added also did not crash.
Compiling growth_cuda with NVCC release 7.5, V7.5.17 on cssgpu01 caused the crash. Is it worth trying the new CUDA 8 toolkit?
Since CUDA 8 is not release code, and almost assuredly would require installing the 8.0 device driver, I think it's unwise to go to this. Also, it seems that replacing references with local variables "fixes" the problem. So, let's go with that: take the references out of all GPU-side code (maybe testing a file or so at a time, to make sure that this doesn't introduce any new problems).
I changed the changeDSSynapsePSR() device function in AllDSSynapses_d.cu, where the invalid memory read happened, so that it does not use references. Then it worked. I then replaced references with local variables in the advanceSpikingSynapsesDevice() kernel function in AllSpikingSynapses_d.cu, which calls the changeDSSynapsePSR() device function, as shown below.
395 __global__ void advanceSpikingSynapsesDevice ( int total_synapse_counts, SynapseIndexMap* synapseIndexMapDevice, uint64_t simulationStep, const BGFLOAT deltaT, AllSpikingSynapses* allSynapsesDevice, void (*fpChangePSR)(AllSpikingSynapses*, const BGSIZE, const uint64_t, const BGFLOAT) ) {
396     int idx = blockIdx.x * blockDim.x + threadIdx.x;
397     if ( idx >= total_synapse_counts )
398         return;
399
400     BGSIZE iSyn = synapseIndexMapDevice->activeSynapseIndex[idx];
401
402     BGFLOAT psr = allSynapsesDevice->psr[iSyn];
403     BGFLOAT decay = allSynapsesDevice->decay[iSyn];
404
405     // Checks if there is an input spike in the queue.
406     bool isFired = isSpikingSynapsesSpikeQueueDevice(allSynapsesDevice, iSyn);
407
408     // is an input in the queue?
409     if (isFired) {
410         fpChangePSR(allSynapsesDevice, iSyn, simulationStep, deltaT);
411     }
412     // decay the post spike response
413     psr *= decay;
414
415     // write back all l-values in local variables
416     allSynapsesDevice->psr[iSyn] = psr;
417 }
Then I got the following error.
========= CUDA-MEMCHECK
========= Invalid global write of size 4
=========     at 0x00000238 in /home/NETID/fumik/BrainGrid/BrainGrid/./Synapses/AllSpikingSynapses_d.cu:416:advanceSpikingSynapsesDevice(int, SynapseIndexMap*, unsigned long, float, AllSpikingSynapses*, void (*)(AllSpikingSynapses*, unsigned int, unsigned long, float))
=========     by thread (193,0,0) in block (0,0,0)
=========     Address 0x131554a000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x15859d]
=========     Host Frame:/usr/local/cuda-7.5/targets/x86_64-linux/lib/libcudart.so.7.5 [0x146ad]
=========     Host Frame:/usr/local/cuda-7.5/targets/x86_64-linux/lib/libcudart.so.7.5 (cudaLaunch + 0x143) [0x2ece3]
=========     Host Frame:./growth_cuda [0x23c30]
=========     Host Frame:./growth_cuda [0x239b7]
=========     Host Frame:./growth_cuda [0x239fc]
=========     Host Frame:./growth_cuda [0x23646]
=========     Host Frame:./growth_cuda [0x1600b]
=========     Host Frame:./growth_cuda [0x7439]
=========     Host Frame:./growth_cuda [0x7612]
=========     Host Frame:./growth_cuda [0x63fa]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
=========     Host Frame:./growth_cuda [0x6b91]
Interesting that this happens on the write at line 416. Just because I'm anal, I would suggest changing the local variable "psr", declared and initialized on line 402, to something like "localPSR". Might as well do the same with "decay". It shouldn't matter (though it may improve human readability), but since this doesn't make sense anyway, we should try it.
We may need to consider downgrading to the version 6 SDK. It would be interesting to try the v6 tools with the v7 drivers first, but worst case, we could downgrade both. Something to talk about at our next meeting.
Here's a suggestion from the NVIDIA discussions; something easy to try:
From the totality of the symptoms described, it sounds like a compiler bug may be in play here. You may also want to check for undefined, or implementation-defined, C/C++ behavior in the code, as that can be the cause of latent bugs that may then be exposed by compiler changes.
For a quick experiment, and potential workaround while you wait for resolution of your bug report with NVIDIA, I would suggest reducing the PTXAS optimization level. The default is -O3. Try reducing it to a less aggressive setting with -Xptxas -O2, then -Xptxas -O1 if that does not help, and finally -Xptxas -O0. If that makes the issue disappear, it usually does so with only a modest loss of performance, as all the high-level optimizations are still applied by NVVM.
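Presumably that just means adding the flag to the nvcc invocation in the Makefile, along these lines, where the remaining flags and sources are whatever the build already uses:

nvcc -Xptxas -O2 <existing flags and sources> -o growth_cuda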
Adding the -Xptxas -O0 flags did not fix the issue.
I tried the following:
Result: only no. 3 above fixed the issue.
So, as I mentioned before, the only safe way to fix the issue is to call the device function directly (not through a function pointer) in the same module.
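To illustrate the difference with simplified types (a sketch only, not the actual BrainGrid signatures): the problematic pattern passes a device-function pointer into the kernel and calls through it, while the workaround calls a device function defined in the same .cu file directly.

// Sketch only; simplified types, not the real BrainGrid code.

// Problematic pattern: the device function is defined in another .cu file
// (separate compilation), and the kernel reaches it only through a pointer
// that was set up on the host and passed in as a kernel argument.
__device__ void changePSRExternal(float* psr, unsigned int i);   // defined elsewhere

__global__ void advanceViaPointer(float* psr, unsigned int n,
                                  void (*fpChangePSR)(float*, unsigned int))
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    fpChangePSR(psr, i);                  // indirect call through the pointer
}

// Workaround: definition and call site live in the same module; direct call.
__device__ void changePSRLocal(float* psr, unsigned int i)
{
    psr[i] *= 0.5f;                       // dummy body, for illustration only
}

__global__ void advanceDirect(float* psr, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    changePSRLocal(psr, i);               // direct call, no function pointer
}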
The latest commit (3762b1fae5a5f4f1aba63f079d6f2f8cff12904a) on the issue137 branch is stable for benchmarking, where:
Some notes:
So, to summarize the current question:
OK, I'm marking this as resolved; probably need to capture some of this discussion in the documentation.
I wanted to do a quick comparison of BG runtime on raiju against our historical experience with hydra. I used the following command line:
It appears that this is at the point where it is doing a cuda_memcopy.
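A quick way to confirm that is to check the return value of the copy itself; with a pending asynchronous kernel failure, the error string shows up there. A minimal standalone version of the check, with placeholder buffers and sizes:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Placeholder buffers; in BG this would be whatever the failing copy transfers.
    const size_t nBytes = 1024 * sizeof(float);
    float hostBuf[1024];
    float* devBuf = NULL;
    cudaMalloc((void**)&devBuf, nBytes);

    cudaError_t err = cudaMemcpy(hostBuf, devBuf, nBytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(devBuf);
    return 0;
}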
Do we have a version of BG that runs on raiju? Would be good to determine if this is the case before 5/18. FWIW, it appears that raiju may be 5x as fast as hydra, which would give us around 100x speedup.