NervanaSystems / neon

Intel® Nervana™ reference deep learning framework committed to best performance on all hardware
http://neon.nervanasys.com/docs/latest
Apache License 2.0

segment fault when import NervanaGPU #413

Closed xingjinglu closed 6 years ago

xingjinglu commented 6 years ago

The platform information is as follows:

1) OS: Ubuntu 16.04
2) CUDA: 9.0
3) GPU: TITAN Xp
4) pycuda installed with: pip3 install pycuda
5) Python version: 3.5.2

Importing the GPU backend crashes immediately:

```
>>> from neon.backends.nervanagpu import NervanaGPU
Segmentation fault (core dumped)
```

Any suggestions about the error?

Thanks!

xingjinglu commented 6 years ago

It seems python-pycuda does not work with CUDA 9.0.

I tested pycuda with its own test case: python test_driver.py

It causes a segmentation fault as well.

I will check the reason further.
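
To separate the two layers, here is a minimal check (a sketch of my own, assuming pycuda imports at all) that initializes CUDA through pycuda without touching neon; if this already segfaults, the problem is in pycuda/CUDA 9.0 rather than in neon itself:

```python
# Minimal check: does bare CUDA initialization through pycuda already
# segfault, without importing anything from neon?
import pycuda.driver as drv

drv.init()
print("CUDA driver version:", drv.get_version())

dev = drv.Device(0)
print("Device 0:", dev.name(), "compute capability:", dev.compute_capability())

# Creating and popping a context exercises basic driver functionality.
ctx = dev.make_context()
ctx.pop()
```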

xingjinglu commented 6 years ago

These errors happen when I install within Docker. I then installed pycuda and neon without Docker, and that works.

But when I try to run the openai-gemm benchmark, it raises other errors, as follows.

Are there any suggestions about this?

```
(.venv2) [@nmyjs_186_118 openai-gemm]$ ./benchmark.py
TITAN Xp
M N K Op OpenAI_32 cuBLAS_32 ratio_32 OpenAI_16 cuBLAS_16 ratio_16
Traceback (most recent call last):
  File "./benchmark.py", line 126, in <module>
    data = matmul(A, B, C, bench=True)
  File "/search/odin/luxingjing/project/openai-gemm/openai_gemm.py", line 73, in matmul
    kernel, params, dynamic_shared = _get_gemm_kernel(prefix, op, cda, cdb, cdc, m, n, k)
  File "<string>", line 2, in _get_gemm_kernel
  File "/search/odin/luxingjing/project/neon/.venv2/lib/python2.7/site-packages/pycuda/tools.py", line 430, in context_dependent_memoize
    result = func(*args)
  File "/search/odin/luxingjing/project/openai-gemm/openai_gemm.py", line 280, in _get_gemm_kernel
    kernel = get_kernel(base, opts)
  File "<string>", line 2, in get_kernel
  File "/search/odin/luxingjing/project/neon/.venv2/lib/python2.7/site-packages/pycuda/tools.py", line 430, in context_dependent_memoize
    result = func(*args)
  File "/search/odin/luxingjing/project/openai-gemm/openai_gemm.py", line 693, in get_kernel
    run_command([ "ptxas -v -arch", arch, "-o", cubin_file, ptx_file, ";" ] + maxas_i + [sass_file, cubin_file])
  File "/search/odin/luxingjing/project/openai-gemm/openai_gemm.py", line 622, in run_command
    raise RuntimeError("Error(%d):\n%s\n%s" % (proc.returncode, cmd, err))
RuntimeError: Error(2):
ptxas -v -arch sm_61 -o /search/speech/luxingjing/.cache/openai-gemm/sm_61/cubin/sgemm_128x128x8_NN_vec.cubin /search/speech/luxingjing/.cache/openai-gemm/sm_61/ptx/sgemm_128x128x8_NN_vec.ptx ; PERL5LIB=/search/odin/luxingjing/project/openai-gemm/maxas /search/odin/luxingjing/project/openai-gemm/maxas/maxas.pl -i -w -k sgemm_128x128x8_NN_vec -Dtype s -DNN 1 -Dvec 1 /search/odin/luxingjing/project/openai-gemm/sass/xgemm_128x128x8.sass /search/speech/luxingjing/.cache/openai-gemm/sm_61/cubin/sgemm_128x128x8_NN_vec.cubin
/bin/sh: line 1: 38826 Floating point exception ptxas -v -arch sm_61 -o /search/speech/luxingjing/.cache/openai-gemm/sm_61/cubin/sgemm_128x128x8_NN_vec.cubin /search/speech/luxingjing/.cache/openai-gemm/sm_61/ptx/sgemm_128x128x8_NN_vec.ptx
/search/speech/luxingjing/.cache/openai-gemm/sm_61/cubin/sgemm_128x128x8_NN_vec.cubin: No such file or directory at /search/odin/luxingjing/project/openai-gemm/maxas/MaxAs/Cubin.pm line 138.
```

xingjinglu commented 6 years ago

Digging further, the reason is that ptxas fails to compile the .ptx files into .cubin files. The error happens when I run openai-gemm on a TITAN Xp with CUDA 9, with neon as the backend.

The exact command is: python test.py

The more detailed command, and its failure, is:

```
ptxas -v -arch sm_61 -o /search/speech/luxingjing/.cache/openai-gemm/sm_61/cubin/sgemm_128x128x8_NN.cubin /search/speech/luxingjing/.cache/openai-gemm/sm_61/ptx/sgemm_128x128x8_NN.ptx
Floating point exception
```
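
To rule out openai-gemm's wrapper code, the failing ptxas step can be re-run by hand. A small sketch (the paths are copied from the log above; adjust them to your own ~/.cache/openai-gemm layout) that reports how the process died:

```python
# Re-run the failing ptxas invocation directly and report its exit status.
import subprocess

ptx = "/search/speech/luxingjing/.cache/openai-gemm/sm_61/ptx/sgemm_128x128x8_NN.ptx"
cubin = "/search/speech/luxingjing/.cache/openai-gemm/sm_61/cubin/sgemm_128x128x8_NN.cubin"

proc = subprocess.Popen(["ptxas", "-v", "-arch", "sm_61", "-o", cubin, ptx],
                        stderr=subprocess.PIPE)
_, err = proc.communicate()

# A negative return code is the terminating signal number;
# -8 is SIGFPE, matching the "Floating point exception" above.
print("return code:", proc.returncode)
print(err.decode())
```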

The content of sgemm_128x128x8_NN.ptx is as follows:

```
.version 5.0
.target sm_61
.address_size 64

// args: {'type': 's'}

.visible .entry sgemm_128x128x8_NN(
    .param .u64 param_C,
    .param .u64 param_A,
    .param .u64 param_B,
    .param .f32 param_alpha,
    .param .f32 param_beta,
    .param .u32 param_cda,
    .param .u32 param_cdb,
    .param .u32 param_cdc,
    .param .u32 param_m,
    .param .u32 param_n,
    .param .u32 param_k,
    .param .u32 param_blk_a,
    .param .u32 param_blk_b
)
.reqntid 256
{
    .shared .align 4 .b32 share[4228];
    ret;
}
```
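
For comparison, the same PTX can be handed to the CUDA driver's JIT through pycuda instead of the standalone ptxas binary. A sketch (assuming the PTX above is saved as sgemm_128x128x8_NN.ptx in the current directory); if the driver JIT accepts it while ptxas dies with SIGFPE, the fault is in the CUDA 9 ptxas binary rather than in the PTX itself:

```python
# Load the minimal PTX above through the driver JIT instead of
# the standalone ptxas binary.
import pycuda.autoinit  # creates a context on device 0
import pycuda.driver as drv

with open("sgemm_128x128x8_NN.ptx", "rb") as f:
    ptx = f.read()

# module_from_buffer hands the PTX to the driver, which JIT-compiles it.
mod = drv.module_from_buffer(ptx)
kernel = mod.get_function("sgemm_128x128x8_NN")
print("driver JIT compile OK:", kernel)
```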

yzhwang commented 6 years ago

I'm trying to compare openai-gemm with cuBLAS v9, and I get the same FPE error when using the following command line to build: make lib/c_interface.o. Is there any plan to make openai-gemm work with CUDA 9?

xingjinglu commented 6 years ago

https://github.com/xingjinglu/PerfAILibs/blob/master/README.md
