baidu-research / DeepBench

Benchmarking Deep Learning operations on different hardware
Apache License 2.0
Problems with NVIDIA Benchmarks #98

Open yl-jiang opened 6 years ago

yl-jiang commented 6 years ago

Environment:

  1. GPU cards: Tesla K80
  2. CUDA: 8.0
  3. cuDNN: 5.1
  4. OpenMPI: 1.10.2

Problems:

After running make, there are five binaries in .../nvidia/bin:

conv_bench gemm_bench nccl_mpi_all_reduce nccl_single_all_reduce rnn_bench

I can run 'rnn_bench' and 'nccl_single_all_reduce' successfully, but the other three fail:

  1. 'gemm_bench' aborts with the error "terminate called after throwing an instance of 'std::runtime_error'".
  2. 'conv_bench' stops during the 11th test with the error "terminate called after throwing an instance of 'std::runtime_error' what(): Illegal algorithm passed to get_fwd_algo_string. Algo: 7".
  3. 'nccl_mpi_all_reduce' fails with the error "terminate called after throwing an instance of 'std::runtime_error' what(): NCCL failure: invalid device pointer in nccl_mpi_all_reduce.cu at line: 86 rank: 0".

How can I fix it?

sharannarang commented 6 years ago

I haven't really tested DeepBench kernels for K80. Are you sure you compiled with the correct SM version? Are the drivers updated to run with CUDA 8.0?
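
For reference, the Tesla K80 has compute capability 3.7, so the matching nvcc target would be sm_37. Below is a minimal sketch of how the corresponding `-gencode` flag is assembled from a compute capability. The helper name `gencode_flag` is hypothetical and not part of DeepBench. How the flag reaches the DeepBench Makefile is not shown here.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of DeepBench): build the nvcc -gencode
// flag for a given compute capability, e.g. 3.7 for a Tesla K80.
std::string gencode_flag(int major, int minor) {
    assert(major >= 0 && minor >= 0);  // compute capabilities are non-negative
    const std::string cc = std::to_string(major) + std::to_string(minor);
    return "-gencode arch=compute_" + cc + ",code=sm_" + cc;
}
```

Calling `gencode_flag(3, 7)` yields `-gencode arch=compute_37,code=sm_37`.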

jfurtek commented 6 years ago

  1. As currently written, gemm_bench will fail on Kepler GPUs with CUDA 8 and later. cublasGemmEx() is only supported on GPUs with SM 5.0 or greater (i.e. Maxwell and newer). https://docs.nvidia.com/cuda/cublas/index.html#cublas-GemmEx
  2. Algo 7 is CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED, and DeepBench has a case statement for that in get_fwd_algo_string() when CUDNN_MAJOR >= 6. Maybe a pre-cuDNNv6 header file was in your include path?
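
Both points could be checked up front. The sketch below is plain C++ with no CUDA dependency. `supports_gemm_ex` encodes the SM 5.0 requirement from the cuBLAS documentation, and `fwd_algo_name` maps forward-algorithm numbers to names without throwing on unknown values. Both function names are hypothetical and are not DeepBench's own code. The numeric values are assumed to mirror the `cudnnConvolutionFwdAlgo_t` enum in `cudnn.h`. On a real system the compute capability would come from `cudaGetDeviceProperties`.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: cublasGemmEx() requires compute capability 5.0 or
// greater, so a Kepler card such as the K80 (3.7) fails this check.
bool supports_gemm_ex(int sm_major) {
    assert(sm_major >= 0);  // compute capability major is non-negative
    return sm_major >= 5;
}

// Hypothetical helper: name a cuDNN forward algorithm by its enum value.
// Values are assumed to mirror cudnnConvolutionFwdAlgo_t; unknown values
// yield a placeholder string instead of an exception.
std::string fwd_algo_name(int algo) {
    switch (algo) {
        case 0: return "IMPLICIT_GEMM";
        case 1: return "IMPLICIT_PRECOMP_GEMM";
        case 2: return "GEMM";
        case 3: return "DIRECT";
        case 4: return "FFT";
        case 5: return "FFT_TILING";
        case 6: return "WINOGRAD";
        case 7: return "WINOGRAD_NONFUSED";
        default: return "UNKNOWN(" + std::to_string(algo) + ")";
    }
}
```

With these, `supports_gemm_ex(3)` is false for the K80, and `fwd_algo_name(7)` returns a name rather than aborting the benchmark run.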

yl-jiang commented 6 years ago

I have changed the CUDA version to 7.5 and the cuDNN version to 5.0, and DeepBench can now run most of the benchmarks, but not 'nccl_mpi_all_reduce'.