Problems with NVIDIA Benchmarks

yl-jiang commented 6 years ago

Environment:

GPU cards: Tesla K80
CUDA: 8.0
cuDNN: 5.1
OpenMPI: 1.10.2

Problems:

After running make, there are five binaries in .../nvidia/bin:

conv_bench gemm_bench nccl_mpi_all_reduce nccl_single_all_reduce rnn_bench

I can successfully run 'rnn_bench' and 'nccl_single_all_reduce'.

But when I run 'gemm_bench', it gives me the error "terminate called after throwing an instance of 'std::runtime_error'".
Running 'conv_bench' stops at the 11th test with the error "terminate called after throwing an instance of 'std::runtime_error' what(): Illegal algorithm passed to get_fwd_algo_string. Algo: 7".
Running 'nccl_mpi_all_reduce' fails with the error "terminate called after throwing an instance of 'std::runtime_error' what(): NCCL failure: invalid device pointer in nccl_mpi_all_reduce.cu at line: 86 rank: 0".

How can I fix these errors?

sharannarang commented 6 years ago

I haven't really tested the DeepBench kernels on the K80. Are you sure you compiled with the correct SM version? Are the drivers updated to run with CUDA 8.0?
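As a rough illustration of the SM version question: the Tesla K80 has compute capability 3.7, so the matching nvcc target would be compute_37 / sm_37. The helper below is a minimal sketch that only builds the corresponding -gencode option string. The name gencode_flag is hypothetical and is not part of DeepBench, and how the DeepBench Makefile actually receives the architecture is not shown in this thread.

```cpp
#include <string>

// Hypothetical helper (not part of DeepBench): build the nvcc -gencode
// option for a GPU with the given compute capability, e.g. 3.7 for a K80.
std::string gencode_flag(int major, int minor) {
    // Compute capability 3.7 becomes the suffix "37".
    const std::string cc = std::to_string(major) + std::to_string(minor);
    return "-gencode arch=compute_" + cc + ",code=sm_" + cc;
}
```

For a K80, gencode_flag(3, 7) yields "-gencode arch=compute_37,code=sm_37".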

jfurtek commented 6 years ago

1.) As currently written, gemm_bench will fail for Kepler GPUs for CUDA 8 and later. cublasGemmEx() is only supported on GPUs with SM 5.0 or greater (i.e. Maxwell and newer). https://docs.nvidia.com/cuda/cublas/index.html#cublas-GemmEx
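The SM 5.0 requirement could be expressed as a small guard, sketched below. The function supports_gemm_ex is hypothetical and is not DeepBench code. In a real program the two numbers would typically come from the major and minor fields of cudaDeviceProp, as filled in by cudaGetDeviceProperties(); here they are plain parameters so the sketch stays free of CUDA headers.

```cpp
// Hypothetical guard (not DeepBench code): cublasGemmEx() is only supported
// on GPUs with compute capability 5.0 or greater (Maxwell and newer).
bool supports_gemm_ex(int sm_major, int sm_minor) {
    // Fold the capability into one number, e.g. 3.7 -> 37, 5.0 -> 50.
    const int capability = sm_major * 10 + sm_minor;
    return capability >= 50;
}
```

A K80 (compute capability 3.7) fails this check, which is consistent with gemm_bench throwing on Kepler.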

2.) Algo 7 is CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED, and DeepBench has a case statement for that in get_fwd_algo_string() when CUDNN_MAJOR >= 6. Maybe a pre-cuDNNv6 header file was in your include path?
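To make the mechanism concrete, here is a minimal sketch of a version-guarded switch of the kind described above. It is not the actual DeepBench source: fwd_algo_string_sketch and SKETCH_CUDNN_MAJOR are hypothetical names, with the macro standing in for the CUDNN_MAJOR value from the cuDNN header, and the returned strings are illustrative. The integer values follow the cudnnConvolutionFwdAlgo_t enum.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for CUDNN_MAJOR from the cuDNN header on the
// include path. Defaulting to 5 models a pre-v6 header.
#ifndef SKETCH_CUDNN_MAJOR
#define SKETCH_CUDNN_MAJOR 5
#endif

// Sketch only: maps a forward-convolution algorithm number to a name.
std::string fwd_algo_string_sketch(int algo) {
    switch (algo) {
        case 0: return "IMPLICIT_GEMM";
        case 1: return "IMPLICIT_PRECOMP_GEMM";
        case 2: return "GEMM";
        case 3: return "DIRECT";
        case 4: return "FFT";
        case 5: return "FFT_TILING";
        case 6: return "WINOGRAD";
#if SKETCH_CUDNN_MAJOR >= 6
        // Only compiled in when the header reports cuDNN v6 or newer.
        case 7: return "WINOGRAD_NONFUSED";
#endif
        default:
            throw std::runtime_error(
                "Illegal algorithm passed to get_fwd_algo_string. Algo: " +
                std::to_string(algo));
    }
}
```

With the guard evaluating to false, a library that selects algorithm 7 at run time falls through to the default branch and throws, which matches the message reported for conv_bench.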

yl-jiang commented 6 years ago

I have changed the CUDA version to 7.5 and the cuDNN version to 5.0, and now DeepBench can run most of the benchmarks, but not 'nccl_mpi_all_reduce'.

baidu-research / DeepBench
