NVlabs / NVBit


ASSERT FAIL: sass_lib.h:1064:void SassInstr::decode(): FAIL !(opcode_end != std::string::npos) #30

Closed cesar-avalos3 closed 3 years ago

cesar-avalos3 commented 3 years ago

Hello, I'm trying to run the DeepBench benchmarks with the opcode_hist tool included in NVBit, and I've encountered this assert error:

ASSERT FAIL: sass_lib.h:1064:void SassInstr::decode(): FAIL !(opcode_end != std::string::npos)

I'm using CUDA 11.0, the latest NVBit release (1.5), and a V100. These are the application parameters I used:

./conv_bench train half 7 7 832 16 128 5 5 2 2 1 1

Nsight Compute reveals that the offending kernel is called "volta_hcudnn_128x128_stridedB_splitK_small_nn_v1". The program finishes without issues when run without NVBit or, weirdly enough, when using the instr_count tool.

x-y-z commented 3 years ago

Do you mind uploading the binary? Which cudnn version are you using?

Thanks.

cesar-avalos3 commented 3 years ago

cuDNN 8, and the binary is attached conv_bench.zip

x-y-z commented 3 years ago

I couldn't reproduce the error locally with CUDA 11.0, cudnn 8.0.4, and driver r450.66. Are you using a different version of cudnn? The following command should tell us the full version number of cudnn:

ls -l /usr/local/cuda/lib64/libcudnn*

Can you try to use this instr_count tool? It should print out more information. [instr_count.so removed, since it does not trigger the bug.]

Thanks.

cesar-avalos3 commented 3 years ago

Looking inside lib64 I get libcudnn.so.8, and looking inside /include/cudnn_version.h I get:

#define CUDNN_MAJOR 8
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 0

I am using a different driver though, 455.23.05.

x-y-z commented 3 years ago

I tried cudnn 8.0.0 and still cannot reproduce. Let me know when you get new error messages by running the instr_count.so I provided above. Thanks.

x-y-z commented 3 years ago

Sorry, I misread your message. I can reproduce using opcode_hist.so. Working on it now.

x-y-z commented 3 years ago

BTW, the latest cudnn 8.0.4.30 does not have the issue. You could consider updating your cudnn first. Thanks.

cesar-avalos3 commented 3 years ago

I agree, that seems to be the more sensible option. I saw that 8.0.0 is flagged as a preview release, so it might have some issues. Thanks.

ovilla commented 3 years ago

Regardless of the possible issue with cudnn, I think NVBit should fail more gracefully (or not fail at all, since the app still runs even with possible bugs in cudnn). I am reopening the issue while we also work on a solution on the NVBit side. Thanks for reporting this.
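
To illustrate what "failing more gracefully" could look like, here is a minimal C++ sketch. It is not NVBit's SassInstr::decode() and does not reflect how sass_lib.h actually parses instructions. The helper name try_extract_opcode and the rule that an opcode ends at the first space or ';' are assumptions made only for this example. The idea is to report a failed search (std::string::npos) to the caller instead of asserting on it.

```cpp
#include <optional>
#include <string>

// Hypothetical helper, NOT NVBit's SassInstr::decode().
// Assumption for illustration: the opcode is the first token of the
// disassembled string and ends at the first space or ';'.
// Predicates such as "@P0" are not handled here.
std::optional<std::string> try_extract_opcode(const std::string& sass) {
    // Skip leading spaces; a blank string has no opcode.
    const std::size_t opcode_begin = sass.find_first_not_of(' ');
    if (opcode_begin == std::string::npos) {
        return std::nullopt;
    }

    // This is the kind of search that can come back as npos.
    // Instead of asserting, return "no result" to the caller.
    const std::size_t opcode_end = sass.find_first_of(" ;", opcode_begin);
    if (opcode_end == std::string::npos || opcode_end == opcode_begin) {
        return std::nullopt;
    }

    return sass.substr(opcode_begin, opcode_end - opcode_begin);
}
```

A caller could then skip or log an instruction it cannot parse and keep the application running, rather than aborting the whole process.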

ovilla commented 3 years ago

This should be solved in NVBit version 1.5.1. However, we strongly recommend using CUDA 11.1 (and not CUDA 11.0) when using cudnn >= 8.0.
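
The recommendation above can be written as a small check, sketched here in C++. The function name follows_recommendation is hypothetical, and reading "CUDA 11.1" as "CUDA 11.1 or later" is an assumption. The inputs use the usual encodings: CUDART_VERSION from the CUDA runtime headers is major * 1000 + minor * 10 (11010 for CUDA 11.1), and CUDNN_MAJOR comes from cudnn_version.h.

```cpp
// Hypothetical helper: checks the "CUDA 11.1 with cudnn >= 8.0" recommendation.
// cudart_version uses the CUDART_VERSION encoding: major * 1000 + minor * 10.
// cudnn_major is the value of CUDNN_MAJOR.
// Assumption: any CUDA release from 11.1 onward satisfies the recommendation.
bool follows_recommendation(int cudart_version, int cudnn_major) {
    constexpr int kCuda11_1 = 11010;  // CUDA 11.1 in CUDART_VERSION encoding

    // The recommendation only concerns cudnn 8.0 and newer.
    if (cudnn_major < 8) {
        return true;
    }
    return cudart_version >= kCuda11_1;
}
```

In a real build one might pass CUDART_VERSION and CUDNN_MAJOR directly. The setup reported in this thread (CUDA 11.0 with cudnn 8.0.0) would correspond to follows_recommendation(11000, 8).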