NVlabs / NVBit


RNNT error, operation not permitted when stream is capturing #104

Closed · mahmoodn closed this 1 year ago

mahmoodn commented 1 year ago

I am not able to use NVBit with the RNN-T benchmark from MLPerf Inference 2.0. Please see the output below:

(mlperf) mahmood@mlperf-inference-mahmood-x86_64:/work$ LD_PRELOAD=/work/util/tracer_nvbit/nvbit_release/tools/opcode_hist/opcode_hist.so ./build/bin/harness_rnnt \
--plugins="/work/build/plugins/RNNTOptPlugin/librnntoptplugin.so" \
--logfile_outdir="/work/build/logs/2022.11.02-17.19.24/mahmood2022_TRT/rnnt/Offline" \
--logfile_prefix="mlperf_log_" \
--performance_sample_count=2513 --audio_batch_size=256 --audio_buffer_num_lines=4096 \
--dali_batches_issue_ahead=4 --dali_pipeline_depth=4 --num_warmups=512 \
--raw_data_dir="/work/build/preprocessed_data/rnnt_dev_clean_500_raw" \
--raw_length_dir="/work/build/preprocessed_data/rnnt_dev_clean_500_raw/int32" \
--preprocessed_data_dir="/work/build/preprocessed_data/rnnt_dev_clean_512/fp16" \
--preprocessed_length_dir="/work/build/preprocessed_data/rnnt_dev_clean_512/int32" \
--val_map="/work/data_maps/rnnt_dev_clean_512/val_map.txt" \
--mlperf_conf_path="measurements/mahmood2022_TRT/rnnt/Offline/mlperf.conf" \
--user_conf_path="/work/measurements/mahmood2022_TRT/rnnt/Offline/user.conf" \
--batch_size=16 --cuda_graph=true --pipelined_execution=true --batch_sorting=true \
--enable_audio_processing=true --use_copy_kernel=true --streams_per_gpu=1 --audio_fp16_input=true \
--start_from_device=false --audio_serialized_pipeline_file="/work/build/bin/dali/dali_pipeline_gpu_fp16.pth" \
--scenario Offline --model rnnt --engine_dir="/work/build/engines/mahmood2022/rnnt/Offline"
------------- NVBit (NVidia Binary Instrumentation Tool v1.5.5) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
            NVDISASM = nvdisasm - override default nvdisasm found in PATH
            NOBANNER = 0 - if set, does not print this banner
---------------------------------------------------------------------------------
         INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
           INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
        KERNEL_BEGIN = 0 - Beginning of the kernel launch interval where to apply instrumentation
          KERNEL_END = 4294967295 - End of the kernel launch interval where to apply instrumentation
        TOOL_VERBOSE = 0 - Enable verbosity inside the tool
    COUNT_WARP_LEVEL = 1 - Count warp level or thread level instructions
    EXCLUDE_PRED_OFF = 0 - Exclude predicated off instruction from count
----------------------------------------------------------------------------------------------------
&&&& RUNNING RNN-T_Harness # ./build/bin/harness_rnnt
I1108 12:19:52.134712 34941 main_rnnt.cc:2903] Found 1 GPUs
[I] Starting creating QSL.
[I] Finished creating QSL.
[I] Starting creating SUT.
[I] Set to device 0
Dali pipeline creating..
Dali pipeline created
[I] Creating stream 0/1
[I] [TRT] [MemUsageChange] Init CUDA: CPU +534, GPU +0, now: CPU 962, GPU 2686 (MiB)
[I] [TRT] Loaded engine size: 81 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1244, GPU +348, now: CPU 2390, GPU 3036 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +180, GPU +60, now: CPU 2570, GPU 3096 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2596, GPU 3156 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2596, GPU 3164 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +232, now: CPU 0, GPU 232 (MiB)
[I] Created RnntEncoder runner: encoder
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2596, GPU 3398 (MiB)
[I] [TRT] Loaded engine size: 3 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2603, GPU 3406 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2603, GPU 3416 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 232 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2603, GPU 3420 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2603, GPU 3428 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1, now: CPU 0, GPU 233 (MiB)
[I] Created RnntDecoder runner: decoder
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2604, GPU 3428 (MiB)
[I] [TRT] Loaded engine size: 1 MiB
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntJointFc1 runner: fc1_a
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2604, GPU 3428 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntJointFc1 runner: fc1_b
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2604, GPU 3428 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2604, GPU 3436 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2604, GPU 3436 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntJointBackend runner: joint_backend
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2604, GPU 3436 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2604, GPU 3444 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2604, GPU 3454 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2604, GPU 3446 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2604, GPU 3454 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntIsel runner: isel
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2604, GPU 3454 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2, now: CPU 0, GPU 236 (MiB)
[I] Created RnntIgather runner: igather
[I] Instantiated RnntEngineContainer runner
cudaMemcpy blocking 
cudaMemcpy blocking 
[I] Instantiated RnntTensorContainer host memory
Stream::Stream sampleSize: 61440
Stream::Stream singleSampleSize: 480
Stream::Stream fullseqSampleSize: 61440
Stream::Stream mBatchSize: 16
kernel 0 - (anonymous namespace)::kernelShapeCopyH2D(int*, std::array<int, 256ul>, unsigned int) - #thread-blocks 1,  kernel instructions 31, total instructions 31
  EXIT = 5
  IMAD = 4
  IMAD.MOV.U32 = 4
  IMAD.WIDE.U32 = 1
  ISETP.GE.U32.AND = 4
  LDC = 1
  MOV = 1
  S2R = 8
  SHF.L.U32 = 1
  STG.E = 1
  ULDC.64 = 1
kernel 1 - __myl_bb0_1_Gat - #thread-blocks 2,  kernel instructions 0, total instructions 31
kernel 2 - void cutlass::Kernel<cutlass_80_wmma_tensorop_h161616gemm_32x32_128x2_tn_align8>(cutlass_80_wmma_tensorop_h161616gemm_32x32_128x2_tn_align8::Params) - #thread-blocks 40,  kernel instructions 123520, total instructions 123551
  BAR.SYNC.DEFER_BLOCKING = 1120
  BRA = 1440
  BSSY = 160
  BSYNC = 160
  CS2R = 2560
  EXIT = 320
  HMMA.16816.F16 = 7680
  HMUL2 = 640
  HSETP2.GEU.AND = 1280
  HSETP2.NE.AND = 160
  IADD3 = 11840
  IADD3.X = 5920
  IMAD = 4160
  IMAD.IADD = 2880
  IMAD.MOV.U32 = 4960
  IMAD.SHL.U32 = 2080
  IMAD.U32 = 2240
  IMAD.WIDE = 480
  IMAD.WIDE.U32 = 1600
  IMAD.X = 320
  IMNMX = 160
  ISETP.EQ.AND = 160
  ISETP.GE.AND = 1440
  ISETP.GE.U32.AND = 160
  ISETP.GT.AND = 640
  ISETP.LT.AND = 2880
  ISETP.NE.AND = 320
  ISETP.NE.AND.EX = 640
  ISETP.NE.U32.AND = 640
  LDG.E.LTC128B.128 = 5120
  LDG.E.U16 = 320
  LDS.64 = 320
  LDSM.16.M88.4 = 8000
  LEA = 6720
  LEA.HI = 1280
  LEA.HI.SX32 = 160
  LEA.HI.X = 640
  LOP3.LUT = 3840
  MOV = 1280
  P2R = 640
  PLOP3.LUT = 960
  PRMT = 2400
  R2P = 1280
  R2UR = 160
  S2R = 960
  S2UR = 480
  SEL = 3040
  SHF.R.S32.HI = 2240
  SHF.R.U32.HI = 1600
  SHFL.IDX = 160
  STG.E.64 = 320
  STS = 640
  STS.128 = 5120
  UIADD3 = 8320
  UIMAD = 640
  UIMAD.WIDE = 160
  UISETP.GE.AND = 160
  UISETP.GE.OR = 160
  UISETP.NE.AND = 480
  ULDC = 800
  ULDC.64 = 640
  ULDC.U16 = 480
  ULEA = 160
  ULEA.HI = 320
  ULOP3.LUT = 960
  UMOV = 1440
  USHF.L.U32 = 1120
  USHF.R.S32.HI = 960
...

kernel 27 - select3(int, int, int, int, bool*, __half*, __half*, __half*, __half*, int*, int*, __half*, __half*, int*) - #thread-blocks 20,  kernel instructions 1280, total instructions 1406510
  EXIT = 180
  IADD3 = 20
  IMAD = 180
  ISETP.GE.AND = 160
  ISETP.NE.AND = 20
  LDG.E.U8 = 20
  LEA.HI.X.SX32 = 20
  MOV = 160
  S2R = 500
  ULDC.64 = 20
Cuda error in function 'cudaDeviceSynchronize()' file 'opcode_hist.cu' in line 221 : operation not permitted when stream is capturing.

Any idea what to do? I have followed the guide from here to build the benchmarks and used the Docker version. The device is an RTX 3080, and according to nvidia-smi, the driver on the host (outside of Docker) is version 510.39.01 with CUDA 11.6.

ovilla commented 1 year ago

The opcode_hist tool injects a cudaDeviceSynchronize() at each kernel invocation, so that the opcode histogram can be printed after each kernel execution.

The application you are using performs stream capture (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#creating-a-graph-using-stream-capture), which forbids cudaDeviceSynchronize() within capture regions, hence the error.
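
For illustration, a minimal standalone program reproduces the restriction (a sketch, not code from the RNN-T harness): a device-wide synchronization issued while a stream is capturing fails with exactly this error string, and the capture itself is invalidated as a side effect.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void noop() {}

    int main() {
        cudaStream_t s;
        cudaGraph_t graph = nullptr;
        cudaStreamCreate(&s);

        /* from here on, work launched on 's' is recorded into a graph,
         * not executed */
        cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
        noop<<<1, 1, 0, s>>>();

        /* device-wide synchronization is forbidden during capture:
         * prints "operation not permitted when stream is capturing" */
        cudaError_t err = cudaDeviceSynchronize();
        printf("sync during capture: %s\n", cudaGetErrorString(err));

        /* the illegal call also invalidates the capture, so EndCapture
         * reports an error and returns no graph */
        cudaError_t end = cudaStreamEndCapture(s, &graph);
        printf("end capture: %s\n", cudaGetErrorString(end));

        if (graph) cudaGraphDestroy(graph);
        cudaStreamDestroy(s);
        return 0;
    }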

One solution could be to modify opcode_hist to avoid calling cudaDeviceSynchronize() in those regions.
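
A minimal sketch of that idea, assuming the launch stream is taken from the callback parameters (for cuLaunchKernel, nvbit_at_cuda_event receives a cuLaunchKernel_params struct whose hStream field is the launch stream); the histogram printing below stands in for the tool's existing logic:

    #include <cuda_runtime.h>

    /* true if 'stream' is inside a cudaStreamBeginCapture /
     * cudaStreamEndCapture region */
    static bool stream_is_capturing(cudaStream_t stream) {
        cudaStreamCaptureStatus status = cudaStreamCaptureStatusNone;
        return cudaStreamIsCapturing(stream, &status) == cudaSuccess &&
               status == cudaStreamCaptureStatusActive;
    }

    /* exit path of the kernel-launch callback, parameterized on the
     * launch stream, e.g. (cudaStream_t)p->hStream */
    static void on_kernel_exit(cudaStream_t launch_stream) {
        if (stream_is_capturing(launch_stream)) {
            /* the launch is only being recorded into a graph, not
             * executed, so there is nothing to wait for yet; its
             * instructions are counted when the graph runs later */
            return;
        }
        cudaDeviceSynchronize(); /* wait for the kernel as before */
        /* ... print and reset the opcode histogram here ... */
    }

The counts of kernels launched during capture would then surface only when the captured graph actually runs, folded into whatever launch triggers the next print, so per-kernel attribution becomes approximate around capture regions.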

mahmoodn commented 1 year ago

The code says:

            /* if we are exiting a kernel launch:
             * 1. Wait until the kernel is completed using
             * cudaDeviceSynchronize()
             * 2. Get number of thread blocks in the kernel
             * 3. Print the thread instruction counters
             * 4. Release the lock*/
            CUDA_SAFECALL(cudaDeviceSynchronize());

If I remove cudaDeviceSynchronize(), then I think the stats become messy: the tool would no longer wait for the kernel to finish before it reads the thread-block count and the instruction counters. Is that right?
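
In other words, some synchronization still seems necessary whenever the histogram is printed. A sketch of the capture-aware exit path I would expect (reusing the stream_is_capturing helper sketched above; p is assumed to be the cuLaunchKernel_params passed to the callback):

    cuLaunchKernel_params *p = (cuLaunchKernel_params *)params;
    cudaStream_t launch_stream = (cudaStream_t)p->hStream;
    if (!stream_is_capturing(launch_stream)) {
        /* waiting on the launch stream alone is enough for this
         * kernel's counters and is cheaper than a device-wide sync;
         * note this only checks the launch stream, so a concurrent
         * global-mode capture on another stream could still forbid it */
        CUDA_SAFECALL(cudaStreamSynchronize(launch_stream));
        /* ... counters are final here; print the histogram ... */
    }
    /* during capture: skip the print; the kernel has not actually run */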