NVIDIA / trt-samples-for-hackathon-cn

Simple samples for TensorRT programming
Apache License 2.0

Why do we need to run inference once before capturing the CUDA Graph? #76

Closed lix19937 closed 1 year ago

lix19937 commented 1 year ago

@wili-65535

Why do we need to run inference once before capturing the CUDA Graph?
https://github.com/NVIDIA/trt-samples-for-hackathon-cn/blob/master/cookbook/09-Advance/CudaGraph/CudaGraph.cpp#L167

Is it just to verify that the inference is working properly? Thanks.

wili-65535 commented 1 year ago

Some state of the TensorRT runtime needs to be initialized during the first enqueue, so we should run one inference before capturing the CUDA graph.
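
For example, the overall pattern looks roughly like this (a sketch, not the exact sample code: it assumes an IExecutionContext* context whose tensor addresses are already set, the enqueueV3 API of TensorRT 8.5+, and the CUDA 12 signature of cudaGraphInstantiate):

    context->enqueueV3(stream);             // warm-up run: lets TensorRT finish its lazy initialization
    cudaStreamSynchronize(stream);

    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    context->enqueueV3(stream);             // this enqueue is recorded into the graph, not executed
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graphExec, graph, 0);

    cudaGraphLaunch(graphExec, stream);     // replay the captured work
    cudaStreamSynchronize(stream);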

lix19937 commented 1 year ago

@wili-65535 Thanks. Can it be understood as: warm up and let the TRT context do some initialization?

Another question:
BTW, I studied your CUDA Graph sample. I think the graph records the program's operation flow and the addresses of all input/output parameters, so if I change any parameter's address, the graph capture will fail. A code snippet follows:

    const T* hA[]{
        attention_weights->query_weight.kernel,  // pointer
        attention_weights->key_weight.kernel,    // pointer
        attention_weights->value_weight.kernel,  // pointer
        nullptr,
        from_tensor,  // pointer, set in the init phase
        from_tensor,  // pointer, set in the init phase
        from_tensor,  // pointer, set in the init phase
        nullptr,
        q_buf_,       // pointer, set in the init phase
        k_buf_,       // pointer, set in the init phase
        v_buf_,       // pointer, set in the init phase
        nullptr};
    // Note: we assume the weights may differ between runs.
    // If we could preprocess these weights before inference, we could reduce
    // the overhead caused by cudaMemcpyAsync.
    cudaMemcpyAsync((void*)batch_qkv_kernel_ptr_, hA, sizeof(T*) * 12, cudaMemcpyHostToDevice, stream);
    cublas_wrapper_->batchedGemm(CUBLAS_OP_N,
                                 CUBLAS_OP_N,
                                 n,
                                 m,
                                 k,
                                 (const void* const*)batch_qkv_kernel_ptr_,
                                 n,
                                 (const void* const*)batch_qkv_input_ptr_,
                                 k,
                                 (void* const*)batch_qkv_buf_ptr_,
                                 n,
                                 3,
                                 stream);

The following kernel pointers change on every run, both the pointers themselves and the contents they point to:
attention_weights->query_weight.kernel, attention_weights->key_weight.kernel, attention_weights->value_weight.kernel
It seems this is not supported by CUDA Graph?

wili-65535 commented 1 year ago

Can it be understood as: warm up and let the TRT context do some initialization? -- Yes.

The graph records the program's operation flow and all input/output parameter addresses. If I change any parameter's address, the graph capture will fail. -- Yes. Referring to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#individual-node-update, CUDA provides APIs to update a graph when a parameter or an address changes. So, once you change any of them, you need to re-capture the graph or update it carefully.
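
As a rough illustration (not from the sample), one way to handle a changed pointer is to re-capture the same work and try cudaGraphExecUpdate before falling back to a full re-instantiation. enqueueWork here is a hypothetical callback that enqueues the same sequence of kernels/copies with the new addresses; the CUDA 12 signatures of cudaGraphExecUpdate and cudaGraphInstantiate are assumed:

    #include <cuda_runtime.h>

    void recaptureAndUpdate(cudaStream_t stream, cudaGraphExec_t &graphExec,
                            void (*enqueueWork)(cudaStream_t))
    {
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        enqueueWork(stream);  // recorded into the graph, not executed
        cudaStreamEndCapture(stream, &graph);

        cudaGraphExecUpdateResultInfo info;
        if (cudaGraphExecUpdate(graphExec, graph, &info) != cudaSuccess)
        {
            // The topology changed, so an in-place update is impossible:
            // fall back to destroying and re-instantiating the executable graph.
            cudaGraphExecDestroy(graphExec);
            cudaGraphInstantiate(&graphExec, graph, 0);
        }
        cudaGraphDestroy(graph);
    }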

lix19937 commented 1 year ago

@wili-65535 Thanks for your reply.