Closed: lix19937 closed this issue 1 year ago
Some internal state of the TensorRT runtime needs to be initialized during the first enqueue, so we should run one extra inference before starting the CUDA graph capture.
@wili-65535 Thanks. Can it be understood as: warm up and let the TRT context do some initialization?
Another question:
BTW, I studied your CUDA graph sample. My understanding is that the graph records the program's operation flow together with the addresses of all input/output parameters, so if I change any parameter's address, replaying the captured graph will fail. Some code as follows:
```cpp
const T* hA[]{
    attention_weights->query_weight.kernel,  // pointer
    attention_weights->key_weight.kernel,    // pointer
    attention_weights->value_weight.kernel,  // pointer
    nullptr,
    from_tensor,  // pointer, set in the init phase
    from_tensor,  // pointer, set in the init phase
    from_tensor,  // pointer, set in the init phase
    nullptr,
    q_buf_,  // pointer, set in the init phase
    k_buf_,  // pointer, set in the init phase
    v_buf_,  // pointer, set in the init phase
    nullptr};

// Note: here we assume the weights may differ between runs.
// If we could preprocess these weights before inference, we could avoid
// the overhead of this cudaMemcpyAsync.
cudaMemcpyAsync((void*)batch_qkv_kernel_ptr_, hA, sizeof(T*) * 12,
                cudaMemcpyHostToDevice, stream);

cublas_wrapper_->batchedGemm(CUBLAS_OP_N,
                             CUBLAS_OP_N,
                             n,
                             m,
                             k,
                             (const void* const*)batch_qkv_kernel_ptr_,
                             n,
                             (const void* const*)batch_qkv_input_ptr_,
                             k,
                             (void* const*)batch_qkv_buf_ptr_,
                             n,
                             3,
                             stream);
```
The following kernel pointers change on every run, both the pointers themselves and the contents they point to:
`attention_weights->query_weight.kernel`, `attention_weights->key_weight.kernel`, `attention_weights->value_weight.kernel`
So it seems this case cannot be supported by CUDA Graph?
> Can it be understood as: warm up and let the TRT context do some initialization?

Yes.

> The graph records the program's operation flow and the addresses of all in/out parameters. If I change any parameter's address, replaying the captured graph will fail.

Yes. Referring to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#individual-node-update, CUDA provides APIs to update a graph when parameters or addresses change. So, once you change any of them, you need to re-capture the graph or update it carefully.
@wili-65535 Thanks for your reply.
@wili-65535
Why do we need to run inference once before capturing the CUDA graph?
https://github.com/NVIDIA/trt-samples-for-hackathon-cn/blob/master/cookbook/09-Advance/CudaGraph/CudaGraph.cpp#L167
Is it just to verify that inference works properly? Thanks.