Open Valerianding opened 1 month ago
```python
kind = cudart.cudaMemcpyKind.cudaMemcpyHostToDevice
[cuda_call(cudart.cudaMemcpyAsync(inp.device[i], inp.host[i], inp.nbytes, kind, stream)) for inp in inputs]
cudart.cudaEventRecord(eventsBefore[i], stream)
cudart.cudaStreamWaitEvent(stream, eventsBefore[i], cudart.cudaEventWaitDefault)
context.execute_async_v2(bindings=bindings[i], stream_handle=stream)
cudart.cudaEventRecord(eventsAfter[i], stream)
cudart.cudaStreamWaitEvent(stream, eventsAfter[i], cudart.cudaEventWaitDefault)
kind = cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost
[cuda_call(cudart.cudaMemcpyAsync(out.host[i], out.device[i], out.nbytes, kind, stream)) for out in outputs]
```
The code is not right. Ops in the same stream are serial, so I think the `cudaEventRecord` / `cudaStreamWaitEvent` calls are not needed. Usually each stream goes with one TRT context in one thread.
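To illustrate the point above: events add ordering *between* streams; recording an event and waiting on it from the same stream is a no-op, because a stream already runs its own work in issue order. A minimal sketch, assuming the `cuda-python` bindings (the streams and event here are placeholders, not from the issue's code):

```python
from cuda import cudart

# Two independent streams and one event.
_, stream_a = cudart.cudaStreamCreate()
_, stream_b = cudart.cudaStreamCreate()
_, ev = cudart.cudaEventCreate()

# Useful: make stream_b wait until stream_a has reached this point.
cudart.cudaEventRecord(ev, stream_a)
cudart.cudaStreamWaitEvent(stream_b, ev, cudart.cudaEventWaitDefault)

# Redundant (the pattern in the snippet above): waiting on an event recorded
# on the SAME stream adds no ordering that the stream doesn't already have.
cudart.cudaEventRecord(ev, stream_a)
cudart.cudaStreamWaitEvent(stream_a, ev, cudart.cudaEventWaitDefault)
```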
> The code is not right. Ops in the same stream are serial, so I think the `cudaEventRecord` / `cudaStreamWaitEvent` calls are not needed. Usually each stream goes with one TRT context in one thread.
Yeap, I think so too, but why does this code go wrong? I will try 1 TRT context per stream; I think that might work. But I still want to know what goes wrong.
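For reference, "1 TRT context per stream" could be set up roughly like this. A hedged sketch, not the thread's actual code: it assumes a deserialized `engine` (`ICudaEngine`) and per-stream `bindings` built as in the description below.

```python
import tensorrt as trt
from cuda import cudart

NUM_STREAMS = 2  # assumption

# One private execution context per stream: a context holds per-inference
# state, so a single context must not serve concurrent executions.
streams, contexts = [], []
for _ in range(NUM_STREAMS):
    _, s = cudart.cudaStreamCreate()
    streams.append(s)
    contexts.append(engine.create_execution_context())

# Launch one inference per (context, stream) pair, then wait for all of them.
for i, (ctx, s) in enumerate(zip(contexts, streams)):
    ctx.execute_async_v2(bindings=bindings[i], stream_handle=s)
for s in streams:
    cudart.cudaStreamSynchronize(s)
```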
Description
Hi, I'm using multiple streams to improve TensorRT inference latency & throughput. Here's the inference code, which I modified from the TensorRT repo's example `common_runtime.py`:
The inputs are a `List[List]`, where `inputs[i][j]` is the memory of the i'th input tensor allocated for stream j, and the same for the outputs. The idea of the code is that every time we execute, we set new bindings (tensor addresses).
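The per-stream bindings described above can be assembled like this. A small sketch with a hypothetical `build_bindings` helper; plain integers stand in for device pointers so the indexing is easy to check:

```python
# Build per-stream binding lists: bindings[j] holds the device addresses of
# all input tensors for stream j, followed by all output tensors for stream j.
def build_bindings(inputs, outputs, n_streams):
    # inputs[i][j]: address of the i'th input tensor allocated for stream j
    return [
        [inp[j] for inp in inputs] + [out[j] for out in outputs]
        for j in range(n_streams)
    ]

# Two input tensors and one output tensor, each allocated for two streams.
inputs = [[0x100, 0x200], [0x300, 0x400]]
outputs = [[0x500, 0x600]]
bindings = build_bindings(inputs, outputs, n_streams=2)
# bindings[0] == [0x100, 0x300, 0x500]; bindings[1] == [0x200, 0x400, 0x600]
```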
But the inference results are wrong: each run gives a different result. After I inserted a `cudart.cudaStreamSynchronize(stream)` after the device-to-host `cudaMemcpyAsync` loop (`[cuda_call(cudart.cudaMemcpyAsync(out.host[i], out.device[i], out.nbytes, kind, stream)) for out in outputs]`), the results seem to be OK. So I checked the timeline using Nsight Systems and found there is a memcpy H2D during the inference time.
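The fix described above can be sketched as follows, reusing the names from the snippet in this issue (`cuda_call` is the error-checking helper from `common_runtime.py`, assumed to be in scope). This is a sketch of one plausible reading, not a verified fix:

```python
from cuda import cudart

def infer_on_stream(context, bindings, inputs, outputs, stream, i):
    # Host-to-device copies for stream i's input buffers.
    kind = cudart.cudaMemcpyKind.cudaMemcpyHostToDevice
    for inp in inputs:
        cuda_call(cudart.cudaMemcpyAsync(inp.device[i], inp.host[i], inp.nbytes, kind, stream))

    # Enqueue inference on the same stream; no events needed, since the
    # stream already serializes its own work.
    context.execute_async_v2(bindings=bindings[i], stream_handle=stream)

    # Device-to-host copies for stream i's output buffers.
    kind = cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost
    for out in outputs:
        cuda_call(cudart.cudaMemcpyAsync(out.host[i], out.device[i], out.nbytes, kind, stream))

    # Block until this stream has drained before the host reads out.host[i]
    # or refills inp.host[i]. Without this, the next iteration can overwrite
    # pinned host buffers while async copies are still in flight, which would
    # match the stray H2D copy seen mid-inference in the Nsight timeline.
    cuda_call(cudart.cudaStreamSynchronize(stream))
```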
But with the sync, there is no memcpy H2D during inference:
I want to know why, since I had already inserted CUDA events to prevent this. And even if I don't insert the CUDA events, just with the stream synchronize, the code seems to be OK (the second screenshot). I searched the internet, and it seems we need to use multiple contexts and MPS? Please help me if you know the best way to use multiple streams.
Environment
TensorRT Version: 8.6.2
GPU Type: A10
Nvidia Driver Version: 11.4
CUDA Version: 11.4
Operating System + Version: CentOS x86_64
Python Version (if applicable): 3.9