Open YouSenRong opened 1 year ago
After I skip the memcpy step on the host, there is no gap. However, there still seems to be synchronization between streams when I use the Gather op, which triggers the myelin compile.
When I replace the Gather op with the Tile op, the profile is as shown below. The synchronization between streams disappears.
@zhenhuaw-me ^ ^
@YouSenRong Thank you for reporting this issue!
It seems to me that you are trying to understand the internal behavior of TensorRT since you see a performance issue when using Gather in place of Tile. Is that true? (I assume yes.) It's natural that the behavior depends on the layers you use. You can go ahead with whichever one you want.
Could you please share your reproduction steps so we can analyze whether we can improve things for your case?
For the documentation and implementation you asked about: unfortunately, we don't document any CUDA stream behavior except the enqueue API. We also don't document behavior such as the myelin-related keywords you see in Nsight. They are internal implementation details, and depending on them may result in undefined behavior.
@zhenhuaw-me Thanks for your response.
First, please ignore the gap, as I found it is caused by the memcpy step on the host (from pageable memory to pinned memory).
Yes, I am using the Gather op to replace the Tile op, and I am using multiple streams to overlap the H2D, compute, and D2H steps. However, when I use the Gather op, it triggers the myelin compile, and myelinGraphExec seems to synchronize the D2H step across the streams. (PS: the myelin compile also makes building the engine take too long.)
I am not trying to understand the internal behavior; instead, I am trying to avoid the synchronization behavior and to report a potential bug in the myelin exec, i.e. the synchronization phenomenon.
Specifically, as shown in the figure, there are 4 myelinGraphExec calls since I use 4 separate streams, but the D2H steps (corresponding to myelinGetMemory) of all four myelinGraphExec calls occur after the first myelinGraphExec, which suggests that the D2H steps of the 4 streams are synchronized. (I don't know if my assumption is true, but that is what the Nsight profile shows.)
Regarding reproduction, I am sorry that I can't share the ONNX file; maybe you can use an MMoE model. The Tile op is used to broadcast some inputs, in other words, to pad some inputs to the batch size (from [1, x] to [batch_size, x]). The Gather op can also implement this broadcast behavior, and I replaced the Tile op with the Gather op in the ONNX graph. Both ops give correct results.
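For context, the [1, x] to [batch_size, x] broadcast that both ops implement can be sketched in NumPy (shapes and values here are illustrative, not taken from the actual model):

```python
import numpy as np

batch_size = 4
x = np.array([[10.0, 20.0, 30.0]])  # shape [1, 3], a single shared input row

# Tile: repeat the single row batch_size times along axis 0
tiled = np.tile(x, (batch_size, 1))  # shape [4, 3]

# Gather: index row 0 batch_size times along axis 0
# (this is the equivalent of ONNX Gather with indices = [0, 0, 0, 0])
indices = np.zeros(batch_size, dtype=np.int64)
gathered = np.take(x, indices, axis=0)  # shape [4, 3]

# Both formulations produce the same broadcasted tensor
assert tiled.shape == (batch_size, 3)
assert np.array_equal(tiled, gathered)
```

So the two graphs are numerically equivalent; the difference reported here is only in how TensorRT compiles and schedules them.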
@YouSenRong Yes, that "myelin sync" you see in the Nsight profile is syncing the exec, an internal behavior that is currently by design. And we are trying to improve the "long build time" issue you mentioned when "myelin" is involved.
there are 4 myelinGraphExec as I use 4 stream separately
That should just be accidentally the same :)
@zhenhuaw-me
that "myelin sync" you see in Nsight profile is syncing the exec - an internal behavior which is by design currently
Does that mean myelin synchronizes the exec (D2H) on different streams? I wonder whether myelin syncs the exec (e.g. memcpy, compute) across different streams. Thanks!
Description
I try to utilize multiple streams (e.g. 4 streams) to overlap the H2D, compute, and D2H time. However, there are gaps between the streams as shown in the profile. It seems that the myelin graph synchronizes the streams at the D2H stage or somewhere else, and thus degrades the pipelining of the multiple streams. PS: We use the Gather op in the graph; does this op make the difference?
Specifically, there are 4 myelinGraphExecute calls in different streams, but it seems that the myelinGraphUnload calls are done in the same stream, which synchronizes the 4 streams. As there is no documentation for myelin, could you please provide some information on the mechanism of myelinGraphExecute in a multi-stream setting with the Gather op? Thanks very much!
Environment
TensorRT Version: 8.4.3
NVIDIA GPU: Tesla T4
NVIDIA Driver Version: 460.73.01
CUDA Version: 11.6
CUDNN Version:
Operating System: Ubuntu 20.04
ONNX opset: 13
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Relevant Files
Steps To Reproduce