NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Does myelin of TensorRT synchronize the multi-streams? #2733

Open YouSenRong opened 1 year ago

YouSenRong commented 1 year ago

Description

I am trying to use multiple streams (e.g. 4 streams) to overlap the H2D, compute, and D2H time. However, there are gaps between the streams, as shown in the profile. It seems that the myelin graph synchronizes the streams at the D2H stage (or somewhere else), which degrades the multi-stream pipelining. PS: We use the Gather op in the graph; does this op make the difference?

[screenshot: Nsight Systems profile showing gaps between the streams]

Specifically, there are 4 myelinGraphExecute calls on different streams, but the myelinGraphUnload calls appear to run on the same stream, which synchronizes the 4 streams. As there is no documentation of myelin, could you please provide some information on the mechanism of myelinGraphExecute in multi-stream mode, with the Gather op? Thanks very much!

[screenshot: Nsight Systems profile of the myelinGraphExecute/myelinGraphUnload calls across the streams]
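For context, here is a minimal sketch (not the reporter's actual code) of the multi-stream setup being described, using the TensorRT Python API with PyCUDA. The `engine` object, the single-input/single-output binding layout, and the buffer shapes are assumptions; the point is one execution context plus one CUDA stream per in-flight request so that H2D, compute, and D2H can overlap:

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context on import
import pycuda.driver as cuda

NUM_STREAMS = 4

def run_pipelined(engine, in_shape, out_shape):
    """Enqueue H2D -> compute -> D2H on NUM_STREAMS independent streams."""
    streams = [cuda.Stream() for _ in range(NUM_STREAMS)]
    # One execution context per stream; a context must not be enqueued
    # concurrently from multiple streams.
    contexts = [engine.create_execution_context() for _ in range(NUM_STREAMS)]
    # Pinned (page-locked) host buffers so the async copies do not serialize.
    h_in = [cuda.pagelocked_empty(in_shape, np.float32) for _ in range(NUM_STREAMS)]
    h_out = [cuda.pagelocked_empty(out_shape, np.float32) for _ in range(NUM_STREAMS)]
    d_in = [cuda.mem_alloc(h.nbytes) for h in h_in]
    d_out = [cuda.mem_alloc(h.nbytes) for h in h_out]

    for i, s in enumerate(streams):
        cuda.memcpy_htod_async(d_in[i], h_in[i], s)        # H2D
        contexts[i].execute_async_v2(
            bindings=[int(d_in[i]), int(d_out[i])],
            stream_handle=s.handle)                        # compute
        cuda.memcpy_dtoh_async(h_out[i], d_out[i], s)      # D2H
    for s in streams:
        s.synchronize()
    return h_out
```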

Environment

TensorRT Version: 8.4.3
NVIDIA GPU: Tesla T4
NVIDIA Driver Version: 460.73.01
CUDA Version: 11.6
CUDNN Version:
Operating System: Ubuntu 20.04
ONNX opset: 13
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):

Relevant Files

Steps To Reproduce

YouSenRong commented 1 year ago

After I skip the memcpy step on the host, there is no gap. However, there still seems to be synchronization between streams when I use the Gather op, which triggers the myelin compilation.

[screenshot: Nsight Systems profile]

When I replace the Gather op with the Tile op, the profile looks as below; the synchronization between streams disappears.

[screenshot: Nsight Systems profile]

zerollzeng commented 1 year ago

@zhenhuaw-me ^ ^

zhenhuaw-me commented 1 year ago

@YouSenRong Thank you for reporting this issue!

It seems to me that you are trying to understand TensorRT's internal behavior, since you see a performance issue when replacing Tile with Gather. Is that right? (I'll assume yes.) It is natural that the behavior depends on the layers you use; you can go ahead with whichever op you want.

Could you please share your reproduction steps so we can analyze whether we can improve your case?

Regarding the documentation and implementation details you asked about: unfortunately, we don't document any CUDA stream behavior beyond the enqueue API. We don't document behavior such as the myelin-related keywords you see in Nsight; they are internal implementation details, and depending on them may result in undefined behavior.

YouSenRong commented 1 year ago

@zhenhuaw-me Thanks for your response.

First, please ignore the gap: I found that it is caused by the memcpy step on the host (from pageable memory to pinned memory).
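For anyone hitting the same gaps, a short sketch of the cause and the fix, with illustrative buffer sizes. When the source of a `cudaMemcpyAsync` is a pageable NumPy array, the copy is staged through a driver buffer and is not fully asynchronous; staging the data in a page-locked buffer yourself keeps the copy on the stream:

```python
import numpy as np
import pycuda.autoinit  # noqa: F401
import pycuda.driver as cuda

shape = (4, 1024)  # illustrative size
stream = cuda.Stream()

pageable = np.empty(shape, dtype=np.float32)       # regular pageable memory
pinned = cuda.pagelocked_empty(shape, np.float32)  # page-locked staging buffer
d_buf = cuda.mem_alloc(pinned.nbytes)

np.copyto(pinned, pageable)                    # the host-side copy measured above
cuda.memcpy_htod_async(d_buf, pinned, stream)  # truly async with a pinned source
stream.synchronize()
```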

Yes, I am using the Gather op to replace the Tile op, and using multiple streams to overlap the H2D, compute, and D2H stages. However, when I use the Gather op, it triggers the myelin compilation, and myelinGraphExec seems to synchronize the D2H step across the streams. (PS: the myelin compilation makes building the engine take a very long time.)

I am not trying to understand the internal behavior; rather, I am trying to avoid the synchronization behavior, and to report a potential bug in the myelin exec, i.e. the synchronization phenomenon.

Specifically, as shown in the figure, there are 4 myelinGraphExec calls since I use 4 separate streams, but the D2H copies (corresponding to myelinGetMemory) of all four myelinGraphExec calls start only after the first myelinGraphExec, which suggests that the D2H step of the 4 streams is synchronized. (I don't know whether my assumption is true, but that is what the Nsight profile shows.)

[screenshot: Nsight Systems profile showing the four myelinGraphExec calls]

For the reproduction, I am sorry that I can't share the ONNX file; maybe you can use an MMoE model. The Tile op is used to broadcast some inputs, in other words, to pad an input up to the batch size (from [1, x] to [batch_size, x]). The Gather op can also implement this broadcast, so I replaced the Tile op with the Gather op in the ONNX graph. Both ops give correct results.
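To make the equivalence concrete, here is a minimal sketch (hypothetical tensor names and sizes) of the Gather-based broadcast: Gather along axis 0 with an all-zero int64 index vector reads row 0 `batch_size` times, which matches Tile with repeats `[batch_size, 1]`:

```python
import onnx
from onnx import helper, TensorProto

x_dim, batch_size = 8, 4

# Indices [0, 0, 0, 0]: gathering row 0 four times broadcasts [1, x] -> [4, x].
indices = helper.make_tensor("idx", TensorProto.INT64, [batch_size],
                             [0] * batch_size)
gather = helper.make_node("Gather", inputs=["inp", "idx"], outputs=["out"], axis=0)

graph = helper.make_graph(
    [gather], "broadcast_via_gather",
    inputs=[helper.make_tensor_value_info("inp", TensorProto.FLOAT, [1, x_dim])],
    outputs=[helper.make_tensor_value_info("out", TensorProto.FLOAT,
                                           [batch_size, x_dim])],
    initializer=[indices])

model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
onnx.checker.check_model(model)
onnx.save(model, "broadcast_via_gather.onnx")
```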

zhenhuaw-me commented 1 year ago

@YouSenRong Yes, that "myelin sync" you see in the Nsight profile is syncing the exec - an internal behavior that is currently by design. And we are trying to improve the "long build time" issue you mentioned when "myelin" is involved.

there are 4 myelinGraphExec as I use 4 stream separately

That should be just a coincidence :)

YouSenRong commented 1 year ago

@zhenhuaw-me

that "myelin sync" you see in Nsight profile is syncing the exec - an internal behavior which is by design currently

Does that mean myelin synchronizes the exec (D2H) on different streams? I wonder whether myelin syncs the exec (e.g. memcpy, compute) across different streams. Thanks!