alibaba / BladeDISC

BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
Apache License 2.0

use multi-stream for TensorRT Engine Op #1305

Closed: zhyncs closed this issue 2 months ago

zhyncs commented 3 months ago

Is your feature request related to a problem? Please describe.

Hi all,

TL;DR: We want to use multi-stream for the TensorRT Engine Op.

Currently, the TensorRT Engine Op uses TensorFlow's default CUDA stream: https://github.com/alibaba/BladeDISC/blob/4d35390cc29cec95489c73a029e1ffc58b92ceae/tensorflow_blade/src/custom_ops/trt_engine_op/trt_engine_op.cc#L299-L311
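To make the single-stream behavior concrete, here is a minimal sketch of that pattern, assuming the op holds one shared IExecutionContext and enqueues every request on the stream TensorFlow hands it; the class and member names are illustrative, not the actual identifiers in trt_engine_op.cc:

```cpp
// Sketch only: all requests funnel through one IExecutionContext and one
// CUDA stream, so every enqueue is serialized on that stream.
#include <NvInfer.h>
#include <cuda_runtime_api.h>

class SingleStreamTrtOp {
 public:
  // `tf_stream` stands in for the compute stream TensorFlow provides to the
  // op; `bindings` are the device pointers for the engine's inputs/outputs.
  bool Run(void* const* bindings, cudaStream_t tf_stream) {
    // enqueueV2 launches the engine asynchronously on tf_stream. Because
    // context_ is shared by all callers, concurrent calls with different
    // streams would hit the undefined behavior described below.
    return context_->enqueueV2(bindings, tf_stream, /*inputConsumed=*/nullptr);
  }

 private:
  nvinfer1::IExecutionContext* context_ = nullptr;  // one shared context
};
```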

During stability testing, we found that the GPU driver would hang almost 100% of the time after running for a while. We pinpointed the problem with pstack and found it to be stream-related. The call stack looks like this:

stream_executor::gpu::GpuDriver::SynchronizeContext(stream_executor::gpu::GpuContext*)()
...
tensorflow::GPUUtil::SyncAll(tensorflow::Device*)()

Describe the solution you'd like

Calling enqueueV2() from the same IExecutionContext object with different CUDA streams concurrently results in undefined behavior. To perform inference concurrently in multiple streams, use one execution context per stream.

We want to use a multi-stream approach to avoid this issue: initialize instance_count IExecutionContext objects, each paired with its own stream, and synchronize around each context's enqueue.
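Here is a minimal sketch of that design, assuming a pool of instance_count slots where each slot pairs one IExecutionContext with its own CUDA stream; ContextPool, Slot, and Run are hypothetical names for illustration:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

#include <atomic>
#include <memory>
#include <mutex>
#include <vector>

// One IExecutionContext per stream, handed out round-robin, so no context
// is ever enqueued on two streams concurrently (per the TensorRT rule above).
class ContextPool {
 public:
  ContextPool(nvinfer1::ICudaEngine* engine, int instance_count) {
    for (int i = 0; i < instance_count; ++i) {
      auto slot = std::make_unique<Slot>();
      slot->context = engine->createExecutionContext();
      cudaStreamCreate(&slot->stream);
      slots_.push_back(std::move(slot));
    }
  }

  bool Run(void* const* bindings) {
    // Pick a slot round-robin; the per-slot mutex guarantees a given
    // context is never enqueued by two threads at once.
    Slot* slot = slots_[next_.fetch_add(1) % slots_.size()].get();
    std::lock_guard<std::mutex> lock(slot->mutex);
    if (!slot->context->enqueueV2(bindings, slot->stream,
                                  /*inputConsumed=*/nullptr)) {
      return false;
    }
    // Synchronize on this slot's stream before handing results back; this
    // is the per-context synchronization mentioned above.
    return cudaStreamSynchronize(slot->stream) == cudaSuccess;
  }

 private:
  struct Slot {
    nvinfer1::IExecutionContext* context = nullptr;
    cudaStream_t stream = nullptr;
    std::mutex mutex;
  };

  std::vector<std::unique_ptr<Slot>> slots_;
  std::atomic<size_t> next_{0};
};
```

One design note: holding the lock across cudaStreamSynchronize keeps a slot busy for the whole request, so throughput scales with instance_count; recording a CUDA event at enqueue time and waiting on it outside the lock would be a possible refinement.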

Do you have any suggestions? Thanks.

Describe alternatives you've considered

N/A

Additional context

N/A

zhyncs commented 3 months ago

Here's why part of this project's code (trt_engine_op, tf_blade) is used:

Currently, models for CTR scenarios such as search and recommendation are usually trained with TensorFlow or PyTorch, both in China and abroad, and we may want to use TensorRT to accelerate them. A CTR model trained with TensorFlow 2 can use TF2TRT in TensorFlow 2 directly, and a PyTorch-trained CTR model can use Torch-TensorRT directly. At large Chinese companies, however, TensorFlow 1.15 is still widely used for CTR scenarios, typically with a considerable number of custom ops. To accelerate with TensorRT there, we need something similar to TF2TRT for TensorFlow 1.15, but the implementation in TensorFlow 1.15 is too old and does not support many ops well.

We found that BladeDISC has relatively comprehensive support for this. We first use tf_blade to perform subgraph slicing, then use tensorflow-onnx and onnx-tensorrt for conversion, and finally use the trt_engine_op runtime for online inference. We made some adaptations and changes on top of this. Thanks, all.

zhyncs commented 2 months ago

After making these modifications, the online stability stress test no longer reproduced the previous issue. I am sharing this update here and closing the issue for now; it can be reopened if further discussion is needed. Thanks to this outstanding project. cc @ispobock