NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

flux-demo failure of TensorRT 10.5 when running on a single L40 GPU; how to run with two L40 GPUs #4205

Open algorithmconquer opened 3 weeks ago

algorithmconquer commented 3 weeks ago

Currently, when I run Flux on a device with a single L40 GPU, I encounter an OutOfMemory error. I have another machine with two L40 GPUs. How can I implement multi-GPU usage to run Flux?

lix19937 commented 3 weeks ago

You can split your model.

yuanyao-nv commented 3 weeks ago

also cc: @asfiyab-nvidia

algorithmconquer commented 3 weeks ago

@lix19937 How do I split the model for this issue? Could you provide relevant code and resources?

lix19937 commented 3 weeks ago

Like the following:

Assume model = cnn_backbone + cnn_neck + transformer_with_cnn_head.
Then you can export `cnn_backbone + cnn_neck` as onnx_a
and `transformer_with_cnn_head` as onnx_b,
then use trtexec to build `onnx_a -> plan_a`
and `onnx_b -> plan_b`.

plan_a runs on device 0, plan_b runs on device 1.
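
The trtexec step would look like `trtexec --onnx=onnx_a.onnx --saveEngine=plan_a.engine` (and likewise for onnx_b). As a rough sketch of the equivalent build step in the Python API, assuming TensorRT 10.x; the file names here are illustrative, not from the flux demo:

```python
# Hedged sketch: build a serialized TensorRT engine from one ONNX part.
# This mirrors what trtexec does with --onnx / --saveEngine.
import tensorrt as trt

def build_plan(onnx_path: str, plan_path: str) -> None:
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(0)  # explicit batch (TensorRT 10 default)
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError(f"failed to parse {onnx_path}")
    config = builder.create_builder_config()
    serialized = builder.build_serialized_network(network, config)
    with open(plan_path, "wb") as f:
        f.write(serialized)

build_plan("onnx_a.onnx", "plan_a.engine")  # cnn_backbone + cnn_neck
build_plan("onnx_b.onnx", "plan_b.engine")  # transformer_with_cnn_head
```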

In more detail:
Each ICudaEngine object is bound to a specific GPU when it is instantiated, either by the builder or on deserialization. To select the GPU, use cudaSetDevice() before calling the builder or deserializing the engine. Each IExecutionContext is bound to the same GPU as the engine from which it was created. When calling execute() or enqueue(), ensure that the thread is associated with the correct device by calling cudaSetDevice() if necessary.
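
To make the device binding concrete, here is a minimal Python sketch, assuming the `cuda-python` package and the illustrative plan names from above; input/output buffer setup is omitted:

```python
# Minimal sketch: bind each deserialized engine to its own GPU.
# Assumes TensorRT 10.x and the `cuda-python` package; plan file
# names are the illustrative ones from the earlier comment.
import tensorrt as trt
from cuda import cudart

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

def load_on_device(plan_path: str, device_id: int):
    # The engine is bound to whichever GPU is current at deserialization,
    # so select the device first.
    cudart.cudaSetDevice(device_id)
    with open(plan_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    # The execution context is bound to the same GPU as its engine.
    context = engine.create_execution_context()
    return engine, context

engine_a, ctx_a = load_on_device("plan_a.engine", 0)  # backbone + neck on GPU 0
engine_b, ctx_b = load_on_device("plan_b.engine", 1)  # transformer + head on GPU 1

# At inference time, make the calling thread current on the right GPU
# before enqueueing each context (tensor address binding omitted here):
cudart.cudaSetDevice(0)
# ... set ctx_a tensor addresses, then ctx_a.execute_async_v3(stream_a)
cudart.cudaSetDevice(1)
# ... set ctx_b tensor addresses, then ctx_b.execute_async_v3(stream_b)
```

Note that the intermediate activations produced by plan_a on device 0 must be copied (or made peer-accessible) to device 1 before plan_b can consume them.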