NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

flux-demo failure of TensorRT 10.5 when running on a single L40 GPU; how to run with two L40 GPUs #4205

Open algorithmconquer opened 3 weeks ago

algorithmconquer commented 3 weeks ago

Currently, when I run Flux on a device with a single L40 GPU, I encounter an OutOfMemory error. I have another machine with two L40 GPUs. How can I implement multi-GPU usage to run Flux?

lix19937 commented 3 weeks ago

You can split your model.

yuanyao-nv commented 3 weeks ago

also cc: @asfiyab-nvidia

algorithmconquer commented 3 weeks ago

@lix19937 How do I split the model for this issue? Could you provide relevant code and resources?

lix19937 commented 3 weeks ago

Like the following:

Assume model = cnn_backbone + cnn_neck + transformer_with_cnn_head.
Then you can export `cnn_backbone + cnn_neck` as onnx_a
and `transformer_with_cnn_head` as onnx_b,
then use trtexec to build `onnx_a -> plan_a`
and `onnx_b -> plan_b`.

plan_a runs on device 0, plan_b runs on device 1.
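
The trtexec step would look like `trtexec --onnx=onnx_a.onnx --saveEngine=plan_a.engine` (and likewise for onnx_b). As a rough sketch of the equivalent build step in the Python API, assuming TensorRT 10.x; the file names here are illustrative, not from the flux demo:

```python
# Hedged sketch: build a serialized TensorRT engine from one ONNX part.
# This mirrors what trtexec does with --onnx / --saveEngine.
import tensorrt as trt

def build_plan(onnx_path: str, plan_path: str) -> None:
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(0)  # explicit batch (TensorRT 10 default)
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError(f"failed to parse {onnx_path}")
    config = builder.create_builder_config()
    serialized = builder.build_serialized_network(network, config)
    with open(plan_path, "wb") as f:
        f.write(serialized)

build_plan("onnx_a.onnx", "plan_a.engine")  # cnn_backbone + cnn_neck
build_plan("onnx_b.onnx", "plan_b.engine")  # transformer_with_cnn_head
```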

In more detail:
Each ICudaEngine object is bound to a specific GPU when it is instantiated, either by the builder or on deserialization. To select the GPU, use cudaSetDevice() before calling the builder or deserializing the engine. Each IExecutionContext is bound to the same GPU as the engine from which it was created. When calling execute() or enqueue(), ensure that the thread is associated with the correct device by calling cudaSetDevice() if necessary.
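
To make the device binding concrete, here is a minimal Python sketch, assuming the `cuda-python` package and the illustrative plan names from above; input/output buffer setup is omitted:

```python
# Minimal sketch: bind each deserialized engine to its own GPU.
# Assumes TensorRT 10.x and the `cuda-python` package; plan file
# names are the illustrative ones from the earlier comment.
import tensorrt as trt
from cuda import cudart

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

def load_on_device(plan_path: str, device_id: int):
    # The engine is bound to whichever GPU is current at deserialization,
    # so select the device first.
    cudart.cudaSetDevice(device_id)
    with open(plan_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    # The execution context is bound to the same GPU as its engine.
    context = engine.create_execution_context()
    return engine, context

engine_a, ctx_a = load_on_device("plan_a.engine", 0)  # backbone + neck on GPU 0
engine_b, ctx_b = load_on_device("plan_b.engine", 1)  # transformer + head on GPU 1

# At inference time, make the calling thread current on the right GPU
# before enqueueing each context (tensor address binding omitted here):
cudart.cudaSetDevice(0)
# ... set ctx_a tensor addresses, then ctx_a.execute_async_v3(stream_a)
cudart.cudaSetDevice(1)
# ... set ctx_b tensor addresses, then ctx_b.execute_async_v3(stream_b)
```

Note that the intermediate activations produced by plan_a on device 0 must be copied (or made peer-accessible) to device 1 before plan_b can consume them.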