algorithmconquer opened this issue 3 weeks ago (status: Open)
You can split your model.
also cc: @asfiyab-nvidia
@lix19937 How do I split the model for this issue? Could you provide relevant code and resources?
Like the following:
assume model = cnn_backbone + cnn_neck + transformer_with_cnn_head
then you can export `cnn_backbone + cnn_neck` as onnx_a,
`transformer_with_cnn_head` as onnx_b,
then use trtexec to build `onnx_a -> plan_a`
and `onnx_b -> plan_b`,
then run plan_a on device 0 and plan_b on device 1.
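A minimal sketch of the export step, assuming the model exposes the three sub-modules named above and that the input and intermediate shapes are known (both are assumptions here, not details from this issue); the trtexec calls are shown as comments:

```python
# Sketch: split the model into two ONNX files, then build one plan per GPU.
# Assumes `model` exposes cnn_backbone / cnn_neck / transformer_with_cnn_head
# and an input shape of (1, 3, 1024, 1024) -- adjust both to your model.
import torch

class PartA(torch.nn.Module):          # cnn_backbone + cnn_neck
    def __init__(self, model):
        super().__init__()
        self.backbone, self.neck = model.cnn_backbone, model.cnn_neck
    def forward(self, x):
        return self.neck(self.backbone(x))

class PartB(torch.nn.Module):          # transformer_with_cnn_head
    def __init__(self, model):
        super().__init__()
        self.head = model.transformer_with_cnn_head
    def forward(self, feats):
        return self.head(feats)

model = load_model()                                  # hypothetical loader
dummy_in = torch.randn(1, 3, 1024, 1024)              # assumed input shape
part_a, part_b = PartA(model).eval(), PartB(model).eval()

with torch.no_grad():
    torch.onnx.export(part_a, dummy_in, "onnx_a.onnx", opset_version=17)
    dummy_feat = part_a(dummy_in)                     # intermediate tensor
    torch.onnx.export(part_b, dummy_feat, "onnx_b.onnx", opset_version=17)

# Then build one engine per ONNX file, e.g.:
#   trtexec --onnx=onnx_a.onnx --saveEngine=plan_a --fp16
#   trtexec --onnx=onnx_b.onnx --saveEngine=plan_b --fp16
```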
In more detail:
Each ICudaEngine object is bound to a specific GPU when it is instantiated, either by the builder or on deserialization. To select the GPU, use cudaSetDevice() before calling the builder or deserializing the engine. Each IExecutionContext is bound to the same GPU as the engine from which it was created. When calling execute() or enqueue(), ensure that the thread is associated with the correct device by calling cudaSetDevice() if necessary.
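As a concrete illustration (a sketch only, not verified against this issue's model), the same device-binding rule applied through the TensorRT Python API could look like the following; the quoted text above refers to the C++ API, and torch.cuda.set_device is used here merely as a convenient way to call cudaSetDevice for the current thread:

```python
# Sketch: deserialize plan_a on GPU 0 and plan_b on GPU 1 in one process.
import tensorrt as trt
import torch

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

def load_engine(plan_path, device_id):
    # The engine binds to whichever GPU is current at deserialization time,
    # so select the device first (this calls cudaSetDevice for this thread).
    torch.cuda.set_device(device_id)
    with open(plan_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    return engine, engine.create_execution_context()

engine_a, ctx_a = load_engine("plan_a", device_id=0)
engine_b, ctx_b = load_engine("plan_b", device_id=1)

# Pipeline outline (buffer allocation and binding setup omitted for brevity):
#   torch.cuda.set_device(0); run plan_a on the input -> intermediate features
#   copy the intermediate tensor from GPU 0 to GPU 1 (e.g. tensor.to("cuda:1"))
#   torch.cuda.set_device(1); run plan_b on the copied features -> final output
```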
Currently, when I run Flux on a machine with a single L40 GPU, I get an OutOfMemory error. I have another machine with two L40 GPUs. How can I use multiple GPUs to run Flux?