-
I am currently trying to retrain the BLIP-2 architecture on a multi-GPU setup using the default torch DDP implementation of the LAVIS library.
My training proceeds fine until some steps with consol…
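For context, a minimal sketch of a plain torch DDP training loop of the kind LAVIS drives under the hood (a hypothetical standalone example with a stand-in model, not LAVIS code), launched with `torchrun --nproc_per_node=<num_gpus>`:

```python
# Hypothetical minimal DDP loop; launch with: torchrun --nproc_per_node=4 train_ddp.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in model; in the real setup this would be the BLIP-2 model built by LAVIS.
    model = torch.nn.Linear(512, 512).to(f"cuda:{local_rank}")
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for step in range(100):
        x = torch.randn(8, 512, device=f"cuda:{local_rank}")
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across ranks here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```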
-
### Prerequisites
- [X] I have read the [ServerlessLLM documentation](https://serverlessllm.github.io/).
- [X] I have searched the [Issue Tracker](https://github.com/ServerlessLLM/ServerlessLLM/issue…
-
I ran mem_spd_test.py and got the following error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
I did not make any changes except …
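For reference, this error usually means some input tensor lives on a different GPU than the layer it is fed into. A minimal reproduction and fix in plain PyTorch (hypothetical, not taken from mem_spd_test.py):

```python
import torch

linear = torch.nn.Linear(16, 16).to("cuda:1")  # layer placed on GPU 1
x = torch.randn(4, 16, device="cuda:0")        # input created on GPU 0

# linear(x) would raise: "Expected all tensors to be on the same device ..."
# Fix: move the input to whatever device the layer's weights live on.
x = x.to(next(linear.parameters()).device)
y = linear(x)
print(y.device)  # cuda:1
```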
-
I wonder how to run it on a multi-GPU machine to accelerate its training?
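One simple way to do that in plain PyTorch, assuming the model fits on a single GPU and you only want to split batches, is `torch.nn.DataParallel` (DDP with one process per GPU generally scales better); a minimal sketch with a stand-in model:

```python
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for the actual model

if torch.cuda.device_count() > 1:
    # Single-process data parallelism: each forward pass splits the batch
    # across all visible GPUs and gathers the outputs on the default device.
    model = torch.nn.DataParallel(model)
model = model.cuda()

x = torch.randn(32, 1024).cuda()
out = model(x)  # each GPU processes a slice of the 32-sample batch
```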
-
Hello,
I have an error when running the Flux example script with multiple GPUs (H100). I tested with 2 and 4 GPUs. With only one GPU there is no error and generation works fine. I tested by varying `pipefu…
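As a point of comparison only (not the pipefusion path this script uses), a single-process multi-GPU run of Flux can be sketched with Hugging Face diffusers' `device_map="balanced"`, which spreads the pipeline's components over the visible GPUs; the model id and options below are assumptions:

```python
import torch
from diffusers import FluxPipeline

# Assumes diffusers >= 0.30 with accelerate installed; "balanced" shards the
# pipeline's components across all visible GPUs instead of replicating them.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # assumed model id
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)

image = pipe(
    "a photo of a mountain lake at dawn",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("flux_multi_gpu.png")
```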
-
We currently have an LLM engine built on TensorRT-LLM and are trying to evaluate different setups and the gains from each type.
I was trying to deploy the Llama model on a multi-GPU node, whereby between the 4 GPUs, I would hav…
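For what it's worth, with the high-level `LLM` API that recent TensorRT-LLM releases ship, a tensor-parallel deployment across the 4 GPUs could be sketched roughly as below; the model id and the exact API surface are assumptions about the setup:

```python
from tensorrt_llm import LLM, SamplingParams

# Assumption: a recent TensorRT-LLM release with the high-level LLM API.
# tensor_parallel_size=4 shards every layer's weights across the 4 GPUs.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model; replace with yours
    tensor_parallel_size=4,
)

params = SamplingParams(max_tokens=64, temperature=0.8)
for output in llm.generate(["Explain tensor parallelism in one sentence."], params):
    print(output.outputs[0].text)
```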
-
### Already reported? *
- [X] I have searched the existing open and closed issues.
### Regression?
No
### System Info and Version
Irrelevant; this is a documentation bug, not a software bug.
### Descrip…
-
🚀 The feature, motivation and pitch
# RFC: Multi-GPU Python Frontend API
This RFC compares and contrasts some ideas for exposing multi-GPU support in the Python frontend.
1. The current `multigpu_sc…
-
I have encountered a bug that invalidates multi-GPU training. The model replica stored on each GPU diverges from the others because the model's initialization is non-deterministic.
This happens f…
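A generic way to rule this out (a sketch, not this project's code) is to either seed every rank identically before building the model or broadcast rank 0's weights to the other ranks; note that torch's DistributedDataParallel already performs such a broadcast when it wraps the model:

```python
import torch
import torch.distributed as dist


def synchronize_initial_weights(model: torch.nn.Module) -> None:
    """Overwrite every rank's weights with rank 0's, so all replicas start identical.

    Assumes the process group is initialized and the model already sits on its GPU.
    """
    for tensor in model.state_dict().values():
        dist.broadcast(tensor, src=0)


def seed_everything(seed: int = 0) -> None:
    # Alternative: seed before building the model so every rank initializes identically.
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```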
-
Hello, I encountered some problems while using this code for multi-GPU training.
First I tried to run it with
"python3 train_dafnet.py --model_name "llama-2-7b" --device 0 --extra_device 1 2 3"
an…
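For context on what a primary device plus extra devices usually amounts to in practice, here is a rough sketch of sharding a causal LM across several GPUs with Hugging Face's `device_map` (hypothetical, not the actual logic of train_dafnet.py; the model id is assumed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed HF id for "llama-2-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Requires accelerate; device_map="auto" shards the layers over cuda:0..cuda:3
# so the 7B model does not have to fit on a single GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```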