-
Has anyone successfully got this running?
No combination of accelerator settings, or even changing the script to use gloo, has let me run the script to completion.
I did get to the point of b…
-
**Describe the bug**
I'm using the DeepSpeed MoE layer to build a multi-modal LLM. I'm using Phi-3 as the base model and replaced its MLP layer with the DeepSpeed MoE layer. However, when I enabled exper…
-
Hello,
I have been training a model with distributed PyTorch using the Hugging Face Trainer API. I am now training on a multi-node, multi-GPU Slurm cluster, and every GPU logs to the MLflow UI. Is th…
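
If the goal is to have only one process write to MLflow (an assumption, since the question is cut off), a common pattern is to guard logging calls by global rank. Below is a minimal sketch, assuming the launcher exports `RANK` (as `torchrun` does) or `SLURM_PROCID` (as `srun` does). The helper names `global_rank`, `is_main_process`, and `log_metrics_once` are hypothetical and are not part of the Trainer or MLflow APIs.

```python
import os


def global_rank(env=None):
    """Return this process's global rank, or 0 if no launcher variable is set."""
    env = os.environ if env is None else env
    # Assumption: RANK is exported by torchrun, SLURM_PROCID by srun.
    for name in ("RANK", "SLURM_PROCID"):
        value = env.get(name)
        if value is not None:
            return int(value)
    return 0


def is_main_process(env=None):
    """True only for the process with global rank 0."""
    return global_rank(env) == 0


def log_metrics_once(log_fn, metrics, step, env=None):
    """Call log_fn(metrics, step) on global rank 0 only; return True if it logged."""
    if not is_main_process(env):
        return False
    log_fn(metrics, step)
    return True
```

With MLflow, `log_fn` could be something like `lambda m, s: mlflow.log_metrics(m, step=s)`. When using the Hugging Face Trainer directly, `trainer.is_world_process_zero()` gives the same rank-0 check.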
-
Is it possible to use a split/multi-domain setup on a single GPU, or is the domain-to-device relation fixed? While using multiple GPUs works very well, and all the swapping in/out this would result in …
-
It's a great job!
But I noticed that your results were obtained on 4 NVIDIA A40 GPUs.
This is very counterintuitive: fewer mirrors of the training view will increase the burden of training.
Can i…
-
Hello,
We're trying to use PyTorch on our furniture dataset but keep encountering mode collapse. I have attached the 5th and produced images. I've read through other folks' strategies with the same is…
-
```
def build_model(cfg):
    """
    Builds the video model.
    Args:
        cfg (configs): configs that contains the hyper-parameters to build the
        backbone. Details can be seen in sl…
-
I have an interesting observation. If I add a few Crop layers to the MobileNet model, it becomes more than 3x slower. I have a multi-GPU setup. Is this happening because Crop layers are n…
-
Hi, when I run multi-node training (multiple nodes, multiple GPUs per node, using PyTorch 2.0 and PyTorch Lightning), the training hangs at some point with an error 12 message (see attachment).
I am…
-
```
This is an enhancement rather than an issue.
It would be nice to have several functions operate across multiple GPUs:
double gpuSumProd(vec, vec): this would compute the dot product of two vecto…