-
### Description
Hello everyone,
I'm new to t2t and TensorFlow. I tried to use t2t to run the transformer_moe model on 2 machines, but failed. Each machine has only one GPU. Hope you guys could help…
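For context, t2t's distributed mode is driven by TensorFlow's `TF_CONFIG` environment variable. Below is a minimal sketch of how a two-machine cluster spec might be assembled; the host names, ports, and task assignment are placeholder assumptions, not values from this report:

```python
import json
import os

# Hypothetical addresses for the two machines (one GPU each).
cluster = {
    "master": ["machine-a:2222"],
    "ps": ["machine-b:2222"],
}

# Each machine exports TF_CONFIG with its own role before starting
# t2t-trainer; this example is for the master (index 0).
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": "master", "index": 0},
})
```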
-
Exciting work. 👍 I'm trying to run the dist_train.sh script you provided and I get the error below. I only have one GPU. Is the error caused by that? Or is the script only for distributed training? Is there …
-
Hi, size. I'm using my own dataset to reproduce your work. I noticed you used Slurm for training, but I can only use distributed training with dist_train.sh for my own project. But there …
-
## 🐛 Bug
There is an error when training the falcon-7b model with thunder_cudnn.
### To Reproduce
Start a docker container:
```
mkdir -p output
docker run --pull=always --gpus all --ipc=host -…
```
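The repro command is truncated above. For readers, here is a hedged sketch of the pattern this bug exercises, i.e. jitting a model with thunder and a cuDNN-backed executor; the string-based `executors` argument and the `"cudnn"` name are assumptions inferred from the thunder_cudnn label, not confirmed API:

```python
import torch
import thunder

# Toy stand-in for falcon-7b; the real repro uses the full model.
model = torch.nn.Linear(16, 16).cuda()

# thunder.jit traces the module; the executors list is assumed to
# control which backends may claim operations during execution.
jitted = thunder.jit(model, executors=["cudnn"])

out = jitted(torch.randn(4, 16, device="cuda"))
```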
-
I have a problem training the model with my own dataset when using distributed mode. I want to train the model on 2 GPUs, and the message I get is:
RuntimeError: Expected to have finished reduction …
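This error typically means some parameters received no gradient in an iteration (e.g. a conditional branch or an unused head). A minimal sketch of the workaround PyTorch itself suggests in the full error text, with a toy model standing in for the real one:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the script is launched with torchrun so RANK/WORLD_SIZE
# are set; "nccl" is the usual backend for GPU training.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(8, 8).cuda()
ddp_model = DDP(
    model,
    device_ids=[local_rank],
    find_unused_parameters=True,  # tolerates params with no grad
)
```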
-
## Description
As described in [PyTorch Lightning documentation](https://pytorch-lightning.readthedocs.io/en/1.4.9/advanced/multi_gpu.html), the logs need to be synchronised using `sync_dist=True`.
…
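To make the referenced fix concrete, here is a minimal sketch of a LightningModule logging with `sync_dist=True`; the module itself is a toy example:

```python
import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).mean()
        # sync_dist=True reduces the logged value across all
        # processes instead of logging only rank 0's local value.
        self.log("train_loss", loss, sync_dist=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)
```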
-
## Description
The CUJ looks like:
```
envd run --image xx --replicas 20
```
Then there will be one interactive shell, and users can type a command, which will run in all replicas.
Then …
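Purely as an illustration of this CUJ (not envd's actual implementation), the fan-out could look like broadcasting each typed command to every replica; the host names and the ssh transport below are assumptions:

```python
import subprocess

replicas = [f"replica-{i}" for i in range(20)]  # hypothetical hosts

while True:
    cmd = input("envd> ")
    if cmd in ("exit", "quit"):
        break
    for host in replicas:
        # Run the same command on every replica (via ssh here).
        subprocess.run(["ssh", host, cmd])
```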
-
I want to train VGG16_ImageNet_Distributed.py on multiple nodes using mpiexec (two GPUs on one node),
so I followed the instructions in https://docs.microsoft.com/en-us/cognitive-toolkit/Multiple-GPUs-and-…
-
The script from `nbs/examples/distrib.py` is not working.
First, the mixed-precision callback is now just `fp_16`, since native mixed precision has been integrated into fastai.
Second, I am getting a `RuntimeError: No grad acc…
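For comparison, here is a hedged sketch of the distributed + mixed-precision pattern in current fastai; the dataset and learner are toy stand-ins, and it assumes the script is launched across processes with e.g. `torchrun --nproc_per_node=2 script.py`:

```python
from fastai.vision.all import *
from fastai.distributed import *

path = untar_data(URLs.MNIST_TINY)          # small sample dataset
dls = ImageDataLoaders.from_folder(path)

# to_fp16() enables fastai's native mixed-precision training.
learn = vision_learner(dls, resnet18, metrics=accuracy).to_fp16()

# distrib_ctx wraps the model in DistributedDataParallel when run
# across multiple processes; it is a no-op on a single process.
with learn.distrib_ctx():
    learn.fit_one_cycle(1)
```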