-
**Describe the bug**
We were trying to train a MoE model (DeepSpeed experts = 2, expert size 8B) on 2 A100 (40 GB) nodes with ZeRO stage 2.
- Training fails when using RDMA with a model constructed by deepsp…
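For reference, a minimal ZeRO stage 2 setup sketch, assuming `deepspeed.initialize` with a config dict; the batch sizes, optimizer settings, and the toy model are placeholders, not the 8B-expert MoE from this report:

```python
import deepspeed
import torch

# Placeholder ZeRO stage 2 config (not the actual run's settings).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,            # partition optimizer states and gradients
        "overlap_comm": True,  # overlap reduction with the backward pass
    },
}

model = torch.nn.Linear(1024, 1024)  # stand-in for the MoE model
# Launch with e.g.: deepspeed --num_gpus 2 this_script.py
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```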
-
### Software environment
```Markdown
- paddlepaddle:
- paddlepaddle-gpu: 2.6
- paddlenlp: 2.7.1.post0
```
### Duplicate check
- [X] I have searched the existing issues
### Error description
```Markdown
Under normal circumstances, enabling --amp_m…
-
Is there a doc introducing the usage of distributed parameters like "num_shards, shard_id, run_id, distributed_transport and distributed_interfaces, etc."?
It seems there is not even a terminology explana…
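For what it's worth, `num_shards`/`shard_id` parameters conventionally split a dataset across workers so each one reads a disjoint slice. A generic, hypothetical sketch of that convention (not this library's actual implementation):

```python
# Hypothetical illustration of the usual num_shards / shard_id convention:
# worker shard_id (out of num_shards) sees every num_shards-th sample.
def shard(samples, num_shards, shard_id):
    assert 0 <= shard_id < num_shards
    return samples[shard_id::num_shards]

data = list(range(10))
print(shard(data, num_shards=2, shard_id=0))  # [0, 2, 4, 6, 8]
print(shard(data, num_shards=2, shard_id=1))  # [1, 3, 5, 7, 9]
```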
-
I have created new JSON files according to my requirements:
* `training.json`
* `test.json`
The model trains using `training.json` but gives an error while calculating val_loss using `test.json`.
I …
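A common cause of this is a schema mismatch between the two files. As a first diagnostic (a generic sketch assuming a hypothetical file layout, not tied to this repo's loader), one can check that both files expose the same record keys:

```python
import json

# Sanity check: verify test.json uses the same record schema as
# training.json before debugging the model itself. The "data" key
# is an assumed layout; adapt to the actual file structure.
def record_keys(path):
    with open(path) as f:
        data = json.load(f)
    records = data if isinstance(data, list) else data.get("data", [])
    return {frozenset(r.keys()) for r in records}

print("schema match:", record_keys("training.json") == record_keys("test.json"))
```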
-
### System Info
- `transformers` version: 4.41.2
- Platform: Linux-5.15.0-1044-nvidia-x86_64-with-glibc2.35
- Python version: 3.10.0
- Huggingface_hub version: 0.23.0
- Safetensors version: 0.4.2…
-
Thank you for your excellent work! I am very interested in it and am currently using multiple GPUs for distributed training. As a beginner, I would like to ask whether it is normal for the number of…
-
I'm working on the C version of the code in preparation for (#40).
With **no** code modifications to llm.c, I observe the following:
- `test_gpt2` works successfully and the loss matches
- `train_g…
-
I was wondering why, in the finetune.py file, you've set update_freq to 24/NUM_GPU.
```python
cmd.append("+optimization.update_freq='[" + str(int(24/NUM_GPU)) + "]'")
```
In the wav2vec Readme …
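My guess (an assumption on my part, not something stated in the README excerpt above) is that update_freq is a gradient-accumulation factor, so this keeps the effective batch size constant: update_freq × NUM_GPU stays at 24 regardless of GPU count. A small sketch of that arithmetic:

```python
# Assumption: effective batch = per_gpu_batch * NUM_GPU * update_freq,
# so update_freq = 24 / NUM_GPU holds NUM_GPU * update_freq at 24
# (for GPU counts that divide 24).
for num_gpu in (1, 2, 4, 8, 24):
    update_freq = int(24 / num_gpu)
    print(num_gpu, update_freq, num_gpu * update_freq)  # last column is 24
```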
-
Hello,
When I start multi-GPU training, I run the following command:
python -m torch.distributed.launch --nproc_per_node=2 train.py --split eigen_zhou --learning_rate 1e-4 --height 320 --width 1024 …
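For context, the launcher spawns one process per GPU and hands each a rank. A minimal sketch of the generic DDP boilerplate train.py is expected to contain on its side (not this repo's actual code):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun (or torch.distributed.launch with --use_env) sets LOCAL_RANK
# and the rendezvous variables read by init_process_group.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(16, 1).cuda(local_rank)  # stand-in for the real model
model = DDP(model, device_ids=[local_rank])
```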
-
Hi, I have a problem fine-tuning sgpt-bloom-7b1-msmarco because of an OOM error. Could you please share how you do contrastive fine-tuning on bloom-7b1? (I think distributed training is needed, but I fa…
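Not the authors' recipe, but for orientation: contrastive fine-tuning in this line of work commonly uses in-batch negatives with an InfoNCE-style cross-entropy objective. A minimal sketch of that loss, assuming paired query/passage embeddings from the model:

```python
import torch
import torch.nn.functional as F

# Sketch of an in-batch-negatives contrastive (InfoNCE) loss; the
# temperature and embedding sizes are illustrative, not the repo's.
def contrastive_loss(q, p, temperature=0.05):
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.t() / temperature                    # (B, B) similarities
    labels = torch.arange(q.size(0), device=q.device)   # diagonal = positives
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```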