-
Thank you for publishing the paper. I hope to get your answers to the following questions:
Normally, the training speed will decline as the number of GPUs increases. However, in the paper, with the …
-
Wonderful work!
May I know its compatibility with the ZeRO mechanism? E.g., the torch ZeroRedundancyOptimizer, DeepSpeed ZeRO-1 to ZeRO-3, and FairScale FSDP. Because I noticed that QLoRA relies on particula…
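For context, a minimal sketch of the torch ZeroRedundancyOptimizer mentioned above (not tied to QLoRA; the model and hyperparameters are placeholders):
```
# Minimal ZeroRedundancyOptimizer sketch; assumes launch via torchrun so a
# distributed process group can be initialized.
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer

dist.init_process_group("nccl")
model = torch.nn.Linear(1024, 1024).cuda()
# Each rank keeps only its shard of the AdamW state (ZeRO-1-style sharding).
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.AdamW,
    lr=1e-4,
)
```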
-
### 🐛 Describe the bug
When trying to fine-tune flan-t5-large with the `Seq2SeqTrainer` module, also passing `fsdp_transformer_layer_cls_to_wrap="T5Block"` and `fsdp="full_shard auto_wrap"`, I got at fi…
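For reference, a hedged sketch of that FSDP setup in `transformers` (the dataset and output directory are placeholders, not from the report):
```
# Sketch of the reported configuration: Seq2SeqTrainer + HF-managed FSDP.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
args = Seq2SeqTrainingArguments(
    output_dir="out",                               # placeholder
    fsdp="full_shard auto_wrap",                    # shard params/grads/optimizer state
    fsdp_transformer_layer_cls_to_wrap="T5Block",   # wrap each T5Block as one FSDP unit
    per_device_train_batch_size=1,
)
trainer = Seq2SeqTrainer(model=model, args=args, tokenizer=tokenizer)  # train_dataset omitted here
```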
-
For deep learning, when the model is large, model creation and initialization on the host device can take a tremendous amount of time and sometimes cause host OOM. The existing [torchdistx](https://github.com/py…
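For illustration, a minimal sketch of torchdistx-style deferred initialization (the module here is a stand-in for a large model):
```
# Construct the module without allocating real storage: deferred_init records
# the init operations, and materialize_module replays them on demand.
import torch
from torchdistx.deferred_init import deferred_init, materialize_module

model = deferred_init(torch.nn.Linear, 50_000, 50_000)  # no ~10 GB host allocation yet
# Later, e.g. after sharding decisions are made, allocate and initialize for real.
materialize_module(model)
```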
-
## 🐛 Bug
Got a RuntimeError when training a transformer from scratch under the `translation_multi_simple_epoch` task with fully sharded data parallel (FSDP).
### To Reproduce
Steps to reproduce the b…
-
## 🐛 Bug
There seems to be a discrepancy (in addition to https://github.com/pytorch/xla/issues/3718) in how `torch.nn.Linear` (`torch.nn.functional.linear`) is implemented and dispatched between th…
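As an illustration of the kind of check such a discrepancy invites, a hedged sketch comparing `torch.nn.functional.linear` against an explicit matmul-plus-bias on an XLA device (shapes are placeholders):
```
# Compare F.linear with a hand-written matmul + bias; a dispatch difference
# between backends can surface as a numerical mismatch here.
import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(4, 8, device=device)
w = torch.randn(16, 8, device=device)
b = torch.randn(16, device=device)

out_linear = F.linear(x, w, b)   # backend may lower this to a fused kernel
out_manual = x @ w.t() + b       # explicit reference computation
print(torch.allclose(out_linear.cpu(), out_manual.cpu(), atol=1e-5))
```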
-
Is it possible to fine-tune the 7B model using 8×3090?
I had set:
```
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
```
but still got OOM:
torch.cuda.OutOfMemoryError: C…
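A batch size of 1 alone does not shard the model itself; as a hedged sketch, the settings below (from `transformers`' TrainingArguments, with illustrative values) are the usual levers for fitting a 7B model on 24 GB cards:
```
# Memory-saving levers that typically matter more than batch size here.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                 # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,   # recover effective batch size
    gradient_checkpointing=True,      # trade recompute for activation memory
    bf16=True,                        # Ampere cards (3090) support bfloat16
    fsdp="full_shard auto_wrap",      # shard params/grads/optimizer state across the 8 GPUs
)
```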
-
### 🐛 Describe the bug
I have a model that contains some params that need to be ignored (otherwise `flat_param` will raise an error); the construction code is like:
```
not_trainable = []
…
```
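As a minimal sketch of one way to keep such parameters out of `flat_param`, FSDP's `ignored_modules` argument (the toy model and the choice of ignored module are illustrative, not from this report):
```
# Parameters of ignored modules are left out of FSDP's flat_param entirely.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")  # assumes launch via torchrun

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.body = torch.nn.Linear(16, 16)
        self.frozen_head = torch.nn.Linear(16, 4)  # stands in for the not-trainable part
        self.frozen_head.requires_grad_(False)

model = ToyModel().cuda()
fsdp_model = FSDP(model, ignored_modules=[model.frozen_head])
```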
-
Recently, I have experimented with DPO training for Vietnamese. I start with a strong SFT model, which is [vinai/PhoGPT-4B-Chat](https://huggingface.co/vinai/PhoGPT-4B-Chat), and follow the method describe…
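For context, a hedged sketch of such a DPO run with TRL's `DPOTrainer` (the preference dataset and hyperparameters are placeholders, and exact keyword names vary across TRL versions):
```
# DPO fine-tuning sketch starting from the SFT checkpoint named above.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("vinai/PhoGPT-4B-Chat")
tokenizer = AutoTokenizer.from_pretrained("vinai/PhoGPT-4B-Chat")
# Placeholder file with prompt/chosen/rejected columns.
train_dataset = load_dataset("json", data_files="prefs.jsonl")["train"]

config = DPOConfig(output_dir="phogpt-dpo", beta=0.1)  # beta scales the implicit KL penalty
trainer = DPOTrainer(
    model=model,                 # a frozen reference copy is created internally
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
)
trainer.train()
```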