-
[rank1]: Traceback (most recent call last):
[rank1]: File "/storage/garlin/deep_learning/finetune-Qwen2-VL/finetune_distributed.py", line 200, in
[rank1]: train()
[rank1]: File "/storage/g…
-
## 🐛 Bug
Got errors when loading the mBART.cc25 pretrained model for fine-tuning on `translation_multi_simple_epoch` with FSDP.
### To Reproduce
Steps to reproduce the behavior (**always include th…
-
### 🐛 Describe the bug
I used Hugging Face training code.
I found that during the backward pass of FSDP training, the AllGather kernel doesn't overlap with the CatArrayBatchedCopy kernel, and I don't know why.
s…
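Overlap of the backward-pass AllGathers with compute is governed by FSDP's prefetch settings, so that is usually the first knob checked in this situation. A minimal sketch, assuming a plain FSDP wrap launched via torchrun; the toy model is a placeholder, and whether this restores the missing overlap is an assumption:
```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, BackwardPrefetch

# Assumes a torchrun launch so the rank/world-size env vars are set.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Placeholder model; the issue's actual Hugging Face model is not shown.
model = torch.nn.Linear(1024, 1024).cuda()

# BACKWARD_PRE schedules the next layer's AllGather while the current
# layer's gradients are still being computed, which is what produces
# communication/compute overlap in the backward pass.
fsdp_model = FSDP(
    model,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    limit_all_gathers=True,
)
```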
-
### 🐛 Describe the bug
Torch does not allow getting a FULL_STATE_DICT with 2D FSDP + TP. However, if I remove the checks here:
https://github.com/pytorch/pytorch/blob/3f62b05d31d4b29d60874b05adc0e5aedbad3722/to…
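For reference, this is the plain 1D-FSDP full-state-dict path that the linked check blocks once TP is added; a minimal sketch, assuming a torchrun launch (the toy model and save path are placeholders):
```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    FullStateDictConfig,
)

dist.init_process_group("nccl")  # assumes a torchrun launch
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Plain 1D FSDP wrap; the 2D FSDP + TP case is the one the check rejects.
fsdp_model = FSDP(torch.nn.Linear(16, 16).cuda())

cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(fsdp_model, StateDictType.FULL_STATE_DICT, cfg):
    state = fsdp_model.state_dict()

if dist.get_rank() == 0:
    torch.save(state, "full_model.pt")  # hypothetical output path
```
Newer releases also expose `torch.distributed.checkpoint.state_dict.get_model_state_dict` with `StateDictOptions(full_state_dict=True)`, which is presumably the intended route for 2D meshes, though that is an assumption rather than something the issue states.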
-
The default model variant is "7b":
https://github.com/foundation-model-stack/fms-fsdp/blob/65b0ea670fa375bb0f7f6a285e7229bb96ebdd0f/fms_fsdp/config/training.py#L8
but it is not in the supported wh…
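A config-side guard would surface this mismatch immediately; a purely hypothetical sketch (`SUPPORTED_VARIANTS`, `TrainConfig`, and the variant names are illustrative, not fms-fsdp's actual code):
```python
from dataclasses import dataclass

# Hypothetical registry of supported variants; fms-fsdp keeps its own
# mapping of variant names to model configurations.
SUPPORTED_VARIANTS = {"llama2_7b", "llama2_13b", "llama2_70b"}


@dataclass
class TrainConfig:
    # Mirrors the linked default; whether "7b" is a key in the real
    # registry is exactly what the issue questions.
    model_variant: str = "7b"

    def __post_init__(self) -> None:
        if self.model_variant not in SUPPORTED_VARIANTS:
            raise ValueError(
                f"model_variant={self.model_variant!r} not in "
                f"{sorted(SUPPORTED_VARIANTS)}"
            )


cfg = TrainConfig(model_variant="llama2_7b")  # passes; the default would not
```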
-
As in the title... I spent a bit of time debugging it but haven't figured out the cause yet. E.g. running
```
tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full fsdp_cpu_…
```
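For context, if the truncated flag is the recipe's FSDP CPU-offload toggle (an assumption, since the name is cut off), the PyTorch-level setting it corresponds to looks roughly like this:
```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

dist.init_process_group("nccl")  # assumes a torchrun-style launch
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Placeholder model; the recipe's Llama 2 7B model is not built here.
model = torch.nn.Linear(4096, 4096).cuda()

# offload_params=True keeps the sharded parameters (and gradients) on CPU,
# moving each wrapped module's shard to GPU only around its forward/backward.
fsdp_model = FSDP(model, cpu_offload=CPUOffload(offload_params=True))
```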
-
This commit: https://github.com/pytorch/pytorch/commit/a8329676273ac12f1fadfbcdd19c500d84998345
Released in torch 2.1.0, it breaks this: https://github.com/facebookresearch/audiocraft…
-
Hello, I found a strange loss during training, as follows.
![image](https://github.com/user-attachments/assets/3732521d-d4c1-4378-9d7c-247254c068d1)
The loss in the first step is normal, but the los…
-
![image](https://github.com/AnswerDotAI/fsdp_qlora/assets/77484083/03335c76-e593-4534-9afa-84f16ff05007)
How can I fix this?
-
### 🚀 The feature, motivation and pitch
Hi PyTorch maintainers,
I am currently engaged in training multiple large language models (LLMs) sequentially on a single GPU machine, utilizing FullShard…
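For context on what the request is trying to avoid, here is a minimal sketch of sequential FSDP runs with manual teardown between models, assuming a torchrun launch; the stand-in models and the adequacy of this cleanup are assumptions, not part of the pitch:
```python
import gc
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def train_one(build_model):
    """Wrap one model in FSDP, train it, then release its memory."""
    model = FSDP(build_model().cuda())
    # ... the actual training loop for this model would run here ...
    del model
    gc.collect()
    torch.cuda.empty_cache()  # return cached blocks before the next model


if __name__ == "__main__":
    dist.init_process_group("nccl")  # assumes a torchrun launch
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # Hypothetical stand-ins for the LLMs trained back to back.
    for build in (lambda: torch.nn.Linear(1024, 1024),
                  lambda: torch.nn.Linear(2048, 2048)):
        train_one(build)

    dist.destroy_process_group()
```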