-
### 🐛 Describe the bug
Create a simple distributed model.
Wrap the model with FSDP.
Using a stateful optimizer such as Adam(W), run without CPUOffload and profile/time.
Then run with CPUOffload and see th…
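A minimal sketch of these steps, with a placeholder model and illustrative sizes/step counts rather than the original reproducer:
```python
# Minimal sketch of the repro steps above; the model, sizes, and step counts
# are illustrative placeholders, not the original reproducer.
import time
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

dist.init_process_group("nccl")  # assumes a torchrun-style launch
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

def run(offload: bool) -> float:
    model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).cuda()
    model = FSDP(model, cpu_offload=CPUOffload(offload_params=offload))
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)  # stateful optimizer
    x = torch.randn(8, 4096, device="cuda")
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(10):
        optim.zero_grad()
        model(x).sum().backward()
        optim.step()
    torch.cuda.synchronize()
    return time.time() - start

for offload in (False, True):
    elapsed = run(offload)  # every rank runs both configs to keep collectives aligned
    if dist.get_rank() == 0:
        print(f"cpu_offload={offload}: {elapsed:.2f}s")
```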
-
### 🐛 Describe the bug
I tried to use FSDP offload to run the Vicuna training;
however, after I ran this command, the code seems to stand still for 3-4 hours and prints nothing, and th…
-
### 🚀 The feature, motivation and pitch
The following are features that should be checked / hardened in order to roll out fully_shard as an alternative to class-based FSDP:
- [ ] Test with ShardedGr…
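For reference, a rough sketch of how the composable API sits next to the class-based wrapper (the `fully_shard` import path has moved between PyTorch releases; `torch.distributed._composable.fsdp` is assumed here):
```python
# Rough sketch only: contrasts the class-based wrapper with composable
# fully_shard. The fully_shard import path varies across PyTorch releases;
# torch.distributed._composable.fsdp is assumed here.
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed._composable.fsdp import fully_shard

dist.init_process_group("nccl")  # assumes a torchrun-style launch
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

def make_model() -> nn.Module:
    return nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024)).cuda()

# Class-based API: returns a new FSDP module that wraps the user's model.
wrapped = FSDP(make_model())

# Composable API: shards the module in place and keeps its original class,
# so checkpointing code and isinstance checks still see nn.Sequential.
model = make_model()
for layer in model:
    fully_shard(layer)  # each submodule becomes its own sharded group
fully_shard(model)      # finally shard the root
```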
-
### System Info
transformers: '4.45.1'
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### 🐛 Describe the bug
I have fine-tuned `Llama-3.2-11B-Vision-Instruct` fo…
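(As context, a hedged sketch of loading such a fine-tuned checkpoint under transformers 4.45.x; the local path is a placeholder, and the actual failing step is cut off above.)
```python
# Illustrative only: loading a fine-tuned Llama-3.2-11B-Vision-Instruct
# checkpoint with transformers 4.45.x; the local path is a placeholder.
import torch
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_path = "./llama-3.2-11b-vision-finetuned"  # hypothetical output dir
model = MllamaForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)
```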
-
Hi and thanks for the great resources.
I used "train-deploy-llama3.ipynb" and trained a similar Llama3 model as shown in the notebook.
I pushed my model to Hugging Face and now I want to use that …
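One way to pull the pushed checkpoint back down for local inference with plain transformers (a sketch only; the repo id is a placeholder, and this is independent of whatever deployment path the notebook itself uses):
```python
# Illustrative sketch: loading a fine-tuned Llama 3 checkpoint back from the
# Hugging Face Hub; "your-username/your-llama3-finetune" is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/your-llama3-finetune"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Hello, fine-tuned model!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```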
-
Unable to run torchrun --nnodes 1 --nproc_per_node 4 llama_finetuning.py --enable_fsdp --use_peft --peft_method lora --model_name /path_of_model_folder/7B --pure_bf16 --output_dir Path/to/save/PEFT/…
-
## 🐛 Bug
When input sequences get longer, Thunder tends to use more memory than eager and torch.compile.
Let's take litgpt's `stablecode-completion-alpha-3b` as an example, whose sequen…
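A hedged sketch of the kind of peak-memory comparison being described, using a small stand-in layer rather than the actual litgpt `stablecode-completion-alpha-3b` model:
```python
# Illustrative memory comparison between eager, torch.compile, and Thunder on
# a small stand-in transformer layer with a long input sequence; the report
# itself uses litgpt's stablecode-completion-alpha-3b.
import torch
import torch.nn as nn
import thunder  # lightning-thunder

def peak_mem(model, x) -> float:
    torch.cuda.reset_peak_memory_stats()
    model(x).sum().backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**30

base = nn.TransformerEncoderLayer(d_model=2048, nhead=16, batch_first=True).cuda()
x = torch.randn(1, 8192, 2048, device="cuda", requires_grad=True)  # long sequence

print("eager        :", peak_mem(base, x), "GiB")
print("torch.compile:", peak_mem(torch.compile(base), x), "GiB")
print("thunder.jit  :", peak_mem(thunder.jit(base), x), "GiB")
```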
-
I followed the steps in the README, but I get an empty state dict. Here is the code and the output:
code:
```python
trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
…
```
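If the empty dict comes from saving while parameters are still sharded (e.g. under FSDP, one common cause, though the truncated snippet does not show the parallelism setup), a rough sketch of gathering a full state dict on rank 0 before saving:
```python
# Hedged sketch only: gather a full (unsharded) state dict on rank 0 before
# saving; `trainer` is the HF Trainer from the snippet above, and the output
# filename is a placeholder. Only relevant if training actually used FSDP.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    FullStateDictConfig,
)

cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(trainer.model, StateDictType.FULL_STATE_DICT, cfg):
    state_dict = trainer.model.state_dict()

if dist.get_rank() == 0:
    torch.save(state_dict, "pytorch_model.bin")  # placeholder filename
```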
-
### 🚀 The feature, motivation and pitch
**Background**
DistributedDataParallel (DDP) uses `Reducer` to bucket and issue `allreduce` calls. The main entry point of `Reducer` is through the gradient …
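For context, the existing hook point at this layer is the DDP communication hook, which intercepts the `Reducer`'s per-bucket `allreduce`; a minimal sketch (background only, not the proposed feature):
```python
# Minimal sketch of today's DDP comm-hook surface, which intercepts the
# Reducer's per-bucket allreduce; shown here only as context for the proposal.
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

dist.init_process_group("nccl")  # assumes a torchrun-style launch
model = DDP(nn.Linear(1024, 1024).cuda(), bucket_cap_mb=25)

# Replace the default per-bucket allreduce with an fp16-compressed one.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)

out = model(torch.randn(8, 1024, device="cuda"))
out.sum().backward()  # gradient hooks fire per bucket and call the comm hook
```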