-
While training the 3.6B and 7B models with FSDP we experienced a loss spike as the model was moving towards convergence.
Things that we should check in our implementation:
- [x] Co…
-
I am trying to use the new FSDP feature in torch 2.1 where requires_grad does not need to be uniform across a block.
``` python
model = AutoModelForCausalLM.from_pretrained(
'some_lla…
```
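A minimal sketch of the pattern being described (the model checkpoint and the choice of frozen parameters are placeholders, not the poster's actual setup): with `use_orig_params=True`, torch 2.1's FSDP accepts a wrapped block whose parameters have mixed `requires_grad`.

``` python
# Sketch only: freeze part of a block and wrap with FSDP using use_orig_params=True,
# which since torch 2.1 allows non-uniform requires_grad within a wrapped unit.
# Assumes launch via torchrun so the process-group environment variables are set.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder checkpoint

# Freeze, e.g., the embedding weights while leaving the rest trainable.
for name, param in model.named_parameters():
    if "embed" in name:
        param.requires_grad = False

model = FSDP(
    model,
    use_orig_params=True,  # needed for mixed requires_grad inside one block
    device_id=torch.cuda.current_device(),
)
```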
-
## ❓ Questions and Help
### Before asking:
1. search the issues.
2. search the docs.
#### What is your question?
I am trying to train a transformer with a 100-layer encoder and a 100-layer decoder. Be…
-
## 🚀 Feature
[Documentation says](https://lightning.ai/docs/pytorch/latest/advanced/compile.html#limitations) that torch.compile is not supported with distributed training right now. Since torch co…
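For reference, plain PyTorch can already combine torch.compile with DDP, which is roughly the behaviour this request asks Lightning to expose. A minimal sketch (names and shapes are illustrative, and it assumes a torchrun launch):

``` python
# Sketch: torch.compile applied to a DDP-wrapped module in plain PyTorch.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model, device_ids=[local_rank])
model = torch.compile(model)  # compile the DDP-wrapped module

x = torch.randn(8, 1024, device="cuda")
loss = model(x).sum()
loss.backward()
```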
-
### 🚀 The feature, motivation and pitch
Currently FSDP rejects tensor parameters with dtype uint8. is_floating_point() only allows one of (torch.float64, torch.float32, torch.float16, and …
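A quick illustration of the check in question: uint8 tensors are not floating point, so any parameter validation based on is_floating_point() rejects them.

``` python
import torch

print(torch.zeros(4, dtype=torch.uint8).is_floating_point())    # False
print(torch.zeros(4, dtype=torch.float16).is_floating_point())  # True
print(torch.zeros(4, dtype=torch.bfloat16).is_floating_point()) # True
```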
-
### Reminder
- [X] I have read the README and searched the existing issues.
### System Info
I have installed all the requirements for Qwen2-vl.
### Reproduction
train_mm_proj_only:True
Hello, I wan…
-
It is mentioned in the README that candle supports multi-GPU inference, using NCCL under the hood. How can this be implemented? I wonder if there is any available example to look at.
Also, I know PyT…
-
FSDP2 provides a smaller memory footprint, compatibility with torch.compile, and more flexibility thanks to per-parameter sharding. Does huggingface have plans to support FSDP2?
https://github.com/pytorch/to…
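For context, FSDP2's per-parameter sharding is driven by the fully_shard API rather than a wrapper class. A minimal sketch; the import path and defaults vary across torch releases, so treat this as illustrative rather than the exact Hugging Face integration:

``` python
# Sketch of FSDP2-style per-parameter sharding; the import lives under
# torch.distributed._composable.fsdp in earlier releases and
# torch.distributed.fsdp.fully_shard in newer ones.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=4
).cuda()

# Shard each layer, then the root module; parameters are sharded individually.
for layer in model.layers:
    fully_shard(layer)
fully_shard(model)
```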
-
I am currently using FSDP (Fully Sharded Data Parallel) with the Llama 2 70B model. Training has begun, but I encounter an error when attempting to save the checkpoint at e…
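A common workaround sketch for checkpointing large FSDP models (not the poster's exact code): gather a full state dict on rank 0 with CPU offload before saving.

``` python
# Sketch: save a full state dict from an FSDP-wrapped model on rank 0 only,
# offloading to CPU to avoid gathering a 70B-parameter model on one GPU.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)

def save_full_checkpoint(model: FSDP, path: str) -> None:
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state_dict = model.state_dict()
    if dist.get_rank() == 0:
        torch.save(state_dict, path)
```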
-
(for later)