-
### 🚀 The feature, motivation and pitch
When training with FSDP in fp16 mixed precision, we need to be aware of fp16's reduced dynamic range and avoid overflow/underflow. If we sum all gradients for a large world …
-
The [`default_auto_wrap_policy`](https://github.com/facebookresearch/fairscale/blob/main/fairscale/nn/wrap/auto_wrap.py#L61) function has a parameter `exclude_wrap_modules` for excluding module types …
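To make the mechanism concrete, here is a self-contained toy policy that mirrors the *shape* of such an auto-wrap policy: wrap a module only if it is large enough and its type is not in the exclusion set. This is an illustrative mock, not fairscale's actual implementation; the class names and thresholds are invented, and in real use you would bind `exclude_wrap_modules` with `functools.partial` and pass the result as `auto_wrap_policy` to the wrapper.

```python
import functools

# Stand-in module classes; in practice these would be torch.nn module types.
class ModuleList: ...
class Linear: ...

def toy_auto_wrap_policy(module, recurse, unwrapped_params,
                         min_num_params=100, exclude_wrap_modules=None):
    """Toy policy: wrap when the module has enough parameters and its
    type is not excluded. `recurse=True` asks whether to descend further."""
    exclude = exclude_wrap_modules or (ModuleList,)
    if recurse:
        return True  # always keep recursing into children
    is_large = unwrapped_params >= min_num_params
    return is_large and not isinstance(module, exclude)

# Bind the exclusion set once; the resulting callable is what you would
# hand to the wrapper as auto_wrap_policy=policy.
policy = functools.partial(toy_auto_wrap_policy,
                           exclude_wrap_modules=(ModuleList,))

print(policy(Linear(), recurse=False, unwrapped_params=500))      # True
print(policy(ModuleList(), recurse=False, unwrapped_params=500))  # False
```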
-
# 🐛 Bug
Using xformers.memory_efficient_attention with FSDP and torch.compile fails when using bfloat16, but works when using float32. It's unclear to me if this is an xformers bug, an FSDP bug, or…
-
I replicated the pythia28 experiments on hh (Anthropic/hh-rlhf) using the open-source code. Here are some of the experimental results:
**SFT1**:
~~~
python -u train.py exp_name=sft gradient_ac…
~~~
-
I am finetuning Vicuna using 4 × A100-80G GPUs. I ran into a problem after training finished:
```
{'loss': 1.3641, 'learning_rate': 4.815273327803183e-08, 'epoch': 0.97}
{'loss': 1.35, 'learning_ra…
```
-
Hi! I'm using two A100 GPUs, each with 40GB of memory. This is the GPU memory utilization for my training: I'm reaching over 90% memory utilization on both A100 GPUs.
![image](https://github.…
-
Hello, Ashwinee Panda
I was very impressed with your work and wanted to thank you for the excellent contribution. I am currently following the tutorial using the openbookqa task to finally experime…
-
### 🐛 Describe the bug
In HSDP, the ranks within a replication group are equivalent in terms of their model and optimizer shards. In other words, any rank in a replication group can be selected to …
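The equivalence of ranks within a replication group can be sketched with plain index arithmetic. In the (assumed) HSDP layout below, `world_size` ranks form a grid of shard groups (FSDP within a replica) crossed with replication groups (DDP across replicas); the ranks in one replication group all hold the same shard index, so any one of them could serve it at checkpoint time. Function and variable names here are invented for illustration.

```python
def hsdp_groups(world_size, shard_group_size):
    """Compute shard groups (contiguous blocks of ranks, sharded within)
    and replication groups (same shard index across replicas)."""
    assert world_size % shard_group_size == 0
    num_replicas = world_size // shard_group_size
    # Shard group r: the ranks that together hold one full model replica.
    shard_groups = [
        list(range(r * shard_group_size, (r + 1) * shard_group_size))
        for r in range(num_replicas)
    ]
    # Replication group i: the ranks that own shard i in every replica;
    # these ranks are interchangeable for saving shard i.
    replica_groups = [
        [r * shard_group_size + i for r in range(num_replicas)]
        for i in range(shard_group_size)
    ]
    return shard_groups, replica_groups

shard, repl = hsdp_groups(world_size=8, shard_group_size=4)
print(shard)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(repl)   # [[0, 4], [1, 5], [2, 6], [3, 7]]
```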
-
Hello, I am trying to do distributed training across 2 separate machines. Can anyone point me to a tutorial or demo for this? The configs created using accelerate are:
_Machine 1_:
c…
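For reference, a minimal two-machine accelerate config might look like the following. The key names come from accelerate's YAML config format, but the IP, port, and process counts below are placeholders, not the poster's elided values:

```yaml
# Machine 1 (rank 0) -- the second machine differs only in machine_rank: 1
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
main_process_ip: 10.0.0.1   # address of machine 0, reachable from machine 1
main_process_port: 29500
num_machines: 2
num_processes: 2            # total processes across BOTH machines
mixed_precision: 'no'
```

With matching configs on both machines, `accelerate launch train.py` is then run on each machine; the main-process IP/port must point at the rank-0 machine from everywhere.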
-
### 🐛 Describe the bug
we are using multi-node training with FSDP, and we got the following error during checkpointing through `torch/distributed/checkpoint/state_dict_saver.py`:
```
File "/opt/m…
```