-
## ❓ Questions and Help
I tried to follow the tutorial to change my code to use FSDP; however, I do not know how to resume training properly.
Every time I resume, it seems to restart from scra…
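Resuming usually restarts from scratch when only the model weights are restored. Below is a minimal sketch of a full checkpoint round-trip (model, optimizer, epoch counter); the helper names are illustrative, not from the original post. For an FSDP-wrapped model the state-dict calls would additionally be scoped with `FSDP.state_dict_type`, noted in the comments.

```python
import torch
import torch.nn as nn

# Hypothetical helpers: save and restore everything needed to resume,
# not just the model weights.
def save_checkpoint(path, model, optimizer, epoch):
    # For an FSDP-wrapped model, gather a full state dict first, e.g.:
    #   with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
    #       model_state = model.state_dict()
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        path,
    )

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])  # restores step counts, momentum, etc.
    return ckpt["epoch"]  # resume the epoch loop from here
```

The key point is that the optimizer state and the epoch/step counters must be restored alongside the weights, otherwise training effectively starts over.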
-
Hello,
The `.fit()` method of the Trainer is [missing the `optimizers` parameter](https://github.com/mosaicml/composer/blob/4c5ba954e3007ce2af6eb3003efa9d76de38c959/composer/trainer/trainer.py#L1611…
-
Hi! I'm using two A100 GPUs, each with 40GB of memory. This is the GPU memory utilization for my training. I'm reaching over 90% memory utilization on both A100 GPUs.
![image](https://github.…
-
### 🐛 Describe the bug
Keep getting this error.
```
Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cpu! (when checking argument for argument state_steps…
-
Hi, I recently upgraded to PyTorch 1.12 and have had issues loading a saved optimizer state using FSDP; the issue seems to be something that is addressed in the comments here -
https://github.com/…
-
script used to finetune `lmsys/vicuna-7b-v1.5`
```
CUDA_VISIBLE_DEVICES="7,6,5,4,3,2" torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train_mem.py \
--model_name_or_path lmsys/…
-
### 🐛 Describe the bug
A runtime error occurs when attempting to load the state dict of an FSDP model under `torch.inference_mode()`:
```py
import os
import torch.cuda
import torch.nn as nn…
-
There are two use cases for [`torch_xla`](https://github.com/pytorch/xla) with the PyTorch backend in Keras, namely:
1. Implement the [distribution API](https://github.com/keras-team/keras/blob/048416…
-
As titled, we should get rid of the `with_comms` decorator: https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/distributed/_tensor/common_dtensor.py#L355
Instead, init and destroy th…
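The suggested replacement can be sketched as a standard `unittest` fixture that initializes the process group in `setUp` and tears it down in `tearDown`. This is a single-process, gloo-on-CPU stand-in for the real multi-rank DTensor test harness; the class name and port are illustrative.

```python
import os
import unittest
import torch.distributed as dist

class ProcessGroupTestCase(unittest.TestCase):
    """Sketch: replace the with_comms decorator with setUp/tearDown."""

    def setUp(self):
        # env:// rendezvous needs these; real tests would get them from the launcher.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29501")
        dist.init_process_group("gloo", rank=0, world_size=1)

    def tearDown(self):
        # Destroy the group after every test so state never leaks between tests.
        dist.destroy_process_group()

    def test_world_size(self):
        self.assertEqual(dist.get_world_size(), 1)
```

Moving init/destroy into the fixture means every test method gets a fresh process group without each test having to opt in via a decorator.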
-
Similar to DDP and ZeroRedundancyOptimizer, FSDP can support overlapping the optimizer step with the backward pass by calling functional optimizers.
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-…