-
When I use four GPUs to train the model, I hit this error. Can anybody help me solve it? Thank you very much.
```
WARNING:__main__:
*****************************************
Setting OMP_…
-
### 🐛 Describe the bug
When running training with the `fsdp` strategy in Lightning on hundreds of GPUs, the first iteration takes extremely long (minutes...).
The culprits are these two N^2 checks
http…
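A hypothetical stand-in (not the actual PyTorch code at the truncated link) for why an all-pairs check hurts at this scale: comparing every rank's value against every other rank's is O(N^2) in the number of workers, while a set-based pass answers the same "are they all equal?" question in O(N).

```python
def all_equal_quadratic(values):
    # O(N^2): compare every pair, as an all-pairs consistency check does.
    # At hundreds of ranks this is hundreds of thousands of comparisons.
    return all(a == b for a in values for b in values)

def all_equal_linear(values):
    # O(N): a single pass over the values answers the same question.
    return len(set(values)) <= 1

# e.g. 512 GPUs each reporting the same (strategy, config) tuple
ranks = [("fsdp", 42)] * 512
assert all_equal_quadratic(ranks) and all_equal_linear(ranks)
```

The two functions agree on every input; only the cost differs, which is why the quadratic version dominates the first iteration as the GPU count grows.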
-
Example:
```
import torch
def f(x):
    buf = torch.zeros(2)
    torch.ops.fsdp.set_(x, buf)
    return x * x

x = torch.zeros(2, requires_grad=True)
out = torch.compile(f, backend="aot_eager…
-
torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
llava/train/train_mem.py \
--model_name_or_path /path/to/checkpoint_llava_med \
--data_path /path/to/your_dental_dataset.jso…
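For context on the launch command above: `torchrun` starts one process per GPU and publishes the process layout through environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`), which the training script reads before initializing the process group. A minimal sketch of that handshake (`launch_info` is an illustrative helper, not part of the repo):

```python
import os

def launch_info():
    # torchrun sets these for every worker it spawns; the defaults below
    # make the same script runnable as a single, non-distributed process.
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
    }
```

With `--nnodes=1 --nproc_per_node=8`, each of the eight processes sees `world_size == 8` and a distinct `local_rank` in `0..7`, which is typically used to pick its CUDA device.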
-
## 🐛 Bug
This might be related to the [old OOM issue](https://github.com/Lightning-AI/lightning-thunder/issues/474), but the models and the number of nodes are different, so I decided to create a new one.
We …
-
### 🐛 Describe the bug
I was playing with local FSDP checkpointing for resuming interrupted training runs and I encountered an unexpected behavior.
In the code below, I do the following:
- Crea…
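The resume pattern being exercised can be reduced to a simplified, single-process stand-in (the actual issue involves FSDP sharded checkpoints, which this sketch does not reproduce): persist the training state to disk, then restore it into a fresh run and continue from the same step.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    # Persist the full training state; for FSDP this would instead go
    # through a sharded state dict per rank.
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path):
    # Restore the state exactly as saved, so training resumes at the
    # same step with the same weights.
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
save_checkpoint(path, {"step": 1000, "weights": [0.1, -0.2, 0.3]})
resumed = load_checkpoint(path)
assert resumed["step"] == 1000
```

The expectation the issue tests is exactly this round-trip property: what you load must match what you saved, which is where the reported unexpected behavior appears in the FSDP case.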
-
### 🐛 Describe the bug
Calling `full_tensor()` returns incorrect tensors here.
I created a minimal model to test checkpoint saving. I tried using DCP as well, but that also gives incorrect ten…
-
Hi @Hprairie, I previously built mamba-2/hydra-based models, and I am now trying to replace the layers with your [bi-mamba2 module](https://github.com/Hprairie/Bi-Mamba2?tab=readme-ov-file#module-api)…
-
## 🐛 Bug
For a few models (Platypus-30B with FSDP zero3, Gemma7b with DDP, and vicuna-33b-v1.3 with FSDP zero3) we get a segmentation fault when trying to use fp8 with thunder_cudnn. When usi…
-
### 🚀 The feature, motivation and pitch
https://github.com/pytorch/pytorch/issues/75255 implemented the ability to ignore FSDP parameters at the module level, i.e. by passing in an `ignore_module` li…