-
When I use four GPUs to train the model, I hit this error. Can anybody help me solve it? Thank you very much.
```
WARNING:__main__:
*****************************************
Setting OMP_…
-
### 🐛 Describe the bug
When running training with the `fsdp` strategy in Lightning on hundreds of GPUs, the first iteration takes extremely long (minutes...).
The culprits are these two N^2 checks
http…
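A hypothetical stand-in (not the actual PyTorch code at the truncated link) for why an all-pairs check hurts at this scale: comparing every rank's value against every other rank's is O(N^2) in the number of workers, while a set-based pass answers the same "are they all equal?" question in O(N).

```python
def all_equal_quadratic(values):
    # O(N^2): compare every pair, as an all-pairs consistency check does.
    # At hundreds of ranks this is hundreds of thousands of comparisons.
    return all(a == b for a in values for b in values)

def all_equal_linear(values):
    # O(N): a single pass over the values answers the same question.
    return len(set(values)) <= 1

# e.g. 512 GPUs each reporting the same (strategy, config) tuple
ranks = [("fsdp", 42)] * 512
assert all_equal_quadratic(ranks) and all_equal_linear(ranks)
```

The two functions agree on every input; only the cost differs, which is why the quadratic version dominates the first iteration as the GPU count grows.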
-
Example:
```
import torch
def f(x):
    buf = torch.zeros(2)
    torch.ops.fsdp.set_(x, buf)
    return x * x

x = torch.zeros(2, requires_grad=True)
out = torch.compile(f, backend="aot_eager…
-
torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
llava/train/train_mem.py \
--model_name_or_path /path/to/checkpoint_llava_med \
--data_path /path/to/your_dental_dataset.jso…
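For context on the launch command above: `torchrun` starts one process per GPU and publishes the process layout through environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`), which the training script reads before initializing the process group. A minimal sketch of that handshake (`launch_info` is an illustrative helper, not part of the repo):

```python
import os

def launch_info():
    # torchrun sets these for every worker it spawns; the defaults below
    # make the same script runnable as a single, non-distributed process.
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
    }
```

With `--nnodes=1 --nproc_per_node=8`, each of the eight processes sees `world_size == 8` and a distinct `local_rank` in `0..7`, which is typically used to pick its CUDA device.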
-
## 🐛 Bug
This might be related to the [old OOM issue](https://github.com/Lightning-AI/lightning-thunder/issues/474), but the models and the number of nodes are different, so I decided to create a new one.
We …
-
### 🐛 Describe the bug
I was playing with local FSDP checkpointing for resuming interrupted training runs and I encountered an unexpected behavior.
In the code below, I do the following:
- Crea…
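The resume pattern being exercised can be reduced to a simplified, single-process stand-in (the actual issue involves FSDP sharded checkpoints, which this sketch does not reproduce): persist the training state to disk, then restore it into a fresh run and continue from the same step.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    # Persist the full training state; for FSDP this would instead go
    # through a sharded state dict per rank.
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path):
    # Restore the state exactly as saved, so training resumes at the
    # same step with the same weights.
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
save_checkpoint(path, {"step": 1000, "weights": [0.1, -0.2, 0.3]})
resumed = load_checkpoint(path)
assert resumed["step"] == 1000
```

The expectation the issue tests is exactly this round-trip property: what you load must match what you saved, which is where the reported unexpected behavior appears in the FSDP case.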
-
### 🐛 Describe the bug
Calling `full_tensor()` returns incorrect tensors here.
I created a minimal model to test checkpoint saving. I tried using DCP as well, but that also gives incorrect ten…
-
Hi @Hprairie, I previously built mamba-2/hydra-based models, and I am now trying to replace the layers with your [bi-mamba2 module](https://github.com/Hprairie/Bi-Mamba2?tab=readme-ov-file#module-api)…
-
## 🐛 Bug
For a few models (Platypus-30B with FSDP zero3, Gemma7b with DDP, and vicuna-33b-v1.3 with FSDP zero3) we get a segmentation fault when trying to use fp8 with thunder_cudnn. When usi…
-
### 🚀 The feature, motivation and pitch
https://github.com/pytorch/pytorch/issues/75255 implemented the ability to ignore FSDP parameters at the module level, i.e. by passing in an `ignore_module` li…