-
I use dMoE with DeepSpeed or FSDP. I find that at the beginning, the memory cost is about 33 GB. As training progresses, the occupied GPU memory increases little by little and finally exceeds 80 GB…
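A pattern like this is usually easiest to narrow down by logging allocator statistics every few hundred steps and seeing which part of the loop the growth tracks. A minimal sketch assuming a standard PyTorch training loop; `model`, `optimizer`, and `train_loader` in the commented loop are placeholders, not names from the report:

```python
import torch
import torch.distributed as dist

def log_cuda_memory(step, tag=""):
    # Allocator statistics for the current rank. Steady growth in "allocated"
    # across steps usually means tensors are being kept alive, e.g. losses
    # appended to a list with their autograd graph, or caches that never clear.
    alloc = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"[rank {rank}] step {step} {tag}: allocated={alloc:.2f} GiB "
          f"reserved={reserved:.2f} GiB peak={peak:.2f} GiB")

# Placeholder training loop showing where to call it:
#   for step, batch in enumerate(train_loader):
#       loss = model(**batch).loss
#       loss.backward()
#       optimizer.step()
#       optimizer.zero_grad(set_to_none=True)
#       if step % 200 == 0:
#           log_cuda_memory(step)
#           torch.cuda.reset_peak_memory_stats()
```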
-
### Reminder
- [X] I have read the README and searched the existing issues.
### System Info
Platform: Kaggle 2xT4
- `llamafactory` version: 0.8.4.dev0
- OS: Linux-5.15.154+-x86_64-with-glib…
-
## 🐛 Bug
If you try to train a model with the fully_sharded backend and use layer drop, training will hang. Each individual layer was also wrapped with FSDP in my particular case. It will be gre…
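For context on why it can hang (a hedged explanation, not something confirmed in the issue): when every layer is its own FSDP unit, each layer's forward issues an all-gather, so if layer drop lets ranks skip different layers the collectives stop matching and all ranks block. One way to keep them matched is to sample the drop pattern once and broadcast it, as in this sketch; `drop_prob` and the per-layer loop are placeholders:

```python
import torch
import torch.distributed as dist

def sample_layer_drop_mask(num_layers, drop_prob, device):
    # Sample the per-layer keep/drop decision on rank 0 and broadcast it, so
    # every rank runs exactly the same set of FSDP-wrapped layers. If ranks
    # sampled independently, their all-gathers would stop lining up and the
    # job would block inside a collective.
    mask = (torch.rand(num_layers, device=device) >= drop_prob).to(torch.uint8)
    if dist.is_initialized():
        dist.broadcast(mask, src=0)
    return mask.bool()

# Usage inside the model's forward (sketch; `self.layers` stands for the
# per-layer FSDP-wrapped modules):
#   mask = sample_layer_drop_mask(len(self.layers), drop_prob, x.device)
#   for keep, layer in zip(mask.tolist(), self.layers):
#       if keep:
#           x = layer(x)
```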
-
I'm trying what looks like the "Hello World" of this repo: Running the basic training on a Runpod community cloud `2 x RTX 4090, (128 vCPU 125 GB RAM)` configuration. Normally I'd play around with thi…
-
While training a speculator using the specu-train branch, I'm getting an OOM error when trying to load a checkpoint in HuggingFace format. The model_type is "gpt_megatron". The script works fine for…
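Without knowing the exact loading path in the specu-train branch, one thing that often avoids OOM while materializing a HuggingFace-format checkpoint is lazy loading with an explicit dtype. A hedged sketch; the path is a placeholder, and whether "gpt_megatron" resolves through `AutoModelForCausalLM` is an assumption:

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder path; the real checkpoint directory comes from the training setup.
ckpt_dir = "/path/to/hf_checkpoint"

# low_cpu_mem_usage avoids building a second full copy of the weights on the
# host while loading, and torch_dtype materializes them in half precision
# rather than fp32.
model = AutoModelForCausalLM.from_pretrained(
    ckpt_dir,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
```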
-
I haven't found a good multi-node best practice for FSDP. Have you tried it? Thank you in advance. :)
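One common multi-node pattern for FSDP is simply the single-node script launched with `torchrun` on every node, with rendezvous pointed at node 0. A minimal self-contained sketch; the toy model and hyperparameters are placeholders, not a recommendation from this repo:

```python
# Same script launched on every node, e.g. two nodes with 8 GPUs each:
#   torchrun --nnodes=2 --nproc_per_node=8 --node_rank=<0 or 1> \
#            --rdzv_backend=c10d --rdzv_endpoint=<node0-host>:29500 train_fsdp.py
import functools
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

class ToyBlock(nn.Module):
    # Stand-in block so the sketch is self-contained.
    def __init__(self, dim=1024):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

def main():
    # torchrun exports RANK, WORLD_SIZE, LOCAL_RANK and the rendezvous settings.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = nn.Sequential(*[ToyBlock() for _ in range(8)]).cuda()
    policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)
    model = FSDP(model, auto_wrap_policy=policy)

    # One dummy step to show the flow; a real run would iterate a DataLoader
    # built with DistributedSampler so each rank sees a different shard.
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(2, 16, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    opt.step()

if __name__ == "__main__":
    main()
```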
-
Bumps [torch](https://github.com/pytorch/pytorch) from 2.2.0 to 2.2.1.
Release notes (sourced from torch's releases):
PyTorch 2.2.1 Release, bug fix release. This release is meant to fix the following …
-
I have been trying to finetune LLaMA; at 7B size on 8 V100 GPUs it takes longer than a day with the original `lora.py` script. This seemed wrong, since training times of a few hours are often reported. To remedy…
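The rest of the report is truncated, but a first diagnostic for a run this slow is to time a few steps and confirm all eight GPUs are actually busy; a hedged sketch in which `train_step` and `batch` are placeholders for the caller's own step function and data:

```python
import time

import torch

def time_steps(train_step, batch, n_warmup=3, n_timed=10):
    # Time a handful of steps after warm-up. CUDA kernels launch
    # asynchronously, so synchronize before reading the clock.
    for _ in range(n_warmup):
        train_step(batch)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_timed):
        train_step(batch)
    torch.cuda.synchronize()
    per_step = (time.perf_counter() - start) / n_timed
    print(f"{per_step:.2f} s/step; also check nvidia-smi to see whether all 8 GPUs show load")
    return per_step
```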
-
### 🐛 Describe the bug
Instantiate a model, wrap the model in FSDP with an autowrap policy, then wrap that FSDP-wrapped model in torch.compile, then try to checkpoint, and you will get a stack trac…
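A hedged repro sketch of that sequence (FSDP auto-wrap, then `torch.compile`, then checkpointing via `state_dict()`/`torch.save`, which is one reading of "checkpoint" here), meant to be run under `torchrun`; the toy model is a placeholder, not the reporter's:

```python
import functools
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # 1) Instantiate a model and wrap it in FSDP with an auto-wrap policy.
    model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(4)]).cuda()
    policy = functools.partial(size_based_auto_wrap_policy, min_num_params=100_000)
    model = FSDP(model, auto_wrap_policy=policy)

    # 2) Wrap the FSDP-wrapped model in torch.compile.
    model = torch.compile(model)

    # 3) Run a step, then try to checkpoint; the reported stack trace appears here.
    loss = model(torch.randn(8, 1024, device="cuda")).sum()
    loss.backward()
    state = model.state_dict()
    if dist.get_rank() == 0:
        torch.save(state, "ckpt.pt")

if __name__ == "__main__":
    main()
```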
-
## 🐛 Bug
After training Llama-3-8b on 8 A100s for 10 iterations in eager mode, I printed the model weights:
```
torch_dist.barrier()
weights_after_training = benchmark.model.lm_head.weight[:10].…
```
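A self-contained variant of that check, as a hedged sketch rather than the benchmark harness above: snapshot a slice of `lm_head.weight` before training and compare it afterwards. It assumes the full `lm_head.weight` is directly addressable on each rank, as in the snippet, which may not hold once the module is FSDP-sharded:

```python
import torch
import torch.distributed as dist

def snapshot_lm_head(model, n_rows=10):
    # Clone so the snapshot does not alias the live parameter.
    return model.lm_head.weight[:n_rows].detach().clone()

def report_weight_change(model, before, n_rows=10):
    dist.barrier()  # make sure every rank has finished its optimizer steps
    after = model.lm_head.weight[:n_rows].detach()
    if dist.get_rank() == 0:
        delta = (after - before.to(after.device)).abs().max().item()
        print("before:", before)
        print("after :", after)
        print(f"max |after - before| over first {n_rows} rows: {delta:.3e}")
```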