-
Hello,
I'm running `bitsandbytes==0.41.1` in a Python 3.10 Miniconda environment on 8x A100 GPUs (using `accelerate` for multi-GPU) with CUDA 12.2.
I'm having problems resuming training (DPO) from a ch…
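The report is cut off above. For reference, here is a minimal sketch of resuming a DPO run; it assumes TRL's `DPOTrainer` (the excerpt does not name the training library), and the model name, dataset, and output directory are placeholders. For multi-GPU it would be launched with `accelerate launch`.
```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# all names below are placeholders, not taken from the original report
model = AutoModelForCausalLM.from_pretrained("my-org/my-sft-model")
tokenizer = AutoTokenizer.from_pretrained("my-org/my-sft-model")

# DPO preference data needs "prompt"/"chosen"/"rejected" columns
train_ds = Dataset.from_dict({
    "prompt": ["Q: 2+2?"],
    "chosen": ["4"],
    "rejected": ["5"],
})

args = TrainingArguments(output_dir="dpo-ckpts", per_device_train_batch_size=1)
trainer = DPOTrainer(
    model=model,
    ref_model=None,  # TRL clones the policy as the frozen reference model
    args=args,
    beta=0.1,
    train_dataset=train_ds,
    tokenizer=tokenizer,
)

# resume_from_checkpoint=True picks up the latest checkpoint in output_dir
trainer.train(resume_from_checkpoint=True)
```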
-
## 🐛 Bug
### To Reproduce
Code:
```python
import os
import torch
import torch.distributed as tdist
import thunder
from thunder.tests.litgpt_model import GPT, Config
if __name__ == "__…
```
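Since the script is truncated mid-line, here is a hypothetical single-process sketch in the same spirit; the tiny `Config` values and the `thunder.jit` call are assumptions, not the reporter's actual code (which uses `torch.distributed`):
```python
import torch
import thunder
from thunder.tests.litgpt_model import GPT, Config

# hypothetical stand-in for the truncated script: a tiny config so the
# example runs on a single GPU
config = Config(block_size=128, vocab_size=320, padded_vocab_size=320,
                n_layer=2, n_head=4, n_embd=128)
model = GPT(config).to("cuda")
jitted = thunder.jit(model)  # compile the model with Thunder

x = torch.randint(0, config.padded_vocab_size, (1, 128), device="cuda")
logits = jitted(x)  # one forward pass through the jitted model
print(logits.shape)
```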
-
### Reminder
- [X] I have read the README and searched the existing issues.
### System Info
[2024-07-12 02:22:28,334] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda…
-
### 🐛 Describe the bug
I have a custom implementation of TP which uses a device mesh to lay out the tensors and then wraps them as DTensors. I then pass the device mesh into FSDP for wrapping. Concretely, I am…
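A minimal sketch of that layout, assuming PyTorch >= 2.2 and an 8-GPU run under `torchrun`; the mesh shape and the `nn.Linear` stand-in are placeholders for the reporter's actual TP model:
```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# hypothetical sketch; run with torchrun --nproc_per_node=8.
# 2D mesh: "dp" for FSDP sharding, "tp" for the custom DTensor layout.
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
tp_mesh = mesh_2d["tp"]  # the custom TP code would lay DTensors out over this
dp_mesh = mesh_2d["dp"]  # handed to FSDP for wrapping

model = nn.Linear(1024, 1024).cuda()  # stand-in for the real TP model
model = FSDP(model, device_mesh=dp_mesh, use_orig_params=True)
```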
-
### Description & Motivation
https://github.com/pytorch/pytorch/pull/104810 adds the recommendation that the `save` APIs be called within a single node (`shard_group`).
https://github.com/pyt…
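For illustration, a sketch of what "call `save` within a single node" could look like with the current API; the group construction, rank layout, and checkpoint path are assumptions, not the behavior proposed in the linked PRs:
```python
import torch.nn as nn
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

# hypothetical sketch: save from the ranks of one node (a "shard group")
# only. Assumes an 8-rank job where ranks 0-3 share the first node; gloo
# keeps the example CPU-only.
dist.init_process_group("gloo")
node0_ranks = [0, 1, 2, 3]
shard_group = dist.new_group(ranks=node0_ranks)  # collective: run on all ranks

model = nn.Linear(8, 8)  # placeholder model
if dist.get_rank() in node0_ranks:
    dcp.save({"model": model.state_dict()},
             checkpoint_id="ckpt/step-100",
             process_group=shard_group)
```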
-
### Bug description
See the deprecation warnings added in https://github.com/pytorch/pytorch/pull/113867
### What version are you seeing the problem on?
v2.2
### How to reproduce the bug
Or…
-
Here's the command I ran:
```bash
python train.py \
    --model_name meta-llama/Llama-2-70b-hf \
    --batch_size 1 \
    --context_length 1024 \
    --precision bf16 \
    --train_type hqq_lora \
    --use_gradient_ch…
```
-
### Please check that this issue hasn't been reported before.
- [X] I searched previous [Bug Reports](https://github.com/OpenAccess-AI-Collective/axolotl/labels/bug) and didn't find any similar reports…
-
### Description & Motivation
Both the Fabric and Trainer strategies are designed to have a single plugin enabled from the beginning to the end of the program.
This has been fine historically, ho…
-
**Feature Overview (aka. Goal Summary)**
Implement Intel Gaudi support in the InstructLab project, so that Gaudi 2 and Gaudi 3 can be used for SDG, evaluation, and training.
**Goals (aka. expected user out…