-
Hello,
Currently I am trying to run the qlora.py script with the 65B model on 2 A100 40GB GPUs, using
```accelerate launch qlora.py --args```
with ```--args``` being the ones given in the rep…
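For reference, the core of qlora.py's load step is a 4-bit NF4 quantized load sharded across the available GPUs. Below is a minimal sketch of that pattern; the model id and config values are illustrative assumptions, not the repo's exact arguments.
```python
# Illustrative sketch of a 4-bit NF4 load sharded across 2 GPUs.
# The model id and config values are assumptions, not qlora.py's exact arguments.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-65b",         # hypothetical model id
    quantization_config=bnb_config,
    device_map="auto",              # shards layers across both A100s
)
```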
-
I hit an error in Hugging Face for which there are, strangely, zero Google search results:
"ValueError: Calculated loss must be on the original device". I can see the source code for this error in huggingf…
-
### Bug description
A basic MNIST example breaks under the FSDP strategy when combining `automatic_optimization=False` with explicit calls to `manual_backward(loss)` (a minimal repro sketch follows).
The error seems to stem from…
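A minimal sketch of that setup, assuming Lightning 2.x (MNIST data loading elided; the module body is illustrative):
```python
# Minimal repro sketch: manual optimization + manual_backward under FSDP.
# Everything except the Lightning API calls is illustrative.
import torch
import torch.nn.functional as F
import lightning.pytorch as pl

class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # the setting from the report
        self.net = torch.nn.Linear(28 * 28, 10)

    def training_step(self, batch, batch_idx):
        x, y = batch
        opt = self.optimizers()
        opt.zero_grad()
        loss = F.cross_entropy(self.net(x.flatten(1)), y)
        self.manual_backward(loss)  # the call that reportedly breaks under FSDP
        opt.step()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

# trainer = pl.Trainer(strategy="fsdp", accelerator="gpu", devices=2)
# trainer.fit(ManualOptModel(), train_dataloaders=...)  # MNIST loader elided
```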
-
### System Info
torch 2.4.1
transformers 4.46.0.dev0
trl 0.11.2
peft 0.13.1
GPU V100
CUDA …
-
Multi-GPU training for Flux: is the Flux training script not supported at the moment?
-
We need to set up the test script for our training pipeline (a sketch for the integrity check follows the list).
- Data generation: @hungphongtrn
- [ ] Check the generated audio (the audio matches the prompt)
- [ ] Check the integrity of audio files wi…
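For the integrity item, a hedged sketch; the directory layout and the use of soundfile are assumptions, not part of the pipeline described above.
```python
# Hedged sketch: flag audio files that fail to open or contain zero frames.
# The directory layout and the soundfile dependency are assumptions.
from pathlib import Path
import soundfile as sf

def check_audio_dir(audio_dir: str) -> list[Path]:
    """Return paths that fail to open or are empty."""
    bad = []
    for path in Path(audio_dir).glob("*.wav"):
        try:
            with sf.SoundFile(path) as f:
                if f.frames == 0:
                    bad.append(path)
        except RuntimeError:  # libsndfile errors surface as RuntimeError subclasses
            bad.append(path)
    return bad
```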
-
Hi, I am trying to run the SFT step using 4 A100 80GB GPUs, and it reports this error: `starting 4 processes for FSDP training setting RLIMIT_NOFILE soft limit to 1048576 from 1048576 /opt/conda/lib/python3.8/multipr…
-
**Is your feature request related to a problem? Please describe.**
Checkpointing is significantly faster with Torch Distributed's async checkpoint feature: https://pytorch.org/docs/stable/distributed…
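A minimal sketch of the requested pattern, assuming a PyTorch recent enough to ship `torch.distributed.checkpoint.async_save`; the checkpoint path and state_dict contents are illustrative:
```python
# Illustrative async-save sketch; assumes torch.distributed is initialized and
# `model` is defined. The checkpoint path is hypothetical.
import torch.distributed.checkpoint as dcp

state_dict = {"model": model.state_dict()}
future = dcp.async_save(state_dict, checkpoint_id="checkpoints/step_100")
# ... training continues while the write happens in the background ...
future.result()  # block only when the checkpoint must be durable
```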
-
### 📚 The doc issue
The function signature is `optim_state_dict(model, optim, optim_state_dict=None, group=None)`, but the example calls `optim_state_dict = FSDP.optim_state_dict_to_load(optim_st…
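For contrast, a hedged save/load sketch following the documented `(model, optim, ...)` argument order; it assumes `model` is already FSDP-wrapped, `optim` was built on it, and the file name is illustrative:
```python
# Hedged sketch of save/load with the documented argument order.
# Assumes an initialized process group, an FSDP-wrapped `model`, and its `optim`.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Save: gather a loadable optimizer state dict (file name is illustrative).
osd = FSDP.optim_state_dict(model, optim)
if dist.get_rank() == 0:
    torch.save(osd, "optim_state.pt")

# Load: convert back to the sharded layout before handing it to the optimizer.
osd = torch.load("optim_state.pt")
osd_to_load = FSDP.optim_state_dict_to_load(model, optim, osd)
optim.load_state_dict(osd_to_load)
```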
-
### System Info
A100 Nvidia 80G GPU
### Who can help?
_No response_
### Information
- [ ] The official example scripts
- [ ] My own modified scripts
### Tasks
- [ ] An officially supported task…