-
```
model = AutoGPTQForCausalLM.from_quantized(
model_name,
#use_triton=True,
#warmup_triton=False,
trainable=True,
inject_fused_attention=False,
…
```
-
Hi,
What is the best way to run this on my high-performance laptop?
Should this work at all? Can I estimate how many days or weeks it will run?
Thanks in advance
Specs:
> OS: Win 11 (WSL2…
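One way to answer the how-many-days question yourself is to time a handful of steps and extrapolate. A minimal sketch (the `step_fn` callable is hypothetical and stands in for one full optimizer step on your hardware):

```python
import time

def estimate_total_hours(step_fn, total_steps, warmup=3, timed=10):
    """Time a few steps and extrapolate to the full run length in hours."""
    for _ in range(warmup):       # discard warmup steps (caches, JIT, allocator)
        step_fn()
    start = time.perf_counter()
    for _ in range(timed):        # time a small, steady-state sample
        step_fn()
    per_step = (time.perf_counter() - start) / timed
    return per_step * total_steps / 3600.0
```

Call it with a lambda wrapping one real training step and your planned step count; the result is a rough lower bound, since it ignores evaluation, checkpointing, and thermal throttling on a laptop.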
-
### 🐛 Describe the bug
I used FSDP + ShardedGradScaler to train my model. Compared with apex.amp + DDP, the precision of my model has decreased.
The DDP version looks like:
```
model, optimizer = amp.initial…
```
-
## Environment
- OS: [Ubuntu 23.06.30]
- Hardware (GPU, or instance type): [8xV100]
## The issue
I am trying Streaming Dataset with [Pytorch Lightning](https://lightning.ai/docs/pytorch/…
-
**Describe the bug**
I get the following error simply by changing the model from `llava1_6-mistral-7b-instruct` to `llava-onevision-qwen2-0_5b-ov` in the first DPO example [here](https://github.com/m…
-
### Bug description
Gradient synchronisation in `fabric.backward()` is broken after moving a model to CPU and then back to the GPU.
Moving a model temporarily to CPU is useful when GPU resource…
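One caveat worth noting when round-tripping a model through the CPU (a general PyTorch gotcha, not necessarily the root cause of this bug): `Module.to()` moves parameters but leaves optimizer state behind, so the next `optimizer.step()` can hit a device mismatch. A minimal helper, assuming a standard optimizer — this is an illustration of the caveat, not Fabric's own API:

```python
import torch

def move_optimizer_state(optimizer, device):
    """Move optimizer state tensors (e.g. Adam's exp_avg) to `device`.

    Module.to() relocates parameters but NOT the optimizer's state
    tensors, which stay on the device where they were first created.
    """
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)
```

Calling this alongside `model.to(device)` keeps parameters and optimizer state on the same device across the round trip.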
-
### 🐛 Describe the bug
When I try to finetune with DDP ([LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)) in WSL2 (Win10 host), I get this error:
```
DESKTOP-VMBL43V:1354:1354 [0] NCCL INFO …
```
-
I saw some code under [RWKV-LM/RWKV-v4neo/src/model.py](https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4neo/src/model.py) which requires CUDA to create the RWKV model.
I want to change the code by …
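A common way to drop a hard CUDA requirement is to select the device at construction time instead of hard-coding `"cuda"`. This is a generic pattern, not RWKV's actual code — the repo's custom WKV CUDA kernel would additionally need a pure-PyTorch fallback on CPU:

```python
import torch

# Pick CUDA when available, otherwise fall back to CPU, and thread the
# device through model construction instead of assuming a GPU exists.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(4, 4).to(device)   # stand-in for the RWKV module
x = torch.randn(2, 4, device=device)       # inputs created on the same device
y = model(x)
```

The same `device` object can then be passed down to any submodule or buffer allocation that previously assumed `"cuda"`.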
-
After adding my own modifications to the official code, I found that it runs in single-card mode but fails in the multi-card case. What should I do?
![image](https://github.com/open-mmlab/mmdete…
-
### Feature request
Token averaging in gradient accumulation was fixed in #34191, but token averaging in DDP seems to have the same issue.
---
## Expected behavior
With all the tokens contr…
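The mismatch the request describes can be shown with plain arithmetic — a toy two-rank simulation, not the Trainer's actual code:

```python
# Rank 0 sees 3 tokens, rank 1 sees 1 token; per-token losses below.
rank_token_losses = [[2.0, 2.0, 2.0], [8.0]]

# Per-rank mean then equal-weight average across ranks (what DDP's
# gradient averaging effectively does): ranks count equally, tokens don't.
naive = sum(sum(t) / len(t) for t in rank_token_losses) / len(rank_token_losses)

# Token-weighted average: sum losses and token counts across all ranks
# (an all_reduce in a real run), then divide once.
total_loss = sum(sum(t) for t in rank_token_losses)
total_tokens = sum(len(t) for t in rank_token_losses)
weighted = total_loss / total_tokens
```

Here `naive` gives 5.0 while the token-weighted mean is 3.5, so ranks with fewer tokens are over-weighted unless the token counts are reduced across ranks too.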
sbwww updated 1 month ago