-
I am currently trying to retrain the BLIP-2 architecture on a multi-GPU setup using the default torch DDP implementation in the LAVIS library.
My training proceeds fine until some steps with consol…
-
I am attempting fine-tuning with my custom dataset; however, the training progress stays at 0% and does not increase at all, even after 20 hours of running time:
```
Train: 0%| …
```
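One common cause of a progress bar frozen at 0% (an assumption here, since the log is truncated) is that the dataloader never yields its first batch. A quick isolation test is to iterate the DataLoader directly, outside the training loop; `DummyDataset` below is a hypothetical stand-in for the custom dataset:

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    """Hypothetical stand-in; swap in the real custom dataset."""
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        return torch.randn(4), idx

# If this loop also stalls with the real dataset, the problem is in data
# loading (e.g. __getitem__ blocking on I/O), not in the training step.
loader = DataLoader(DummyDataset(), batch_size=2, num_workers=0)
start = time.time()
for batch, idx in loader:
    print(f"got batch of shape {tuple(batch.shape)} after {time.time() - start:.2f}s")
```

Running with `num_workers=0` first rules out worker-process deadlocks; if the loop is fine single-threaded but stalls with workers, the issue is in the multiprocessing path.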
-
### 🐛 Describe the bug
```
class RFNO(nn.Module):
    def __init__(self, out_channels=64, modes1=64, modes2=64):
        super(RFNO, self).__init__()
        self.out_channels = out_channels
        …
```
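For context, here is a self-contained sketch of what a Fourier layer of this shape typically looks like, written in the style of FNO spectral convolutions. This is an illustration, not the reporter's elided code; every name and shape below is an assumption:

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    """Sketch of an FNO-style spectral convolution (assumed, not the reporter's code)."""
    def __init__(self, in_channels, out_channels, modes1, modes2):
        super().__init__()
        self.modes1, self.modes2 = modes1, modes2
        scale = 1 / (in_channels * out_channels)
        # Complex weights over the retained low-frequency modes.
        self.weight = nn.Parameter(
            scale * torch.randn(in_channels, out_channels, modes1, modes2,
                                dtype=torch.cfloat)
        )

    def forward(self, x):
        # x: (batch, in_channels, H, W) -> real 2D FFT over the last two dims.
        x_ft = torch.fft.rfft2(x)
        out_ft = torch.zeros(x.size(0), self.weight.size(1),
                             x.size(-2), x.size(-1) // 2 + 1,
                             dtype=torch.cfloat, device=x.device)
        # Multiply only the retained modes; the rest stay zero.
        out_ft[:, :, :self.modes1, :self.modes2] = torch.einsum(
            "bixy,ioxy->boxy",
            x_ft[:, :, :self.modes1, :self.modes2], self.weight)
        return torch.fft.irfft2(out_ft, s=x.shape[-2:])
```

One detail that may be relevant to a DDP bug report: complex-valued parameters like `self.weight` have historically needed special handling under DDP (e.g. storing the weight as a real tensor via `torch.view_as_real`), though whether that applies here depends on the elided part of the report.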
-
Hi,
I am getting the following warning when training the model with the FFCV dataloader + DDP:
> [W reducer.cpp:362] Warning: Grad strides do not match bucket view strides. This may indicate grad was …
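This warning fires when a parameter's gradient arrives with strides that differ from DDP's contiguous bucket views, so the reducer has to copy it. A frequent trigger (an assumption about this setup) is mixing memory formats, e.g. `channels_last` tensors, whose strides differ from the contiguous layout even though the values are identical:

```python
import torch

x = torch.randn(2, 3, 4, 4)
print(x.stride())   # (48, 16, 4, 1) — contiguous NCHW layout

xc = x.to(memory_format=torch.channels_last)
print(xc.stride())  # (48, 1, 12, 3) — same values, different memory layout
assert torch.equal(x, xc)  # only the strides differ, not the data
```

A commonly suggested mitigation is to keep the model and inputs in one consistent memory format, or to pass `gradient_as_bucket_view=True` to `DistributedDataParallel` so gradients are written directly into the bucket views.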
-
I use the settings below to train a Flux LoRA:
```
accelerate launch --gpu_ids 0,1 --main_process_port 29502 --mixed_precision bf16 --num_cpu_threads_per_process=2 \
flux_train_network.py --pr…
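One thing worth checking with `--gpu_ids 0,1` (a guess, since the command is truncated) is that Accelerate is actually told to spawn one process per GPU; without that it may run a single process. A sketch using Accelerate's own CLI flags, with the script's remaining arguments elided as in the original:

```shell
# Assumption: two processes, one per GPU. --num_processes sets the worker
# count and --multi_gpu selects the multi-GPU launcher.
accelerate launch --multi_gpu --num_processes 2 \
  --gpu_ids 0,1 --main_process_port 29502 --mixed_precision bf16 \
  --num_cpu_threads_per_process=2 \
  flux_train_network.py ...
```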
-
**Describe the bug**
When training a model that consumes more memory, I noticed that my training would stop after a constant number of epochs. Upon further investigation, I found that during training / v…
-
I noticed in your reply that you used 39 GB with A100 training. I used four 4090 GPUs for training but still got an out-of-memory error. I wonder if you could provide a version for multi-GPU…
-
### 🐛 Describe the bug
The process works correctly with DDP world size 1, but with world size > 1 it hangs, with GPU 0 at 0% and GPU 1 pinned at max occupancy. I've replicated this bot…
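A hang with one rank idle and another pinned at full occupancy often points at a stalled collective. Assuming the NCCL backend (not stated in the truncated report), a common first debugging step is to turn on NCCL's logging and rule out the peer-to-peer transport, which is a known trouble spot on some consumer GPUs; `train.py` below is a placeholder for the actual entry point:

```shell
# Print NCCL's internal logging to see which collective, if any, stalls.
export NCCL_DEBUG=INFO
# Quick check: disable P2P transport (costs bandwidth, but isolates the cause).
export NCCL_P2P_DISABLE=1
torchrun --nproc_per_node=2 train.py
```

If the hang disappears with `NCCL_P2P_DISABLE=1`, the problem is in the GPU-to-GPU transport rather than in the training code itself.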
-
Is there a way to add Dreambooth / TI / Hypernetwork training with PyTorch Lightning's Trainer class using the DDP strategy, as featured in @XavierXiao's repo? It allows for a very pain-free e…
-
### 🐛 Describe the bug
The DDP init call fails when using a subclass of torch.Tensor; the same code works with torch.Tensor.
Command to run the code:
```
python test.py --max-gpus 2 --batch-size 512 --epoch …
```
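A minimal way to exercise the subclass path without the full two-GPU script is shown below; `MyTensor` is a hypothetical stand-in for the reporter's subclass, since the actual class is not included in the excerpt:

```python
import torch

class MyTensor(torch.Tensor):
    """Hypothetical minimal torch.Tensor subclass."""
    pass

# as_subclass produces the subclass type, and ordinary ops preserve it
# through the default __torch_function__ machinery...
t = torch.randn(4).as_subclass(MyTensor)
u = t * 2
print(type(t).__name__, type(u).__name__)

# ...but DDP's init-time parameter broadcast takes a different code path,
# which is where the reported failure shows up at world size > 1.
```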