-
I am currently trying to retrain the BLIP-2 architecture on a multi-GPU setup using the default torch DDP implementation in the LAVIS library.
My training proceeds fine until some steps with consol…
-
I am attempting fine-tuning with my custom dataset; however, the training progress stays at 0% and does not increase at all, even after 20 hours of running time:
```
Train: 0%| …
```
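One common cause of a progress bar frozen at 0% (an assumption here, since the log is truncated) is that the dataloader never yields its first batch. A quick isolation test is to iterate the DataLoader directly, outside the training loop; `DummyDataset` below is a hypothetical stand-in for the custom dataset:

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    """Hypothetical stand-in; swap in the real custom dataset."""
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        return torch.randn(4), idx

# If this loop also stalls with the real dataset, the problem is in data
# loading (e.g. __getitem__ blocking on I/O), not in the training step.
loader = DataLoader(DummyDataset(), batch_size=2, num_workers=0)
start = time.time()
for batch, idx in loader:
    print(f"got batch of shape {tuple(batch.shape)} after {time.time() - start:.2f}s")
```

Running with `num_workers=0` first rules out worker-process deadlocks; if the loop is fine single-threaded but stalls with workers, the issue is in the multiprocessing path.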
-
### 🐛 Describe the bug
```
class RFNO(nn.Module):
    def __init__(self, out_channels=64, modes1=64, modes2=64):
        super(RFNO, self).__init__()
        self.out_channels = out_channels
        …
```
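For context, here is a self-contained sketch of what a Fourier layer of this shape typically looks like, written in the style of FNO spectral convolutions. This is an illustration, not the reporter's elided code; every name and shape below is an assumption:

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    """Sketch of an FNO-style spectral convolution (assumed, not the reporter's code)."""
    def __init__(self, in_channels, out_channels, modes1, modes2):
        super().__init__()
        self.modes1, self.modes2 = modes1, modes2
        scale = 1 / (in_channels * out_channels)
        # Complex weights over the retained low-frequency modes.
        self.weight = nn.Parameter(
            scale * torch.randn(in_channels, out_channels, modes1, modes2,
                                dtype=torch.cfloat)
        )

    def forward(self, x):
        # x: (batch, in_channels, H, W) -> real 2D FFT over the last two dims.
        x_ft = torch.fft.rfft2(x)
        out_ft = torch.zeros(x.size(0), self.weight.size(1),
                             x.size(-2), x.size(-1) // 2 + 1,
                             dtype=torch.cfloat, device=x.device)
        # Multiply only the retained modes; the rest stay zero.
        out_ft[:, :, :self.modes1, :self.modes2] = torch.einsum(
            "bixy,ioxy->boxy",
            x_ft[:, :, :self.modes1, :self.modes2], self.weight)
        return torch.fft.irfft2(out_ft, s=x.shape[-2:])
```

One detail that may be relevant to a DDP bug report: complex-valued parameters like `self.weight` have historically needed special handling under DDP (e.g. storing the weight as a real tensor via `torch.view_as_real`), though whether that applies here depends on the elided part of the report.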
-
Hi,
I am getting the following warning when training the model with the FFCV dataloader + DDP:
> [W reducer.cpp:362] Warning: Grad strides do not match bucket view strides. This may indicate grad was …
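This warning fires when a parameter's gradient arrives with strides that differ from DDP's contiguous bucket views, so the reducer has to copy it. A frequent trigger (an assumption about this setup) is mixing memory formats, e.g. `channels_last` tensors, whose strides differ from the contiguous layout even though the values are identical:

```python
import torch

x = torch.randn(2, 3, 4, 4)
print(x.stride())   # (48, 16, 4, 1) — contiguous NCHW layout

xc = x.to(memory_format=torch.channels_last)
print(xc.stride())  # (48, 1, 12, 3) — same values, different memory layout
assert torch.equal(x, xc)  # only the strides differ, not the data
```

A commonly suggested mitigation is to keep the model and inputs in one consistent memory format, or to pass `gradient_as_bucket_view=True` to `DistributedDataParallel` so gradients are written directly into the bucket views.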
-
I use the settings below to train a Flux LoRA:
```
accelerate launch --gpu_ids 0,1 --main_process_port 29502 --mixed_precision bf16 --num_cpu_threads_per_process=2 \
flux_train_network.py --pr…
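One thing worth checking with `--gpu_ids 0,1` (a guess, since the command is truncated) is that Accelerate is actually told to spawn one process per GPU; without that it may run a single process. A sketch using Accelerate's own CLI flags, with the script's remaining arguments elided as in the original:

```shell
# Assumption: two processes, one per GPU. --num_processes sets the worker
# count and --multi_gpu selects the multi-GPU launcher.
accelerate launch --multi_gpu --num_processes 2 \
  --gpu_ids 0,1 --main_process_port 29502 --mixed_precision bf16 \
  --num_cpu_threads_per_process=2 \
  flux_train_network.py ...
```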
-
**Describe the bug**
When training a model that consumes more memory, I noticed that my training would stop after a constant number of epochs. Upon further investigation, I found that during training / v…
-
I noticed in your reply that you used 39 GB with A100 training. I used four 4090 GPUs for training but still got an out-of-memory error. I wonder if you could provide a version for multi-GPU…
-
### 🐛 Describe the bug
The process works correctly with DDP world size 1, but with world size > 1 it hangs, with GPU 0 at 0% and GPU 1 pinned at max occupancy. I've replicated this bot…
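A hang with one rank idle and another pinned at full occupancy often points at a stalled collective. Assuming the NCCL backend (not stated in the truncated report), a common first debugging step is to turn on NCCL's logging and rule out the peer-to-peer transport, which is a known trouble spot on some consumer GPUs; `train.py` below is a placeholder for the actual entry point:

```shell
# Print NCCL's internal logging to see which collective, if any, stalls.
export NCCL_DEBUG=INFO
# Quick check: disable P2P transport (costs bandwidth, but isolates the cause).
export NCCL_P2P_DISABLE=1
torchrun --nproc_per_node=2 train.py
```

If the hang disappears with `NCCL_P2P_DISABLE=1`, the problem is in the GPU-to-GPU transport rather than in the training code itself.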
-
Is there a way to add Dreambooth / TI / Hypernetwork training with PyTorch Lightning's Trainer class using the DDP strategy, as featured in @XavierXiao's repo? It allows for a very pain-free e…
-
### 🐛 Describe the bug
The DDP init call fails when using a subclass of torch.Tensor; the same code works with torch.Tensor.
Command to run the code:
```
python test.py --max-gpus 2 --batch-size 512 --epoch …
```
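A minimal way to exercise the subclass path without the full two-GPU script is shown below; `MyTensor` is a hypothetical stand-in for the reporter's subclass, since the actual class is not included in the excerpt:

```python
import torch

class MyTensor(torch.Tensor):
    """Hypothetical minimal torch.Tensor subclass."""
    pass

# as_subclass produces the subclass type, and ordinary ops preserve it
# through the default __torch_function__ machinery...
t = torch.randn(4).as_subclass(MyTensor)
u = t * 2
print(type(t).__name__, type(u).__name__)

# ...but DDP's init-time parameter broadcast takes a different code path,
# which is where the reported failure shows up at world size > 1.
```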