I managed to get training running on a single GPU, even though the training code was not originally written for that. However, full fine-tuning requires more than 12 GB of VRAM, which could be expected but is definitely a big drawback for most users with consumer GPUs: if 12 GB is not enough, then not many cards can actually benefit from the marketed fast training and inference speed.
File "Wuerstchen\train_stage_B.py", line 379, in <module>
train(0, 1, 1)
File "Wuerstchen\train_stage_B.py", line 252, in train
loss = criterion(pred, latents)
File "v2\train\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "v2\train\lib\site-packages\torch\nn\modules\loss.py", line 1174, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "v2\train\lib\site-packages\torch\nn\functional.py", line 3029, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 11.99 GiB total capacity; 10.84 GiB already allocated; 0 bytes free; 10.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
wandb: Waiting for W&B process to finish... (failed 1). Press Ctrl-C to abort syncing.
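The OOM message itself suggests tuning the allocator via PYTORCH_CUDA_ALLOC_CONF, although here reserved memory (10.98 GiB) is barely above allocated memory (10.84 GiB), so fragmentation tuning may not buy much. As a minimal sketch (not part of the Wuerstchen training code, and no guarantee it avoids the OOM), the allocator option can be set before torch makes its first CUDA allocation:

import os

# Must be set before the first CUDA allocation, ideally before importing torch.
# 128 MiB is an arbitrary example value for the maximum split size.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # import after setting the env var

The same effect can be had from the shell by exporting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 before launching train_stage_B.py.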