bmaltais / kohya_ss

Apache License 2.0

cuda: Out of memory issue on rtx3090 (24GB vram) #3

Closed. dikasterion closed this issue 1 year ago.

dikasterion commented 1 year ago

Hi, I successfully managed to run your repo with the SD1.5 model. Now I'm trying to run the SD2.0 768 model, but I get a CUDA out of memory error.

I have 23 training images (768*768) in a 20_person folder under the train_person folder. I tried lowering the batch size and disabling cache latents (setting it to 0). Here are the settings I ran in PowerShell with the venv (virtual environment):

# variable values

$pretrained_model_name_or_path = "D:\stable-diffusion-webui\models\Stable-diffusion\768-v-ema.ckpt"
$data_dir = "D:\kohya_ss\zwx_person_db\train_person"
$logging_dir = "D:\kohya_ss\log"
$output_dir = "D:\kohya_ss\output"
$resolution = "768,768"
$lr_scheduler = "polynomial"
$cache_latents = 0 # 1 = true, 0 = false

$image_num = Get-ChildItem $data_dir -Recurse -File -Include *.png, *.jpg, *.webp | Measure-Object | %{$_.Count}

Write-Output "image_num: $image_num"

$dataset_repeats = 2000
$learning_rate = 1e-6
$train_batch_size = 1
$epoch = 1
$save_every_n_epochs = 1
$mixed_precision = "bf16"
$num_cpu_threads_per_process = 6

# You should not have to change values past this point

if ($cache_latents -eq 1) { $cache_latents_value="--cache_latents" } else { $cache_latents_value="" }

$repeats = $image_num * $dataset_repeats
$mts = [Math]::Ceiling($repeats / $train_batch_size * $epoch)

Write-Output "Repeats: $repeats"

cd D:\kohya_ss
.\venv\Scripts\activate

accelerate launch --num_cpu_threads_per_process $num_cpu_threads_per_process train_db_fixed.py `
    --v2 --v_parameterization `
    --pretrained_model_name_or_path=$pretrained_model_name_or_path `
    --train_data_dir=$data_dir `
    --output_dir=$output_dir `
    --resolution=$resolution `
    --train_batch_size=$train_batch_size `
    --learning_rate=$learning_rate `
    --max_train_steps=$mts `
    --use_8bit_adam --xformers `
    --mixed_precision=$mixed_precision $cache_latents_value `
    --save_every_n_epochs=$save_every_n_epochs `
    --logging_dir=$logging_dir `
    --save_precision="fp16" `
    --seed=494481440 `
    --lr_scheduler=$lr_scheduler

# Add the inference 768v yaml file along with the model for proper loading. It needs to have the same name as the model... most likely "last.yaml" in our case.

cp v2_inference\v2-inference-v.yaml $output_dir"\last.yaml"

And here's the error message I got:

steps:   0%|          | 0/46000 [00:00<?, ?it/s]
epoch 1/100
Traceback (most recent call last):
  File "D:\kohya_ss\train_db_fixed.py", line 2098, in <module>
    train(args)
  File "D:\kohya_ss\train_db_fixed.py", line 1948, in train
    optimizer.step()
  File "D:\kohya_ss\venv\lib\site-packages\accelerate\optimizer.py", line 134, in step
    self.scaler.step(self.optimizer, closure)
  File "C:\Users\Donny\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 338, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "C:\Users\Donny\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 285, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "C:\Users\Donny\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\optim\lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "C:\Users\Donny\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\optim\optimizer.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Donny\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "D:\kohya_ss\venv\lib\site-packages\bitsandbytes\optim\optimizer.py", line 263, in step
    self.init_state(group, p, gindex, pindex)
  File "C:\Users\Donny\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "D:\kohya_ss\venv\lib\site-packages\bitsandbytes\optim\optimizer.py", line 401, in init_state
    state["state2"] = torch.zeros_like(
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 24.00 GiB total capacity; 10.13 GiB already allocated; 8.91 GiB free; 10.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps:   0%|          | 0/46000 [00:30<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\Donny\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Donny\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "D:\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "D:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1069, in launch_command
    simple_launcher(args)
  File "D:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 551, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\kohya_ss\venv\Scripts\python.exe', 'train_db_fixed.py', '--v2', '--v_parameterization', '--pretrained_model_name_or_path=D:\stable-diffusion-webui\models\Stable-diffusion\768-v-ema.ckpt', '--train_data_dir=D:\kohya_ss\zwx_person_db\train_person', '--output_dir=D:\kohya_ss\output', '--resolution=768,768', '--train_batch_size=1', '--learning_rate=1E-06', '--max_train_steps=46000', '--use_8bit_adam', '--xformers', '--mixed_precision=bf16', '--save_every_n_epochs=1', '--logging_dir=D:\kohya_ss\log', '--save_precision=fp16', '--seed=494481440', '--lr_scheduler=polynomial']' returned non-zero exit status 1.
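The error message itself suggests trying max_split_size_mb when reserved memory is much larger than allocated memory. A minimal sketch of setting that allocator option from PowerShell before launching (the 128 value here is just an illustrative assumption, not a recommendation from this repo):

```powershell
# Set the PyTorch CUDA allocator config for this PowerShell session only.
# max_split_size_mb limits block splitting to reduce fragmentation; see the
# PyTorch memory-management docs referenced in the error message.
$env:PYTORCH_CUDA_ALLOC_CONF = "max_split_size_mb:128"

# Then launch training as before, e.g.:
# accelerate launch --num_cpu_threads_per_process $num_cpu_threads_per_process train_db_fixed.py ...
```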

dikasterion commented 1 year ago

Oh, I think it's running with global packages even though I've followed all the steps in the readme page... Does anyone have any idea why? I activated ".\venv\Scripts\activate" and ran the setup above inside the virtual environment...
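The traceback does show torch being loaded from the global Python310 site-packages rather than the venv, so one quick check is to confirm which interpreter and torch install actually resolve after activation. A diagnostic sketch (not from the repo's docs):

```powershell
# With the venv activated, confirm which python.exe resolves first on PATH:
Get-Command python | Select-Object -ExpandProperty Source
# Expected: something like D:\kohya_ss\venv\Scripts\python.exe

# Confirm which interpreter runs and where torch is imported from:
python -c "import sys, torch; print(sys.executable); print(torch.__file__)"
```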

toyxyz commented 1 year ago

How about using 8-bit Adam and xformers? They can reduce VRAM usage.

dikasterion commented 1 year ago

> How about using 8-bit Adam and xformers? They can reduce VRAM usage.

Oh, as in the settings above, I already enabled 8-bit Adam and xformers.

I finally managed to run v2 768 successfully. I ran Windows PowerShell as administrator, and then I could train with a batch size of 1. It fails with 2 or above, but I'm content with what I got. Thank you, guys.
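As a side note, the OOM above hit with only about 10 GiB allocated/reserved by PyTorch on a 24 GiB card, so when batch size 2 still fails it can be worth checking what else is holding VRAM (for example, a web UI left running on the same GPU). A general diagnostic sketch, not something from this thread:

```powershell
# Poll GPU memory once per second while training runs (Ctrl+C to stop).
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1

# Or list the processes currently holding GPU memory:
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```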