[Bug]: {LORA TRAINING} RuntimeError: CUDA error: the launch timed out and was terminated

Is there an existing issue for this?

[X] I have searched the existing issues and checked the recent builds/commits

What happened?

Training a LoRA fails early in the run (the progress bar shows 3/5400 steps) with the error below:

RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

steps: 0%| | 3/5400 [02:26<73:09:01, 48.79s/it, loss=0.122]

Steps to reproduce the problem

Open the kohya GUI
Start LoRA training
The error occurs after a few steps

Current version:

15:13:32-093626 INFO Version: v21.8.7

15:13:32-099628 INFO nVidia toolkit detected
15:13:33-744996 INFO Torch 2.0.1+cu118
15:13:33-761005 INFO Torch backend: nVidia CUDA 11.8 cuDNN 8700
15:13:33-763005 INFO Torch detected GPU: NVIDIA GeForce RTX 4060 Ti VRAM 8187 Arch (8, 9) Cores 34
15:13:33-764006 INFO Verifying modules instalation status from requirements_windows_torch2.txt...
15:13:33-766506 INFO Verifying modules instalation status from requirements.txt...
15:13:36-493311 INFO headless: False
15:13:36-496357 INFO Load CSS...
Running on local URL: http://127.0.0.1:7860

What should have happened?

LoRA training should run to completion without a CUDA error.

Version or Commit where the problem happens

n/a

What Python version are you running on ?

Python 3.10.x

What platforms do you use to access the UI ?

Windows

What device are you running WebUI on?

Nvidia GPUs (RTX 20 above)

Cross attention optimization

xformers

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

@echo off

set PYTHON=
set GIT=
set VENV_DIR=
set COMMANDLINE_ARGS=--xformers --no-half-vae
set CUDA_LAUNCH_BLOCKING=1
call webui.bat
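
A note on the `CUDA_LAUNCH_BLOCKING=1` line above: the batch file sets it for whatever `webui.bat` starts, and it is an assumption on my part that the kohya training run is a separate process that may not inherit it. The sketch below shows one way to pass the variable explicitly to a child process from Python. The helper name `run_with_blocking_launches` and the probe command are hypothetical stand-ins, not part of kohya or the webui; the real training command would replace the probe.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(cmd):
    """Run cmd with CUDA_LAUNCH_BLOCKING=1 added to a copy of the current environment."""
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# Stand-in child process that only reports what it sees.
# The actual training command would go here instead (not shown, since it is not in this report).
probe = [sys.executable, "-c", "import os; print(os.environ.get('CUDA_LAUNCH_BLOCKING'))"]
result = run_with_blocking_launches(probe)
seen_by_child = result.stdout.strip()
print("child sees CUDA_LAUNCH_BLOCKING =", seen_by_child)
```

Passing the environment explicitly avoids relying on which launcher script happened to start the process.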

List of extensions

n/a

Console logs

Error below:

RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

steps:   0%|                                                           | 3/5400 [02:26<73:09:01, 48.79s/it, loss=0.122]
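
As a sanity check on the progress line above, here is a minimal sketch that parses it and recomputes the remaining time from the reported step rate. The function name `parse_progress` is my own, and the regular expression assumes the tqdm-style layout shown in this log.

```python
import re

# Progress line copied from the console log (bar padding shortened).
PROGRESS = "steps:   0%|          | 3/5400 [02:26<73:09:01, 48.79s/it, loss=0.122]"


def parse_progress(line):
    """Extract (steps_done, steps_total, seconds_per_step) from a tqdm-style line."""
    match = re.search(r"(\d+)/(\d+) \[.*?, ([\d.]+)s/it", line)
    if match is None:
        raise ValueError("not a recognised progress line")
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


done, total, sec_per_step = parse_progress(PROGRESS)

# Remaining time implied by the reported rate, in hours (about 73 hours here).
remaining_hours = (total - done) * sec_per_step / 3600
print(f"{done}/{total} steps at {sec_per_step} s/step, about {remaining_hours:.1f} h remaining")
```

At roughly 49 seconds per step, each step runs far longer than a typical GPU kernel launch, which is consistent with a launch timeout being hit, though the log alone does not confirm the cause.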

Additional information

Additional specs: 64 GB RAM total (4 x 16 GB sticks); NVIDIA GeForce RTX 4060 Ti GPU with 8 GB VRAM.

AUTOMATIC1111 / stable-diffusion-webui