Linaqruf / kohya-trainer

Adapted from https://note.com/kohya_ss/n/nbf7ce8d80f29 for easier cloning
Apache License 2.0
1.82k stars, 296 forks

Training LoRA always stops at epoch 4 of 10 #316

Open PsypmP opened 8 months ago

PsypmP commented 8 months ago

Hardware: RTX 4080 laptop GPU, 12 GB VRAM.

RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

(remainder of the traceback omitted)

raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['F:\!NeuroNet\LORA\Kohya\venv\Scripts\python.exe', './sdxl_train_network.py', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--train_data_dir=F:/!NeuroNet/LORA/TRAIN3\img', '--reg_data_dir=F:/!NeuroNet/LORA/TRAIN3\reg', '--resolution=1024,1024', '--output_dir=F:/!NeuroNet/LORA/TRAIN3\model', '--logging_dir=F:/!NeuroNet/LORA/TRAIN3\log', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=0.0003', '--unet_lr=0.0003', '--network_dim=128', '--output_name=AlexeyF', '--lr_scheduler_num_cycles=10', '--no_half_vae', '--learning_rate=0.0003', '--lr_scheduler=constant', '--train_batch_size=1', '--max_train_steps=8000', '--save_every_n_epochs=1', '--mixed_precision=bf16', '--save_precision=bf16', '--caption_extension=.txt', '--cache_latents', '--cache_latents_to_disk', '--optimizer_type=Adafactor', '--optimizer_args', 'scale_parameter=False', 'relative_step=False', 'warmup_init=False', '--max_data_loader_n_workers=0', '--bucket_reso_steps=64', '--gradient_checkpointing', '--xformers', '--bucket_no_upscale', '--noise_offset=0.0']' returned non-zero exit status 1.

Config:

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" --train_data_dir="F:/!NeuroNet/LORA/TRAIN3\img" --reg_data_dir="F:/!NeuroNet/LORA/TRAIN3\reg" --resolution="1024,1024" --output_dir="F:/!NeuroNet/LORA/TRAIN3\model" --logging_dir="F:/!NeuroNet/LORA/TRAIN3\log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0003 --unet_lr=0.0003 --network_dim=128 --output_name="AlexeyF" --lr_scheduler_num_cycles="10" --no_half_vae --learning_rate="0.0003" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="8000" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.0
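
For anyone comparing this run against a working one, it may help to split the launch command into a flag-to-value mapping so the settings can be read or diffed one at a time. The following is only a minimal sketch using the Python standard library. `parse_flags` is a hypothetical helper name, not part of the training scripts, and `COMMAND` is a shortened excerpt of the command above, not the full configuration.

```python
import shlex


def parse_flags(command: str) -> dict:
    """Map each --flag in a launch command to its value(s).

    Flags written as --name=value map to the value string.
    Flags with no value map to True.
    Flags followed by bare tokens (e.g. --optimizer_args) map to a list.
    Tokens that appear before any flag (program name, script path) are ignored.
    """
    flags = {}
    current = None  # name of the last flag that may still take bare tokens
    for tok in shlex.split(command):
        if tok.startswith("--"):
            name, sep, value = tok[2:].partition("=")
            flags[name] = value if sep else True
            current = None if sep else name
        elif current is not None:
            # Bare token after a value-less flag: collect it as a list item.
            if flags[current] is True:
                flags[current] = [tok]
            else:
                flags[current].append(tok)
    return flags


# Shortened excerpt of the command from this report (hypothetical selection).
COMMAND = (
    'accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" '
    '--resolution="1024,1024" --network_dim=128 --train_batch_size="1" '
    '--max_train_steps="8000" --mixed_precision="bf16" --cache_latents '
    '--optimizer_type="Adafactor" --optimizer_args scale_parameter=False '
    'relative_step=False warmup_init=False --gradient_checkpointing --xformers'
)

flags = parse_flags(COMMAND)

if __name__ == "__main__":
    for name, value in sorted(flags.items()):
        print(f"{name}: {value}")
```

Running the same helper on the command line of a run that completes and comparing the two dictionaries would show which settings differ, without implying that any particular flag is the cause of the error.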