bmaltais / kohya_ss


Unrecognized arguments when using train_db.py command in batch file #1690

Closed: FugueSegue closed this issue 11 months ago

FugueSegue commented 12 months ago

For the last several months I've been using a script I wrote that automatically generates a batch file for training. It had been working perfectly until recently. In the process of trying to figure out why SD15 LoRA extraction does not retain the instance token (a separate issue that STILL hasn't been fixed for months, apparently), I did a fresh install of Windows and Kohya. I don't know what has happened, but now I can't train using a batch file at all. I am able to train within the GUI itself. But even when I replicate the command the GUI used for training, it doesn't work.

This is the batch file I tried to use for an SD15 LoRA training:

```bat
cmd /k "cd C:\ai\kohya_ss\venv\Scripts & activate & cd.. & cd.. & cd C:\ai\kohya_ss & accelerate launch --num_cpu_threads_per_process=2 "./train_db.py" --pretrained_model_name_or_path="C:\ai\models\checkpoints\v1-5-pruned.safetensors" --train_data_dir="C:\ai\trainings\ridleyd_231116a_g1bc\img" --reg_data_dir="C:\ai\trainings\ridleyd_231116a_g1bc\reg" --resolution="512,512" --output_dir="C:\ai\trainings\ridleyd_231116a_g1bc\model" --logging_dir="C:\ai\trainings\ridleyd_231116a_g1bc\log" --network_alpha="1" --training_comment="daisy ridley woman" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=256 --output_name="ridleyd_231116a_g1bc" --lr_scheduler_num_cycles="10" --no_half_vae --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="10000" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --max_data_loader_n_workers="0" --bucket_reso_steps=64 --bucket_no_upscale --noise_offset=0.0 "
```

This was the command window output:

```
C:\ai\trainings\ridleyd_231116a_g1bc>cmd /k "cd C:\ai\kohya_ss\venv\Scripts & activate & cd.. & cd.. & cd C:\ai\kohya_ss & accelerate launch --num_cpu_threads_per_process=2 "./train_db.py" --pretrained_model_name_or_path="C:\ai\models\checkpoints\v1-5-pruned.safetensors" --train_data_dir="C:\ai\trainings\ridleyd_231116a_g1bc\img" --reg_data_dir="C:\ai\trainings\ridleyd_231116a_g1bc\reg" --resolution="512,512" --output_dir="C:\ai\trainings\ridleyd_231116a_g1bc\model" --logging_dir="C:\ai\trainings\ridleyd_231116a_g1bc\log" --network_alpha="1" --training_comment="daisy ridley woman" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=256 --output_name="ridleyd_231116a_g1bc" --lr_scheduler_num_cycles="10" --no_half_vae --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="10000" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --max_data_loader_n_workers="0" --bucket_reso_steps=64 --bucket_no_upscale --noise_offset=0.0 "
usage: train_db.py [-h] [--v2] [--v_parameterization]
                   [--pretrained_model_name_or_path PRETRAINED_MODEL_NAME_OR_PATH]
                   [--tokenizer_cache_dir TOKENIZER_CACHE_DIR]
                   [--train_data_dir TRAIN_DATA_DIR] [--shuffle_caption]
                   [--caption_extension CAPTION_EXTENSION]
                   [--caption_extention CAPTION_EXTENTION]
                   [--keep_tokens KEEP_TOKENS] [--caption_prefix CAPTION_PREFIX]
                   [--caption_suffix CAPTION_SUFFIX] [--color_aug] [--flip_aug]
                   [--face_crop_aug_range FACE_CROP_AUG_RANGE] [--random_crop]
                   [--debug_dataset] [--resolution RESOLUTION] [--cache_latents]
                   [--vae_batch_size VAE_BATCH_SIZE] [--cache_latents_to_disk]
                   [--enable_bucket] [--min_bucket_reso MIN_BUCKET_RESO]
                   [--max_bucket_reso MAX_BUCKET_RESO]
                   [--bucket_reso_steps BUCKET_RESO_STEPS] [--bucket_no_upscale]
                   [--token_warmup_min TOKEN_WARMUP_MIN]
                   [--token_warmup_step TOKEN_WARMUP_STEP]
                   [--dataset_class DATASET_CLASS]
                   [--caption_dropout_rate CAPTION_DROPOUT_RATE]
                   [--caption_dropout_every_n_epochs CAPTION_DROPOUT_EVERY_N_EPOCHS]
                   [--caption_tag_dropout_rate CAPTION_TAG_DROPOUT_RATE]
                   [--reg_data_dir REG_DATA_DIR] [--output_dir OUTPUT_DIR]
                   [--output_name OUTPUT_NAME]
                   [--huggingface_repo_id HUGGINGFACE_REPO_ID]
                   [--huggingface_repo_type HUGGINGFACE_REPO_TYPE]
                   [--huggingface_path_in_repo HUGGINGFACE_PATH_IN_REPO]
                   [--huggingface_token HUGGINGFACE_TOKEN]
                   [--huggingface_repo_visibility HUGGINGFACE_REPO_VISIBILITY]
                   [--save_state_to_huggingface] [--resume_from_huggingface]
                   [--async_upload] [--save_precision {None,float,fp16,bf16}]
                   [--save_every_n_epochs SAVE_EVERY_N_EPOCHS]
                   [--save_every_n_steps SAVE_EVERY_N_STEPS]
                   [--save_n_epoch_ratio SAVE_N_EPOCH_RATIO]
                   [--save_last_n_epochs SAVE_LAST_N_EPOCHS]
                   [--save_last_n_epochs_state SAVE_LAST_N_EPOCHS_STATE]
                   [--save_last_n_steps SAVE_LAST_N_STEPS]
                   [--save_last_n_steps_state SAVE_LAST_N_STEPS_STATE]
                   [--save_state] [--resume RESUME]
                   [--train_batch_size TRAIN_BATCH_SIZE]
                   [--max_token_length {None,150,225}] [--mem_eff_attn]
                   [--xformers] [--sdpa] [--vae VAE]
                   [--max_train_steps MAX_TRAIN_STEPS]
                   [--max_train_epochs MAX_TRAIN_EPOCHS]
                   [--max_data_loader_n_workers MAX_DATA_LOADER_N_WORKERS]
                   [--persistent_data_loader_workers] [--seed SEED]
                   [--gradient_checkpointing]
                   [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
                   [--mixed_precision {no,fp16,bf16}] [--full_fp16] [--full_bf16]
                   [--ddp_timeout DDP_TIMEOUT] [--clip_skip CLIP_SKIP]
                   [--logging_dir LOGGING_DIR]
                   [--log_with {tensorboard,wandb,all}] [--log_prefix LOG_PREFIX]
                   [--log_tracker_name LOG_TRACKER_NAME]
                   [--log_tracker_config LOG_TRACKER_CONFIG]
                   [--wandb_api_key WANDB_API_KEY] [--noise_offset NOISE_OFFSET]
                   [--multires_noise_iterations MULTIRES_NOISE_ITERATIONS]
                   [--ip_noise_gamma IP_NOISE_GAMMA]
                   [--multires_noise_discount MULTIRES_NOISE_DISCOUNT]
                   [--adaptive_noise_scale ADAPTIVE_NOISE_SCALE]
                   [--zero_terminal_snr] [--min_timestep MIN_TIMESTEP]
                   [--max_timestep MAX_TIMESTEP] [--lowram]
                   [--sample_every_n_steps SAMPLE_EVERY_N_STEPS]
                   [--sample_every_n_epochs SAMPLE_EVERY_N_EPOCHS]
                   [--sample_prompts SAMPLE_PROMPTS]
                   [--sample_sampler {ddim,pndm,lms,euler,euler_a,heun,dpm_2,dpm_2_a,dpmsolver,dpmsolver++,dpmsingle,k_lms,k_euler,k_euler_a,k_dpm_2,k_dpm_2_a}]
                   [--config_file CONFIG_FILE] [--output_config]
                   [--metadata_title METADATA_TITLE]
                   [--metadata_author METADATA_AUTHOR]
                   [--metadata_description METADATA_DESCRIPTION]
                   [--metadata_license METADATA_LICENSE]
                   [--metadata_tags METADATA_TAGS]
                   [--prior_loss_weight PRIOR_LOSS_WEIGHT]
                   [--save_model_as {None,ckpt,safetensors,diffusers,diffusers_safetensors}]
                   [--use_safetensors] [--optimizer_type OPTIMIZER_TYPE]
                   [--use_8bit_adam] [--use_lion_optimizer]
                   [--learning_rate LEARNING_RATE] [--max_grad_norm MAX_GRAD_NORM]
                   [--optimizer_args [OPTIMIZER_ARGS ...]]
                   [--lr_scheduler_type LR_SCHEDULER_TYPE]
                   [--lr_scheduler_args [LR_SCHEDULER_ARGS ...]]
                   [--lr_scheduler LR_SCHEDULER]
                   [--lr_warmup_steps LR_WARMUP_STEPS]
                   [--lr_scheduler_num_cycles LR_SCHEDULER_NUM_CYCLES]
                   [--lr_scheduler_power LR_SCHEDULER_POWER]
                   [--dataset_config DATASET_CONFIG]
                   [--min_snr_gamma MIN_SNR_GAMMA]
                   [--scale_v_pred_loss_like_noise_pred]
                   [--v_pred_like_loss V_PRED_LIKE_LOSS]
                   [--debiased_estimation_loss] [--weighted_captions]
                   [--learning_rate_te LEARNING_RATE_TE] [--no_token_padding]
                   [--stop_text_encoder_training STOP_TEXT_ENCODER_TRAINING]
train_db.py: error: unrecognized arguments: --network_alpha=1 --training_comment=daisy ridley woman --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=256 --no_half_vae
Traceback (most recent call last):
  File "C:\Users\SAL\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\SAL\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\ai\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\ai\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\ai\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 986, in launch_command
    simple_launcher(args)
  File "C:\ai\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\ai\kohya_ss\venv\Scripts\python.exe', './train_db.py', '--pretrained_model_name_or_path=C:\ai\models\checkpoints\v1-5-pruned.safetensors', '--train_data_dir=C:\ai\trainings\ridleyd_231116a_g1bc\img', '--reg_data_dir=C:\ai\trainings\ridleyd_231116a_g1bc\reg', '--resolution=512,512', '--output_dir=C:\ai\trainings\ridleyd_231116a_g1bc\model', '--logging_dir=C:\ai\trainings\ridleyd_231116a_g1bc\log', '--network_alpha=1', '--training_comment=daisy ridley woman', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=0.0004', '--unet_lr=0.0004', '--network_dim=256', '--output_name=ridleyd_231116a_g1bc', '--lr_scheduler_num_cycles=10', '--no_half_vae', '--learning_rate=0.0004', '--lr_scheduler=constant', '--train_batch_size=1', '--max_train_steps=10000', '--save_every_n_epochs=1', '--mixed_precision=bf16', '--save_precision=bf16', '--caption_extension=.txt', '--cache_latents', '--cache_latents_to_disk', '--optimizer_type=Adafactor', '--max_data_loader_n_workers=0', '--bucket_reso_steps=64', '--bucket_no_upscale', '--noise_offset=0.0']' returned non-zero exit status 2.

(venv) C:\ai\kohya_ss>
```

bmaltais commented 12 months ago

I see a trailing " at the end of the command... does it work if you remove it?

FugueSegue commented 12 months ago

> I see a trailing " at the end of the command... does it work if you remove it?

Unfortunately, it did not. To the best of my knowledge, I don't believe the formatting of the text within the batch file is the problem. In the past, I had been using a nearly identical script to generate the batch files. But recently I've been having problems, starting with my discovery that SD15 LoRA extraction from checkpoints does not work. Training and extraction worked fine in the past, but not anymore.

I've been reading many Issues posts, both here and at the kohya-ss/sd-scripts repo. I understand that there seem to be major problems with training, LoRA extraction, and perhaps other things. I wish I could be of better technical help. I'll try to answer questions as best I can.

I can tell you that yesterday I was able to train an SD15 LoRA using the GUI interface, and it seemed to produce good results. Months ago, I worked out a technique where I have A4 generate head shots from each LoRA produced during the training and test them against the originals with DeepFace. The results of this training were satisfactory.

Today, as I had done in the past with Dreambooth training, I wrote a script that generates the batch file. But no matter what I do, the batch file it produces doesn't work. I exactly replicated the command the GUI generated when I did the training manually yesterday. I am mystified.

FugueSegue commented 11 months ago

I have discovered the problem. It was an error on my part.

The command I had written in my batch file was for training an SD 1.5 LoRA, but I used "./train_db.py", which is the script intended for Dreambooth training. I should have used "./train_network.py" instead, because that is the script intended for SD 1.5 LoRA training. That also explains the error message: the rejected arguments (--network_module, --network_dim, --network_alpha, and so on) are LoRA network options that train_db.py's parser simply doesn't define.
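For anyone who lands here with the same error, the fix is essentially a one-word change. Below is a sketch of the corrected batch file, using the same paths and hyperparameters as my original command above (they are specific to my machine, so substitute your own); I've also split the single chained cmd /k line into separate activate/cd/launch lines with caret continuations, purely for readability:

```bat
rem Sketch of the corrected launcher: "./train_network.py" (the LoRA training
rem script) in place of "./train_db.py". Paths and values are copied from the
rem batch file quoted above and are placeholders for your own setup.
call C:\ai\kohya_ss\venv\Scripts\activate.bat
cd /d C:\ai\kohya_ss
accelerate launch --num_cpu_threads_per_process=2 "./train_network.py" ^
  --pretrained_model_name_or_path="C:\ai\models\checkpoints\v1-5-pruned.safetensors" ^
  --train_data_dir="C:\ai\trainings\ridleyd_231116a_g1bc\img" ^
  --reg_data_dir="C:\ai\trainings\ridleyd_231116a_g1bc\reg" ^
  --resolution="512,512" ^
  --output_dir="C:\ai\trainings\ridleyd_231116a_g1bc\model" ^
  --logging_dir="C:\ai\trainings\ridleyd_231116a_g1bc\log" ^
  --network_alpha="1" ^
  --training_comment="daisy ridley woman" ^
  --save_model_as=safetensors ^
  --network_module=networks.lora ^
  --text_encoder_lr=0.0004 ^
  --unet_lr=0.0004 ^
  --network_dim=256 ^
  --output_name="ridleyd_231116a_g1bc" ^
  --lr_scheduler_num_cycles="10" ^
  --no_half_vae ^
  --learning_rate="0.0004" ^
  --lr_scheduler="constant" ^
  --train_batch_size="1" ^
  --max_train_steps="10000" ^
  --save_every_n_epochs="1" ^
  --mixed_precision="bf16" ^
  --save_precision="bf16" ^
  --caption_extension=".txt" ^
  --cache_latents ^
  --cache_latents_to_disk ^
  --optimizer_type="Adafactor" ^
  --max_data_loader_n_workers="0" ^
  --bucket_reso_steps=64 ^
  --bucket_no_upscale ^
  --noise_offset=0.0
```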

I wasn't aware of the difference until I read the documentation.
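In hindsight, there was a quick way to catch this: search each script's --help output for one of the rejected flags. A minimal check, assuming the venv is active and the working directory is C:\ai\kohya_ss:

```bat
rem Prints nothing: train_db.py's parser does not define the LoRA network flags.
python train_db.py --help | findstr /C:"--network_module"

rem Prints the option line: train_network.py is the script that defines them.
python train_network.py --help | findstr /C:"--network_module"
```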

"If all else fails, read the directions."