magicwang1111 closed this 8 months ago
System: Linux

Fri Jul 14 18:38:50 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   30C    P0    52W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Same problem for me. It appears `logging_dir` was deprecated in favor of `project_dir` (see https://github.com/huggingface/accelerate/issues/1619). Have you tried setting `logging_dir` instead?
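The `logging_dir` workaround suggested above can be made independent of the installed accelerate release by checking which keyword `Accelerator.__init__` actually accepts before calling it. This is a minimal sketch under stated assumptions, not kohya_ss's actual fix: `accelerator_dir_kwargs` is a hypothetical helper, and `OldAccelerator`/`NewAccelerator` are stand-in classes that only mimic an older signature (with `logging_dir`) and a newer one (with `project_dir`).

```python
import inspect
from typing import Any, Dict


def accelerator_dir_kwargs(accelerator_cls: type, log_dir: str) -> Dict[str, Any]:
    """Hypothetical helper: choose the directory keyword the given class accepts.

    Newer accelerate releases take `project_dir`; older ones only know `logging_dir`.
    """
    params = inspect.signature(accelerator_cls.__init__).parameters
    if "project_dir" in params:
        return {"project_dir": log_dir}
    if "logging_dir" in params:
        return {"logging_dir": log_dir}
    return {}  # neither keyword exists: pass nothing rather than raise TypeError


# Stand-ins for the two signatures (assumptions, not the real accelerate classes).
class OldAccelerator:
    def __init__(self, gradient_accumulation_steps=1, mixed_precision=None,
                 log_with=None, logging_dir=None):
        self.logging_dir = logging_dir


class NewAccelerator:
    def __init__(self, gradient_accumulation_steps=1, mixed_precision=None,
                 log_with=None, project_dir=None):
        self.project_dir = project_dir


if __name__ == "__main__":
    for cls in (OldAccelerator, NewAccelerator):
        kwargs = accelerator_dir_kwargs(cls, "./logs")
        cls(mixed_precision="fp16", **kwargs)  # no unexpected-keyword error either way
        print(cls.__name__, kwargs)
```

In `prepare_accelerator` the same idea would mean passing `**accelerator_dir_kwargs(Accelerator, some_dir)` instead of a hard-coded `project_dir=...`. Upgrading accelerate in the venv to a release whose `Accelerator` accepts `project_dir` is likely the more direct fix; the sketch only illustrates the fallback.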
18:36:25-585000 INFO Start training LoRA Kohya DyLoRA ...
18:36:25-586000 INFO Valid image folder names found in: /home/wangwang/ai/stable-diffusion-webui/train/0618
18:36:25-593201 INFO Folder 5_aliproduct: 6905 images found
18:36:25-593957 INFO Folder 5_aliproduct: 34525 steps
18:36:25-594630 INFO Total steps: 34525
18:36:25-595247 INFO Train batch size: 32
18:36:25-595843 INFO Gradient accumulation steps: 1.0
18:36:25-596458 INFO Epoch: 10
18:36:25-597038 INFO Regulatization factor: 1
18:36:25-597649 INFO max_train_steps (34525 / 32 / 1.0 * 10 * 1) = 10790
18:36:25-598415 INFO stop_text_encoder_training = 0
18:36:25-599029 INFO lr_warmup_steps = 1079
18:36:25-599787 INFO Saving training config to /home/wangwang/ai/stable-diffusion-webui/stable-diffusion-webui/models/LyCORIS/aliproduct0714_20230714-183625.json...
18:36:25-600760 INFO accelerate launch --num_cpu_threads_per_process=2 "./train_network.py" --enable_bucket --pretrained_model_name_or_path="/home/wangwang/ai/stable-diffusion-webui/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned.ckpt" --train_data_dir="/home/wangwang/ai/stable-diffusion-webui/train/0618" --resolution="640,960" --output_dir="/home/wangwang/ai/stable-diffusion-webui/stable-diffusion-webui/models/LyCORIS" --network_alpha="256" --save_model_as=safetensors --network_module=networks.dylora --network_args conv_dim="32" conv_alpha="32" unit="8" rank_dropout="0.1" --text_encoder_lr=2e-06 --unet_lr=2e-05 --network_dim=256 --output_name="aliproduct0714" --lr_scheduler_num_cycles="10" --network_dropout="0.1" --learning_rate="2e-05" --lr_scheduler="constant_with_warmup" --lr_warmup_steps="1079" --train_batch_size="32" --max_train_steps="10790" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --seed="1234" --cache_latents --cache_latents_to_disk --optimizer_type="Prodigy" --max_data_loader_n_workers="0" --max_token_length=225 --resume="/home/wangwang/test/kohya_ss/last_staus" --keep_tokens="1" --bucket_reso_steps=64 --save_state --shuffle_caption --gradient_checkpointing --full_fp16 --xformers --bucket_no_upscale --noise_offset=0.1 --wandb_api_key="22ddbffd5936bbb30f5c8404cf885890885514cf" --sample_sampler=euler_a --sample_prompts="/home/wangwang/ai/stable-diffusion-webui/stable-diffusion-webui/models/LyCORIS/sample/prompt.txt" --sample_every_n_epochs="1" --sample_every_n_steps="40"
2023-07-14 18:36:26.807469: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-07-14 18:36:26.845291: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-14 18:36:27.377273: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[18:36:28] WARNING The following values were not passed to `accelerate launch` and had defaults used instead: launch.py:1088
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
	To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2023-07-14 18:36:30.041045: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
prepare tokenizer
update token length: 225
Using DreamBooth method.
prepare images.
found directory /home/wangwang/ai/stable-diffusion-webui/train/0618/5_aliproduct contains 6905 image files
No caption file found for 6905 images. Training will continue without captions for these images. If class token exists, it will be used.
/home/wangwang/ai/stable-diffusion-webui/train/0618/5_aliproduct/0.png
/home/wangwang/ai/stable-diffusion-webui/train/0618/5_aliproduct/1.png
/home/wangwang/ai/stable-diffusion-webui/train/0618/5_aliproduct/10.png
/home/wangwang/ai/stable-diffusion-webui/train/0618/5_aliproduct/100.png
/home/wangwang/ai/stable-diffusion-webui/train/0618/5_aliproduct/1000.png
/home/wangwang/ai/stable-diffusion-webui/train/0618/5_aliproduct/1001.png
... and 6900 more
34525 train images with repeating.
0 reg images.
no regularization images
[Dataset 0]
  batch_size: 32
  resolution: (640, 960)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 1024
  bucket_reso_steps: 64
  bucket_no_upscale: True

  [Subset 0 of Dataset 0]
    image_dir: "/home/wangwang/ai/stable-diffusion-webui/train/0618/5_aliproduct"
    image_count: 6905
    num_repeats: 5
    shuffle_caption: True
    keep_tokens: 1
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: aliproduct
    caption_extension: .caption
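The `accelerate launch` warning in this log is unrelated to the crash (the script itself fails later with the `project_dir` TypeError), but it can be silenced by passing the four values it lists explicitly. A minimal sketch of assembling such a command in Python, assuming one machine with one GPU as in the nvidia-smi output; `build_launch_cmd` is a hypothetical helper, the script arguments are abbreviated, and choosing `fp16` for the launcher is an assumption meant to match the script's own `--mixed_precision="fp16"` (`no` would reproduce the defaults exactly).

```python
import shlex
from typing import List


def build_launch_cmd(script: str, script_args: List[str]) -> List[str]:
    """Hypothetical helper: an `accelerate launch` command that sets every value the warning lists."""
    launcher_flags = [
        "--num_processes=1",               # one training process (single A100)
        "--num_machines=1",                # single node
        "--mixed_precision=fp16",          # chosen to match the script's own flag
        "--dynamo_backend=no",             # same as the default the warning reported
        "--num_cpu_threads_per_process=2", # kept from the original command
    ]
    return ["accelerate", "launch", *launcher_flags, script, *script_args]


if __name__ == "__main__":
    # Script arguments abbreviated for illustration.
    cmd = build_launch_cmd("./train_network.py", ["--train_batch_size=32", "--mixed_precision=fp16"])
    print(shlex.join(cmd))
```

Running `accelerate config` once, as the warning itself suggests, is the alternative: it stores these values in a config file so they need not be repeated on every launch.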
[Dataset 0]
loading image sizes.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6905/6905 [00:00<00:00, 35407.84it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically
number of images (including repeats)
bucket 0: resolution (512, 512), count: 135
bucket 1: resolution (576, 896), count: 5
bucket 2: resolution (640, 960), count: 32180
bucket 3: resolution (768, 768), count: 2205
mean ar error (without repeats): 3.5003452816631544e-06
preparing accelerator
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/wangwang/test/kohya_ss/./train_network.py:974 in <module> │
│ │
│ 971 │ args = train_util.read_config_from_file(args, parser) │
│ 972 │ │
│ 973 │ trainer = NetworkTrainer() │
│ ❱ 974 │ trainer.train(args) │
│ 975 │
│ │
│ /home/wangwang/test/kohya_ss/./train_network.py:205 in train │
│ │
│ 202 │ │ │
│ 203 │ │ # prepare the accelerator │
│ 204 │ │ print("preparing accelerator") │
│ ❱ 205 │ │ accelerator = train_util.prepare_accelerator(args) │
│ 206 │ │ is_main_process = accelerator.is_main_process │
│ 207 │ │ │
│ 208 │ │ # prepare dtypes for mixed precision and cast as needed │
│ │
│ /home/wangwang/test/kohya_ss/library/train_util.py:3266 in prepare_accelerator │
│ │
│ 3263 │ │ │ if args.wandb_api_key is not None: │
│ 3264 │ │ │ │ wandb.login(key=args.wandb_api_key) │
│ 3265 │ │
│ ❱ 3266 │ accelerator = Accelerator( │
│ 3267 │ │ gradient_accumulation_steps=args.gradient_accumulation_steps, │
│ 3268 │ │ mixed_precision=args.mixed_precision, │
│ 3269 │ │ log_with=log_with, │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: Accelerator.__init__() got an unexpected keyword argument 'project_dir'
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/wangwang/test/kohya_ss/venv/bin/accelerate:8 in <module> │
│ │
│ 5 from accelerate.commands.accelerate_cli import main │
│ 6 if __name__ == '__main__': │
│ 7 │ sys.argv[0] = re.sub(r'(-script.pyw|.exe)?$', '', sys.argv[0]) │
│ ❱ 8 │ sys.exit(main()) │
│ 9 │
│ │
│ /home/wangwang/test/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cl │
│ i.py:45 in main │
│ │
│ 42 │ │ exit(1) │
│ 43 │ │
│ 44 │ # Run │
│ ❱ 45 │ args.func(args) │
│ 46 │
│ 47 │
│ 48 if __name__ == "__main__": │
│ │
│ /home/wangwang/test/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py:110 │
│ 4 in launch_command │
│ │
│ 1101 │ elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA │
│ 1102 │ │ sagemaker_launcher(defaults, args) │
│ 1103 │ else: │
│ ❱ 1104 │ │ simple_launcher(args) │
│ 1105 │
│ 1106 │
│ 1107 def main(): │
│ │
│ /home/wangwang/test/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py:567 │
│ in simple_launcher │
│ │
│ 564 │ process = subprocess.Popen(cmd, env=current_env) │
│ 565 │ process.wait() │
│ 566 │ if process.returncode != 0: │
│ ❱ 567 │ │ raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) │
│ 568 │
│ 569 │
│ 570 def multi_gpu_launcher(args): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
CalledProcessError: Command '['/home/wangwang/test/kohya_ss/venv/bin/python', './train_network.py', '--enable_bucket',
'--pretrained_model_name_or_path=/home/wangwang/ai/stable-diffusion-webui/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned.ckpt',
'--train_data_dir=/home/wangwang/ai/stable-diffusion-webui/train/0618', '--resolution=640,960',
'--output_dir=/home/wangwang/ai/stable-diffusion-webui/stable-diffusion-webui/models/LyCORIS', '--network_alpha=256', '--save_model_as=safetensors',
'--network_module=networks.dylora', '--network_args', 'conv_dim=32', 'conv_alpha=32', 'unit=8', 'rank_dropout=0.1', '--text_encoder_lr=2e-06',
'--unet_lr=2e-05', '--network_dim=256', '--output_name=aliproduct0714', '--lr_scheduler_num_cycles=10', '--network_dropout=0.1', '--learning_rate=2e-05',
'--lr_scheduler=constant_with_warmup', '--lr_warmup_steps=1079', '--train_batch_size=32', '--max_train_steps=10790', '--save_every_n_epochs=1',
'--mixed_precision=fp16', '--save_precision=fp16', '--seed=1234', '--cache_latents', '--cache_latents_to_disk', '--optimizer_type=Prodigy',
'--max_data_loader_n_workers=0', '--max_token_length=225', '--resume=/home/wangwang/test/kohya_ss/last_staus', '--keep_tokens=1',
'--bucket_reso_steps=64', '--save_state', '--shuffle_caption', '--gradient_checkpointing', '--full_fp16', '--xformers', '--bucket_no_upscale',
'--noise_offset=0.1', '--wandb_api_key=22ddbffd5936bbb30f5c8404cf885890885514cf', '--sample_sampler=euler_a',
'--sample_prompts=/home/wangwang/ai/stable-diffusion-webui/stable-diffusion-webui/models/LyCORIS/sample/prompt.txt', '--sample_every_n_epochs=1',
'--sample_every_n_steps=40']' returned non-zero exit status 1.
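The step counts the GUI printed at the start of the log can be reproduced by hand: 6905 images × 5 repeats gives 34525 steps, which also equals the sum of the four bucket counts. The sketch below assumes the GUI divides by batch size and gradient accumulation, multiplies by epochs and the regularization factor, then rounds up, and that the warmup is 10% of the training steps; both are inferences from the logged numbers, not taken from the GUI source, and the function names are hypothetical.

```python
import math


def max_train_steps(images: int, repeats: int, batch_size: int,
                    grad_accum: int, epochs: int, reg_factor: int) -> int:
    """Assumed formula: ceil(images * repeats / batch / accum * epochs * reg_factor)."""
    total_steps = images * repeats  # "Folder 5_aliproduct: 34525 steps"
    return math.ceil(total_steps / batch_size / grad_accum * epochs * reg_factor)


def warmup_steps(train_steps: int, warmup_percent: int) -> int:
    # Assumed: warmup given as a percentage of the training steps, rounded up.
    return math.ceil(train_steps * warmup_percent / 100)


if __name__ == "__main__":
    steps = max_train_steps(6905, 5, 32, 1, 10, 1)
    print(steps)                    # 10790
    print(warmup_steps(steps, 10))  # 1079
    # The bucket counts from the log add back up to the repeated image total.
    print(135 + 5 + 32180 + 2205)   # 34525
```

With a batch size of 32, one epoch is 34525 / 32 ≈ 1078.9 batches, so ten epochs round up to 10790 optimizer steps.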