error while training - Githubissues

bolala520 commented 1 year ago

Hello. The installation went fine, but when I start training a LoRA I get an error. The setup looks correct and everything matches; the configuration and output are as follows:

Folder 100_menglan : 600 steps
max_train_steps = 300
stop_text_encoder_training = 0
lr_warmup_steps = 0
accelerate launch --num_cpu_threads_per_process=2 "train_db.py" --enable_bucket --pretrained_model_name_or_path="E:/novelai-webui-aki-v3A/models/Stable-diffusion/dreamshaper_331BakedVae.safetensors" --train_data_dir="D:/lora/img/" --resolution=512,512 --output_dir="E:/novelai-webui-aki-v3A/embeddings" --logging_dir="D:/lora/log" --save_model_as=safetensors --output_name="menglan_v1.0" --max_data_loader_n_workers="1" --learning_rate="0.0001" --lr_scheduler="constant" --train_batch_size="2" --max_train_steps="300" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --seed="1234" --caption_extension=".txt" --cache_latents --optimizer_type="AdamW8bit" --max_data_loader_n_workers="1" --clip_skip=2 --bucket_reso_steps=64 --xformers --bucket_no_upscale
NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [license.insydium.net]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [license.insydium.net]:29500 (system error: 10049 - The requested address is not valid in its context.).
Could not find module 'E:\LORA\kohya_ss\venv\Lib\site-packages\xformers\_C.pyd' (or one of its dependencies). Try using the full path with constructor syntax.
WARNING:root:WARNING: Could not find module 'E:\LORA\kohya_ss\venv\Lib\site-packages\xformers\_C.pyd' (or one of its dependencies). Try using the full path with constructor syntax.
Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop
prepare tokenizer
prepare images.
found directory D:\lora\img\100_menglan contains 6 image files
600 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 2
  resolution: (512, 512)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 1024
  bucket_reso_steps: 64
  bucket_no_upscale: True

  [Subset 0 of Dataset 0]
    image_dir: "D:\lora\img\100_menglan"
    image_count: 6
    num_repeats: 100
    shuffle_caption: False
    keep_tokens: 0
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: menglan
    caption_extension: .txt

[Dataset 0]
loading image sizes.
100%|████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 859.17it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数（繰り返し回数を含む）
bucket 0: resolution (384, 512), count: 100
bucket 1: resolution (384, 640), count: 200
bucket 2: resolution (448, 448), count: 100
bucket 3: resolution (448, 512), count: 100
bucket 4: resolution (512, 448), count: 100
mean ar error (without repeats): 0.03593309248141693
prepare accelerator
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [license.insydium.net]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [license.insydium.net]:29500 (system error: 10049 - The requested address is not valid in its context.).
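As a sanity check on the numbers in the log above, the image and step counts can be reproduced with a few lines of Python. This is only a sketch: the values are copied from the log, `steps_per_epoch` is a hypothetical helper, and it assumes batches are formed within each bucket rather than across buckets.

```python
import math

# Values copied from the log output above.
image_count = 6
num_repeats = 100
batch_size = 2
bucket_counts = {
    (384, 512): 100,
    (384, 640): 200,
    (448, 448): 100,
    (448, 512): 100,
    (512, 448): 100,
}


def steps_per_epoch(counts, batch_size):
    """Hypothetical helper: number of batches in one pass over the data.

    Assumption: batches are built per bucket, so a partially filled
    last batch in a bucket still counts as one step.
    """
    return sum(math.ceil(n / batch_size) for n in counts.values())


total_images = image_count * num_repeats

# The per-bucket counts should add up to the "600 train images" line.
assert total_images == sum(bucket_counts.values())

print(total_images, steps_per_epoch(bucket_counts, batch_size))
```

With these inputs the script prints `600 300`, which agrees with the "600 train images with repeating" line and with `max_train_steps = 300`, so the dataset setup itself looks consistent.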
Traceback (most recent call last):
  File "E:\LORA\kohya_ss\train_db.py", line 427, in <module>
    train(args)
  File "E:\LORA\kohya_ss\train_db.py", line 89, in train
    accelerator, unwrap_model = train_util.prepare_accelerator(args)
  File "E:\LORA\kohya_ss\library\train_util.py", line 2692, in prepare_accelerator
    accelerator = Accelerator(
  File "E:\LORA\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 308, in __init__
    self.state = AcceleratorState(
  File "E:\LORA\kohya_ss\venv\lib\site-packages\accelerate\state.py", line 150, in __init__
    torch.distributed.init_process_group(backend="nccl", **kwargs)
  File "E:\LORA\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 895, in init_process_group
    default_pg = _new_process_group_helper(
  File "E:\LORA\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 998, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 7368) of binary: E:\LORA\kohya_ss\venv\Scripts\python.exe
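The traceback ends in `torch.distributed.init_process_group(backend="nccl", ...)`, and the error says the installed PyTorch build has no NCCL support, which, as far as I know, is the normal situation for PyTorch builds on Windows. A minimal diagnostic sketch, to be run inside the same venv, is shown below. `nccl_status` and `pick_backend` are hypothetical helper names for illustration, not part of kohya_ss or accelerate; the only PyTorch calls used are `torch.distributed.is_available()` and `torch.distributed.is_nccl_available()`.

```python
import sys


def nccl_status():
    """Return (torch_found, distributed_available, nccl_available)."""
    try:
        import torch.distributed as dist
    except (ImportError, OSError):
        # torch is missing or failed to load in this environment.
        return (False, False, False)
    if not dist.is_available():
        # This build was compiled without the distributed package.
        return (True, False, False)
    return (True, True, dist.is_nccl_available())


def pick_backend(nccl_available, platform=sys.platform):
    """Hypothetical helper: choose NCCL only where it can be used.

    Assumption: on Windows ("win32") NCCL is treated as unusable and
    the gloo backend is chosen as the fallback.
    """
    if nccl_available and not platform.startswith("win"):
        return "nccl"
    return "gloo"


if __name__ == "__main__":
    found, dist_ok, nccl_ok = nccl_status()
    print("torch found:", found)
    print("torch.distributed available:", dist_ok)
    print("NCCL available:", nccl_ok)
    print("backend this sketch would pick:", pick_backend(nccl_ok))
```

If this reports that NCCL is not available, the script is being launched in a distributed mode that this build cannot serve. One thing that may be worth checking, as an assumption rather than a confirmed fix, is whether `accelerate config` was answered with a multi-GPU or distributed setup on a single-GPU Windows machine.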

bmaltais commented 1 year ago

You should report this issue directly to kohya, as it is related to his Python code and not the GUI. I can't really help with this error; I have never seen it before.

oliverban commented 1 year ago

I'm getting the same in the most recent version. Fresh install with torch 2. Haven't seen this before.

bmaltais / kohya_ss

error while training #611