RuntimeError: Distributed package doesn't have NCCL built in / The client socket has failed to connect to [DESKTOP-OSLP67M]:29500 (system error: 10049 - unknown error). #1402
Hello, I have tried many ways to run LoRA training, and every attempt fails.
In the setup:
THIS MACHINE
MULTI-GPU
NUM_MACHINE : 1
Dynamo : NO
DeepSpeed : NO
FullyShardedDataParallel: NO
Megatron-LM: NO
How many GPUs: I tried 2 (I have a 4070 Ti + 3080) and it failed; I tried 1 and it also failed.
Which GPUs by ID: I tried "all" and it failed; I also tried 00000000:01:00.0 (my 4070 Ti) and it failed.
Mixed Precision : bf16
I'm running Python 3.10.9 on Windows 10.
Intel i9-10900F
32 GB RAM
4070 Ti, 12 GB VRAM
3080, 10 GB VRAM
Here is the log:
00:41:00-318240 INFO Start training LoRA Standard ...
00:41:00-319741 INFO Checking for duplicate image filenames in training data directory...
00:41:00-321242 INFO Valid image folder names found in: C:/SylvainTrain\img
00:41:00-322740 INFO Valid image folder names found in: C:/SylvainTrain\reg
00:41:00-324241 INFO Folder 20_Dave Grohl Man: 18 images found
00:41:00-325741 INFO Folder 20_Dave Grohl Man: 360 steps
00:41:00-326741 WARNING Regularisation images are used... Will double the number of steps required...
00:41:00-328243 INFO Total steps: 360
00:41:00-329241 INFO Train batch size: 1
00:41:00-330240 INFO Gradient accumulation steps: 1
00:41:00-332244 INFO Epoch: 10
00:41:00-333240 INFO Regulatization factor: 2
00:41:00-334240 INFO max_train_steps (360 / 1 / 1 * 10 * 2) = 7200
00:41:00-335240 INFO stop_text_encoder_training = 0
00:41:00-336241 INFO lr_warmup_steps = 0
00:41:00-337241 INFO Saving training config to C:/SylvainTrain\model\last_20230818-004100.json...
00:41:00-338741 INFO accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket
--min_bucket_reso=256 --max_bucket_reso=2048
--pretrained_model_name_or_path="C:/Users/wildc/Downloads/sd_xl_base_1.0_0.9vae.safetensors"
--train_data_dir="C:/SylvainTrain\img" --reg_data_dir="C:/SylvainTrain\reg"
--resolution="768,768" --output_dir="C:/SylvainTrain\model" --logging_dir="C:/SylvainTrain\log"
--network_alpha="12" --save_model_as=safetensors --network_module=networks.lora
--text_encoder_lr=0.0003 --unet_lr=0.0003 --network_dim=24 --output_name="last"
--lr_scheduler_num_cycles="10" --no_half_vae --full_bf16 --learning_rate="0.0003"
--lr_scheduler="constant" --train_batch_size="1" --max_train_steps="7200"
--save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16"
--caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor"
--optimizer_args scale_parameter=False relative_step=False warmup_init=False
--max_data_loader_n_workers="0" --bucket_reso_steps=64 --gradient_checkpointing --xformers
--bucket_no_upscale --noise_offset=0.0
[00:41:06] WARNING NOTE: Redirects are currently not supported in Windows or MacOs. redirects.py:27
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-OSLP67M]:29500 (system error: 10049 - unknown error).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-OSLP67M]:29500 (system error: 10049 - unknown error).
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
prepare tokenizers
Using DreamBooth method.
prepare images.
found directory C:\SylvainTrain\img\20_Dave Grohl Man contains 18 image files
found directory C:\SylvainTrain\reg\1_Man contains 1000 image files
No caption file found for 1000 images. Training will continue without captions for these images. If class token exists, it will be used. / 1000枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学 習を続行します。class tokenが存在する場合はそれを使います。
C:\SylvainTrain\reg\1_Man\man_0001.jpg
C:\SylvainTrain\reg\1_Man\man_0002.jpg
C:\SylvainTrain\reg\1_Man\man_0003.jpg
C:\SylvainTrain\reg\1_Man\man_0004.jpg
C:\SylvainTrain\reg\1_Man\man_0005.jpg
C:\SylvainTrain\reg\1_Man\man_0006.jpg... and 995 more
360 train images with repeating.
1000 reg images.
some of reg images are not used / 正則化画像の数が多いので、一部使用されない正則化画像があります
[Dataset 0]
batch_size: 1
resolution: (768, 768)
enable_bucket: True
min_bucket_reso: 256
max_bucket_reso: 2048
bucket_reso_steps: 64
bucket_no_upscale: True
[Subset 0 of Dataset 0]
image_dir: "C:\SylvainTrain\img\20_Dave Grohl Man"
image_count: 18
num_repeats: 20
shuffle_caption: False
keep_tokens: 0
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: Dave Grohl Man
caption_extension: .txt
[Subset 1 of Dataset 0]
image_dir: "C:\SylvainTrain\reg\1_Man"
image_count: 1000
num_repeats: 1
shuffle_caption: False
keep_tokens: 0
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: True
class_tokens: Man
caption_extension: .txt
[Dataset 0]
loading image sizes.
100%|██████████████████████████████████████████████████████████████████████████████| 378/378 [00:00<00:00, 7485.34it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (512, 960), count: 20
bucket 1: resolution (640, 832), count: 80
bucket 2: resolution (768, 768), count: 380
bucket 3: resolution (832, 640), count: 220
bucket 4: resolution (1024, 576), count: 20
mean ar error (without repeats): 0.0012617012617012586
Warning: SDXL has been trained with noise_offset=0.0357 / SDXLはnoise_offset=0.0357で学習されています
noise_offset is set to 0.0 / noise_offsetが0.0に設定されました
preparing accelerator
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-OSLP67M]:29500 (system error: 10049 - unknown error).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-OSLP67M]:29500 (system error: 10049 - unknown error).
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ C:\Users\wildc\Downloads\KOHYA\kohya_ss\sdxl_train_network.py:174 in <module> │
│ │
│ 171 │ args = train_util.read_config_from_file(args, parser) │
│ 172 │ │
│ 173 │ trainer = SdxlNetworkTrainer() │
│ ❱ 174 │ trainer.train(args) │
│ 175 │
│ │
│ C:\Users\wildc\Downloads\KOHYA\kohya_ss\train_network.py:206 in train │
│ │
│ 203 │ │ │
│ 204 │ │ # acceleratorを準備する │
│ 205 │ │ print("preparing accelerator") │
│ ❱ 206 │ │ accelerator = train_util.prepare_accelerator(args) │
│ 207 │ │ is_main_process = accelerator.is_main_process │
│ 208 │ │ │
│ 209 │ │ # mixed precisionに対応した型を用意しておき適宜castする │
│ │
│ C:\Users\wildc\Downloads\KOHYA\kohya_ss\library\train_util.py:3700 in prepare_accelerator │
│ │
│ 3697 │ │ │ if args.wandb_api_key is not None: │
│ 3698 │ │ │ │ wandb.login(key=args.wandb_api_key) │
│ 3699 │ │
│ ❱ 3700 │ accelerator = Accelerator( │
│ 3701 │ │ gradient_accumulation_steps=args.gradient_accumulation_steps, │
│ 3702 │ │ mixed_precision=args.mixed_precision, │
│ 3703 │ │ log_with=log_with, │
│ │
│ C:\Users\wildc\Downloads\KOHYA\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py:361 in │
│ __init__ │
│ │
│ 358 │ │ │ │ │ │ self.fp8_recipe_handler = handler │
│ 359 │ │ │
│ 360 │ │ kwargs = self.init_handler.to_kwargs() if self.init_handler is not None else {} │
│ ❱ 361 │ │ self.state = AcceleratorState( │
│ 362 │ │ │ mixed_precision=mixed_precision, │
│ 363 │ │ │ cpu=cpu, │
│ 364 │ │ │ dynamo_plugin=dynamo_plugin, │
│ │
│ C:\Users\wildc\Downloads\KOHYA\kohya_ss\venv\lib\site-packages\accelerate\state.py:549 in │
│ __init__ │
│ │
│ 546 │ │ if parse_flag_from_env("ACCELERATE_USE_CPU"): │
│ 547 │ │ │ cpu = True │
│ 548 │ │ if PartialState._shared_state == {}: │
│ ❱ 549 │ │ │ PartialState(cpu, **kwargs) │
│ 550 │ │ self.__dict__.update(PartialState._shared_state) │
│ 551 │ │ self._check_initialized(mixed_precision, cpu) │
│ 552 │ │ if not self.initialized: │
│ │
│ C:\Users\wildc\Downloads\KOHYA\kohya_ss\venv\lib\site-packages\accelerate\state.py:143 in │
│ __init__ │
│ │
│ 140 │ │ │ │ │ # Special case for `TrainingArguments`, where `backend` will be `Non │
│ 141 │ │ │ │ │ if self.backend is None: │
│ 142 │ │ │ │ │ │ self.backend = "nccl" │
│ ❱ 143 │ │ │ │ │ torch.distributed.init_process_group(backend=self.backend, **kwargs) │
│ 144 │ │ │ │ self.num_processes = torch.distributed.get_world_size() │
│ 145 │ │ │ │ self.process_index = torch.distributed.get_rank() │
│ 146 │ │ │ │ self.local_process_index = int(os.environ.get("LOCAL_RANK", -1)) │
│ │
│ C:\Users\wildc\Downloads\KOHYA\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10 │
│ d.py:907 in init_process_group │
│ │
│ 904 │ │ │ # different systems (e.g. RPC) in case the store is multi-tenant. │
│ 905 │ │ │ store = PrefixStore("default_pg", store) │
│ 906 │ │ │
│ ❱ 907 │ │ default_pg = _new_process_group_helper( │
│ 908 │ │ │ world_size, │
│ 909 │ │ │ rank, │
│ 910 │ │ │ [], │
│ │
│ C:\Users\wildc\Downloads\KOHYA\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10 │
│ d.py:1013 in _new_process_group_helper │
│ │
│ 1010 │ │ │ backend_type = ProcessGroup.BackendType.GLOO │
│ 1011 │ │ elif backend_str == Backend.NCCL: │
│ 1012 │ │ │ if not is_nccl_available(): │
│ ❱ 1013 │ │ │ │ raise RuntimeError("Distributed package doesn't have NCCL " "built in") │
│ 1014 │ │ │ if pg_options is not None: │
│ 1015 │ │ │ │ assert isinstance( │
│ 1016 │ │ │ │ │ pg_options, ProcessGroupNCCL.Options │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Distributed package doesn't have NCCL built in
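The traceback shows the multi-GPU path defaulting to the "nccl" backend and PyTorch then refusing it because `is_nccl_available()` returns false. As far as I know, the official PyTorch builds for Windows do not include NCCL, while "gloo" is supported there. Below is a minimal sketch for checking this in the same venv. The helper names `pick_backend` and `nccl_available` are hypothetical, made up for illustration; only `torch.distributed.is_available()` and `torch.distributed.is_nccl_available()` are real PyTorch calls. It does not change how kohya_ss or accelerate selects a backend.

```python
import importlib.util
import sys


def pick_backend(has_nccl: bool) -> str:
    """Return "nccl" when the PyTorch build includes it, otherwise fall back to "gloo"."""
    return "nccl" if has_nccl else "gloo"


def nccl_available() -> bool:
    """Ask the installed PyTorch whether NCCL was compiled in.

    Returns False when torch is not installed or was built without
    distributed support, so the check never raises on such setups.
    """
    if importlib.util.find_spec("torch") is None:
        return False
    import torch.distributed as dist

    # is_nccl_available() only exists when distributed support is present,
    # so is_available() must be checked first.
    return dist.is_available() and dist.is_nccl_available()


if __name__ == "__main__":
    has_nccl = nccl_available()
    print(f"platform={sys.platform} nccl={has_nccl} usable_backend={pick_backend(has_nccl)}")
```

If this reports no NCCL, the multi-GPU launch will most likely keep failing at `init_process_group` for as long as "nccl" is the requested backend.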