bmaltais / kohya_ss


RuntimeError: Distributed package doesn't have NCCL built in / The client socket has failed to connect to [DESKTOP-OSLP67M]:29500 (system error: 10049 - unknown error). #1402

Closed: wildcatquebec closed this issue 8 months ago

wildcatquebec commented 1 year ago

Hello, I have tried many ways to run LoRA training. In the accelerate setup I chose:

THIS MACHINE
MULTI-GPU
NUM_MACHINES: 1
Dynamo: NO
DeepSpeed: NO
FullyShardedDataParallel: NO
Megatron-LM: NO
How many GPUs: I tried 2 (I have a 4070 Ti + 3080) - fail; I tried 1 - fail
Which GPUs by ID: I tried "all" - fail; I tried 00000000:01:00.0 (my 4070 Ti) - fail
Mixed precision: bf16

I'm running Python 3.10.9 on Windows 10, with an Intel i9-10900F, 32 GB RAM, an RTX 4070 Ti (12 GB VRAM) and an RTX 3080 (10 GB VRAM).
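
For reference, a minimal diagnostic sketch (run from the kohya_ss venv) that shows which GPUs PyTorch sees and whether the NCCL and gloo distributed backends are compiled into this build; Windows wheels of PyTorch ship without NCCL:

```python
# Diagnostic sketch: list visible GPUs and check which torch.distributed
# backends this PyTorch build actually ships with.
import torch
import torch.distributed as dist

print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")

# Windows builds of PyTorch do not include NCCL, so this is expected to
# print False there; gloo is the backend that is available on Windows.
print("NCCL available:", dist.is_nccl_available())
print("Gloo available:", dist.is_gloo_available())
```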

Here is the log:

00:41:00-318240 INFO Start training LoRA Standard ...
00:41:00-319741 INFO Checking for duplicate image filenames in training data directory...
00:41:00-321242 INFO Valid image folder names found in: C:/SylvainTrain\img
00:41:00-322740 INFO Valid image folder names found in: C:/SylvainTrain\reg
00:41:00-324241 INFO Folder 20_Dave Grohl Man: 18 images found
00:41:00-325741 INFO Folder 20_Dave Grohl Man: 360 steps
00:41:00-326741 WARNING Regularisation images are used... Will double the number of steps required...
00:41:00-328243 INFO Total steps: 360
00:41:00-329241 INFO Train batch size: 1
00:41:00-330240 INFO Gradient accumulation steps: 1
00:41:00-332244 INFO Epoch: 10
00:41:00-333240 INFO Regulatization factor: 2
00:41:00-334240 INFO max_train_steps (360 / 1 / 1 * 10 * 2) = 7200
00:41:00-335240 INFO stop_text_encoder_training = 0
00:41:00-336241 INFO lr_warmup_steps = 0
00:41:00-337241 INFO Saving training config to C:/SylvainTrain\model\last_20230818-004100.json...
00:41:00-338741 INFO accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="C:/Users/wildc/Downloads/sd_xl_base_1.0_0.9vae.safetensors" --train_data_dir="C:/SylvainTrain\img" --reg_data_dir="C:/SylvainTrain\reg" --resolution="768,768" --output_dir="C:/SylvainTrain\model" --logging_dir="C:/SylvainTrain\log" --network_alpha="12" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0003 --unet_lr=0.0003 --network_dim=24 --output_name="last" --lr_scheduler_num_cycles="10" --no_half_vae --full_bf16 --learning_rate="0.0003" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="7200" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.0
[00:41:06] WARNING NOTE: Redirects are currently not supported in Windows or MacOs. redirects.py:27
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-OSLP67M]:29500 (system error: 10049 - unknown error).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-OSLP67M]:29500 (system error: 10049 - unknown error).
A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named 'triton'
prepare tokenizers
Using DreamBooth method.
prepare images.
found directory C:\SylvainTrain\img\20_Dave Grohl Man contains 18 image files
found directory C:\SylvainTrain\reg\1_Man contains 1000 image files
No caption file found for 1000 images. Training will continue without captions for these images. If class token exists, it will be used.
C:\SylvainTrain\reg\1_Man\man_0001.jpg
C:\SylvainTrain\reg\1_Man\man_0002.jpg
C:\SylvainTrain\reg\1_Man\man_0003.jpg
C:\SylvainTrain\reg\1_Man\man_0004.jpg
C:\SylvainTrain\reg\1_Man\man_0005.jpg
C:\SylvainTrain\reg\1_Man\man_0006.jpg... and 995 more
360 train images with repeating.
1000 reg images.
some of reg images are not used / since there are more regularization images than needed, some of them will not be used
[Dataset 0]
  batch_size: 1
  resolution: (768, 768)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 2048
  bucket_reso_steps: 64
  bucket_no_upscale: True

[Subset 0 of Dataset 0]
  image_dir: "C:\SylvainTrain\img\20_Dave Grohl Man"
  image_count: 18
  num_repeats: 20
  shuffle_caption: False
  keep_tokens: 0
  caption_dropout_rate: 0.0
  caption_dropout_every_n_epoches: 0
  caption_tag_dropout_rate: 0.0
  color_aug: False
  flip_aug: False
  face_crop_aug_range: None
  random_crop: False
  token_warmup_min: 1, token_warmup_step: 0
  is_reg: False
  class_tokens: Dave Grohl Man
  caption_extension: .txt

[Subset 1 of Dataset 0]
  image_dir: "C:\SylvainTrain\reg\1_Man"
  image_count: 1000
  num_repeats: 1
  shuffle_caption: False
  keep_tokens: 0
  caption_dropout_rate: 0.0
  caption_dropout_every_n_epoches: 0
  caption_tag_dropout_rate: 0.0
  color_aug: False
  flip_aug: False
  face_crop_aug_range: None
  random_crop: False
  token_warmup_min: 1, token_warmup_step: 0
  is_reg: True
  class_tokens: Man
  caption_extension: .txt

[Dataset 0]
loading image sizes.
100%|██████████████████████████████████████████████████████████████████████████████| 378/378 [00:00<00:00, 7485.34it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically
number of images (including repeats)
bucket 0: resolution (512, 960), count: 20
bucket 1: resolution (640, 832), count: 80
bucket 2: resolution (768, 768), count: 380
bucket 3: resolution (832, 640), count: 220
bucket 4: resolution (1024, 576), count: 20
mean ar error (without repeats): 0.0012617012617012586
Warning: SDXL has been trained with noise_offset=0.0357
noise_offset is set to 0.0
preparing accelerator
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-OSLP67M]:29500 (system error: 10049 - unknown error).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-OSLP67M]:29500 (system error: 10049 - unknown error).
Traceback (most recent call last):
  C:\Users\wildc\Downloads\KOHYA\kohya_ss\sdxl_train_network.py:174
    trainer.train(args)
  C:\Users\wildc\Downloads\KOHYA\kohya_ss\train_network.py:206 in train
    accelerator = train_util.prepare_accelerator(args)
  C:\Users\wildc\Downloads\KOHYA\kohya_ss\library\train_util.py:3700 in prepare_accelerator
    accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps, mixed_precision=args.mixed_precision, log_with=log_with, ...)
  C:\Users\wildc\Downloads\KOHYA\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py:361 in __init__
    self.state = AcceleratorState(mixed_precision=mixed_precision, cpu=cpu, dynamo_plugin=dynamo_plugin, ...)
  C:\Users\wildc\Downloads\KOHYA\kohya_ss\venv\lib\site-packages\accelerate\state.py:549 in __init__
    PartialState(cpu, **kwargs)
  C:\Users\wildc\Downloads\KOHYA\kohya_ss\venv\lib\site-packages\accelerate\state.py:143 in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  C:\Users\wildc\Downloads\KOHYA\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10d.py:907 in init_process_group
    default_pg = _new_process_group_helper(world_size, rank, [], ...)
  C:\Users\wildc\Downloads\KOHYA\kohya_ss\venv\lib\site-packages\torch\distributed\distributed_c10d.py:1013 in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
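
The traceback bottoms out in torch.distributed.init_process_group(backend="nccl"), and Windows builds of PyTorch do not ship NCCL, so the multi-GPU path fails before training starts. The failure can be reproduced outside kohya_ss with a few lines (a sketch assuming a Windows PyTorch build; port 29500 matches the socket warnings above):

```python
# Sketch reproducing the RuntimeError from the traceback above on a Windows
# PyTorch build, where NCCL is not compiled in but gloo is.
import torch.distributed as dist

init_kwargs = dict(init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)

try:
    dist.init_process_group(backend="nccl", **init_kwargs)
except RuntimeError as err:
    # Expected on Windows: "Distributed package doesn't have NCCL built in"
    print("nccl:", err)

# gloo is compiled into the Windows wheels, so this should succeed.
dist.init_process_group(backend="gloo", **init_kwargs)
print("gloo process group initialized")
dist.destroy_process_group()
```

As the frame from accelerate's state.py shows, a launch with more than one process defaults the backend to "nccl", so any multi-process accelerate configuration hits this error on Windows, while a single-process (non-distributed) configuration never reaches init_process_group.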

vrgz2022 commented 1 year ago

Same error, can someone help?

EthanZoneCoding commented 9 months ago

I was able to manually configure accelerate with setup.bat and use the defaults to fix the issue.
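
For anyone verifying the same fix: the answers given during the accelerate setup are written to accelerate's default config file, which can be inspected to confirm that a non-distributed, single-process configuration was saved (a small sketch; the path below is accelerate's usual default location and may differ if HF_HOME is set):

```python
# Print the accelerate config that `accelerate launch` will pick up, to confirm
# the defaults produced a single-process, non-distributed setup.
from pathlib import Path

config_path = Path.home() / ".cache" / "huggingface" / "accelerate" / "default_config.yaml"
if config_path.exists():
    print(config_path.read_text())
else:
    print(f"No config file at {config_path}; accelerate will use its built-in defaults.")
```

With something like distributed_type: NO and num_processes: 1 in that file, the NCCL code path shown in the traceback above is never reached.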