bmaltais / kohya_ss

Apache License 2.0

LORA training does not start, it keeps crashing - W socket.cpp:663] [c10d] The client socket has failed to connect to [DESKTOP-413GD2B]:12345 #2581

Closed martindellavecchia closed 3 months ago

martindellavecchia commented 3 months ago

I built a rig for multi-GPU LoRA training, but it has never worked: kohya crashes before training starts. I wiped my entire system, since I thought it was an OS or Python issue, and reinstalled just the OS, kohya, and the latest NVIDIA drivers. I also disabled the other GPUs to rule them out as the cause, leaving just the 2080 Ti, but I keep getting this error:

`19:27:16-339951 INFO Command executed.
[2024-06-09 19:27:21,852] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [DESKTOP-413GD2B]:12345 (system error: 10049 - La dirección solicitada no es válida en este contexto.).
Traceback (most recent call last): Traceback (most recent call last): File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\utils\import_utils.py", line 1390, in _get_module File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\utils\import_utils.py", line 1390, in _get_module return importlib.import_module("." + module_name, self.name) return importlib.import_module("." + module_name, self.name) File "C:\Program Files\Python310\lib\importlib\__init__.py", line 126, in import_module

  File "C:\Program Files\Python310\lib\importlib\__init__.py", line 126, in import_module

return _bootstrap._gcd_import(name[level:], package, level) File "", line 1050, in _gcd_import return _bootstrap._gcd_import(name[level:], package, level) File "", line 1050, in _gcd_import File "", line 1027, in _find_and_load File "", line 1027, in _find_and_load File "", line 1006, in _find_and_load_unlocked File "", line 1006, in _find_and_load_unlocked File "", line 688, in _load_unlocked File "", line 688, in _load_unlocked File "", line 883, in exec_module File "", line 883, in exec_module File "", line 241, in _call_with_frames_removed File "", line 241, in _call_with_frames_removed File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\models\clip\image_processing_clip.py", line 21, in File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\models\clip\image_processing_clip.py", line 21, in from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\image_processing_utils.py", line 28, in

  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\image_processing_utils.py", line 28, in <module>

from .image_transforms import center_crop, normalize, rescale from .image_transforms import center_crop, normalize, rescale File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\image_transforms.py", line 47, in

  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\image_transforms.py", line 47, in <module>

import tensorflow as tf import tensorflow as tf File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\tensorflow__init__.py", line 42, in

  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\tensorflow\__init__.py", line 42, in <module>

from tensorflow.python import tf2 as _tf2 from tensorflow.python import tf2 as _tf2

File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\tensorflow\python\tf2.py", line 21, in File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\tensorflow\python\tf2.py", line 21, in from tensorflow.python.platform import _pywrap_tf2from tensorflow.python.platform import _pywrap_tf2

ImportErrorImportError: DLL load failed while importing _pywrap_tf2: Error en una rutina de inicialización de biblioteca de vínculos dinámicos (DLL).: DLL load failed while importing _pywrap_tf2: Error en una rutina de inicialización de biblioteca de vínculos dinámicos (DLL). The above exception was the direct cause of the following exception:

Traceback (most recent call last):

The above exception was the direct cause of the following exception:

File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 710, in _get_module Traceback (most recent call last): File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 710, in _get_module return importlib.import_module("." + module_name, self.name) return importlib.import_module("." + module_name, self.name) File "C:\Program Files\Python310\lib\importlib__init__.py", line 126, in import_module

  File "C:\Program Files\Python310\lib\importlib\__init__.py", line 126, in import_module

return _bootstrap._gcd_import(name[level:], package, level) return _bootstrap._gcd_import(name[level:], package, level) File "", line 1050, in _gcd_import

File "", line 1027, in _find_and_load File "", line 1050, in _gcd_import File "", line 1006, in _find_and_load_unlocked File "", line 1027, in _find_and_load File "", line 688, in _load_unlocked File "", line 1006, in _find_and_load_unlocked File "", line 883, in exec_module File "", line 688, in _load_unlocked File "", line 241, in _call_with_frames_removed File "", line 883, in exec_module File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\pipelines\stable_diffusion\pipeline_stable_diffusion.py", line 20, in File "", line 241, in _call_with_frames_removed from transformers import CLIPImageProcessor, CLIPTextModel, CLIPTokenizer, CLIPVisionModelWithProjection File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\pipelines\stable_diffusion\pipeline_stable_diffusion.py", line 20, in

  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist

from transformers import CLIPImageProcessor, CLIPTextModel, CLIPTokenizer, CLIPVisionModelWithProjection File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\utils\import_utils.py", line 1381, in getattr File "", line 1075, in _handle_fromlist File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\utils\import_utils.py", line 1381, in getattr value = getattr(module, name) File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\utils\import_utils.py", line 1380, in getattr value = getattr(module, name) module = self._get_module(self._class_to_module[name])

File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\utils\import_utils.py", line 1392, in _get_module File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\utils\import_utils.py", line 1380, in getattr raise RuntimeError( RuntimeError : Failed to import transformers.models.clip.image_processing_clip because of the following error (look up to see its traceback): DLL load failed while importing _pywrap_tf2: Error en una rutina de inicialización de biblioteca de vínculos dinámicos (DLL).module = self._get_module(self._class_to_module[name])

The above exception was the direct cause of the following exception:

File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\utils\import_utils.py", line 1392, in _get_module Traceback (most recent call last): File "C:\Users\Martin\Desktop\kohya_ss\sd-scripts\train_network.py", line 21, in from library import deepspeed_utils, model_util raise RuntimeError( File "C:\Users\Martin\Desktop\kohya_ss\sd-scripts\library\model_util.py", line 13, in

RuntimeErrorfrom diffusers import AutoencoderKL, DDIMScheduler, StableDiffusionPipeline  # , UNet2DConditionModel

: File "", line 1075, in _handle_fromlist Failed to import transformers.models.clip.image_processing_clip because of the following error (look up to see its traceback): DLL load failed while importing _pywrap_tf2: Error en una rutina de inicialización de biblioteca de vínculos dinámicos (DLL). File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 701, in getattr

value = getattr(module, name)

The above exception was the direct cause of the following exception:

File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 701, in getattr Traceback (most recent call last): File "C:\Users\Martin\Desktop\kohya_ss\sd-scripts\train_network.py", line 21, in value = getattr(module, name) from library import deepspeed_utils, model_util

File "C:\Users\Martin\Desktop\kohya_ss\sd-scripts\library\model_util.py", line 13, in File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 700, in getattr from diffusers import AutoencoderKL, DDIMScheduler, StableDiffusionPipeline # , UNet2DConditionModel File "", line 1075, in _handle_fromlist module = self._get_module(self._class_to_module[name]) File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 712, in _get_module File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 701, in getattr raise RuntimeError( RuntimeErrorvalue = getattr(module, name): Failed to import diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion because of the following error (look up to see its traceback): Failed to import transformers.models.clip.image_processing_clip because of the following error (look up to see its traceback): DLL load failed while importing _pywrap_tf2: Error en una rutina de inicialización de biblioteca de vínculos dinámicos (DLL). File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 701, in getattr

value = getattr(module, name)

  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 700, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 712, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion because of the following error (look up to see its traceback):
Failed to import transformers.models.clip.image_processing_clip because of the following error (look up to see its traceback):
DLL load failed while importing _pywrap_tf2: Error en una rutina de inicialización de biblioteca de vínculos dinámicos (DLL).
[2024-06-09 19:27:43,961] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 6468) of binary: C:\Users\Martin\Desktop\kohya_ss\venv\Scripts\python.exe
Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Martin\Desktop\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\torch\distributed\run.py", line 797, in run
    elastic_launch(
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

C:/Users/Martin/Desktop/kohya_ss/sd-scripts/train_network.py FAILED

Failures:
[1]:
  time      : 2024-06-09_19:27:43
  host      : DESKTOP-413GD2B
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 6220)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2024-06-09_19:27:43
  host      : DESKTOP-413GD2B
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 6468)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

19:27:44-977402 INFO Training has ended.`

Any help will be greatly appreciated.

b-fission commented 3 months ago

That error for tensorflow looks familiar. What CPU are you using, and do you know if it supports AVX?

martindellavecchia commented 3 months ago

> That error for tensorflow looks familiar. What CPU are you using, and do you know if it supports AVX?

I am using a former mining rig. It has a Pentium Gold G5420 CPU with 20 GB of RAM. I googled it, and it does not support AVX.

The weird thing is, A1111 works perfectly fine.
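
The AVX question can also be checked locally instead of looking the CPU up. The following is only a rough sketch: `flags_have_avx` and `cpu_supports_avx` are made-up helper names, the Windows branch assumes the `IsProcessorFeaturePresent` API with feature code 39 (`PF_AVX_INSTRUCTIONS_AVAILABLE`), which older Windows builds may not recognize (they would then report no AVX), and the Linux branch assumes an x86-style `/proc/cpuinfo` with a `flags` line.

```python
import ctypes
import platform
from typing import Optional

# Windows feature code for AVX (assumed to be known by the running Windows build).
PF_AVX_INSTRUCTIONS_AVAILABLE = 39


def flags_have_avx(cpuinfo_text: str) -> bool:
    """Return True if the first 'flags' line of /proc/cpuinfo-style text lists avx."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return "avx" in value.split()
    return False


def cpu_supports_avx() -> Optional[bool]:
    """Best-effort AVX check; returns None when the platform is not handled."""
    system = platform.system()
    if system == "Linux":
        try:
            with open("/proc/cpuinfo", encoding="utf-8") as handle:
                return flags_have_avx(handle.read())
        except OSError:
            return None
    if system == "Windows":
        # ctypes.windll only exists on Windows, so it is touched only in this branch.
        kernel32 = ctypes.windll.kernel32
        return bool(kernel32.IsProcessorFeaturePresent(PF_AVX_INSTRUCTIONS_AVAILABLE))
    return None
```

Running `print(cpu_supports_avx())` with any Python on the machine gives `True`, `False`, or `None` (platform not covered by this sketch). A `False` here would be consistent with the TensorFlow DLL failing to initialize, since the official TensorFlow wheels are built with AVX instructions.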

martindellavecchia commented 3 months ago

Now I am running into a different issue:

`22:11:42-066006 INFO Command executed.
Traceback (most recent call last):
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\utils\import_utils.py", line 1390, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "C:\Program Files\Python310\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\models\clip\image_processing_clip.py", line 21, in <module>
    from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\image_processing_utils.py", line 28, in <module>
    from .image_transforms import center_crop, normalize, rescale
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\image_transforms.py", line 47, in <module>
    import tensorflow as tf
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\tensorflow\__init__.py", line 42, in <module>
    from tensorflow.python import tf2 as _tf2
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\tensorflow\python\tf2.py", line 21, in <module>
    from tensorflow.python.platform import _pywrap_tf2
ImportError: DLL load failed while importing _pywrap_tf2: Error en una rutina de inicialización de biblioteca de vínculos dinámicos (DLL).

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 710, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "C:\Program Files\Python310\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\pipelines\stable_diffusion\pipeline_stable_diffusion.py", line 20, in <module>
    from transformers import CLIPImageProcessor, CLIPTextModel, CLIPTokenizer, CLIPVisionModelWithProjection
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\utils\import_utils.py", line 1381, in __getattr__
    value = getattr(module, name)
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\utils\import_utils.py", line 1380, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\transformers\utils\import_utils.py", line 1392, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.clip.image_processing_clip because of the following error (look up to see its traceback):
DLL load failed while importing _pywrap_tf2: Error en una rutina de inicialización de biblioteca de vínculos dinámicos (DLL).

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Martin\Desktop\kohya_ss\sd-scripts\train_network.py", line 21, in <module>
    from library import deepspeed_utils, model_util
  File "C:\Users\Martin\Desktop\kohya_ss\sd-scripts\library\model_util.py", line 13, in <module>
    from diffusers import AutoencoderKL, DDIMScheduler, StableDiffusionPipeline  # , UNet2DConditionModel
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 701, in __getattr__
    value = getattr(module, name)
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 701, in __getattr__
    value = getattr(module, name)
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 700, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\diffusers\utils\import_utils.py", line 712, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion because of the following error (look up to see its traceback):
Failed to import transformers.models.clip.image_processing_clip because of the following error (look up to see its traceback):
DLL load failed while importing _pywrap_tf2: Error en una rutina de inicialización de biblioteca de vínculos dinámicos (DLL).
Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Martin\Desktop\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\Users\Martin\Desktop\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\Martin\Desktop\kohya_ss\venv\Scripts\python.exe', 'C:/Users/Martin/Desktop/kohya_ss/sd-scripts/train_network.py', '--config_file', 'C:/Users/Martin/Desktop/Training/AugustoDellaVecchia/Output\model/config_lora-20240609-221142.toml']' returned non-zero exit status 1.
22:11:55-539928 INFO Training has ended.`
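
Both tracebacks bottom out at the same point: `import tensorflow` fails while loading `_pywrap_tf2`, and everything above it (transformers, diffusers, `train_network.py`, accelerate) is just that failure propagating. One way to confirm this independently of kohya is to attempt the import in a fresh interpreter. A minimal sketch, assuming it is run with the Python from the kohya venv; `probe_import` is a hypothetical helper name, not part of any library:

```python
import subprocess
import sys
from typing import Tuple


def probe_import(module_name: str, timeout: float = 120.0) -> Tuple[bool, str]:
    """Import a module in a child interpreter; return (succeeded, last stderr line)."""
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    stderr_lines = result.stderr.strip().splitlines()
    # The last stderr line of a failed import is normally the exception message.
    last_line = stderr_lines[-1] if stderr_lines else ""
    return result.returncode == 0, last_line
```

Calling `probe_import("tensorflow")` from the venv's interpreter should return `False` together with the DLL message if TensorFlow itself is the failing piece, and `probe_import("torch")` can be used the same way to check that PyTorch still imports cleanly.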

martindellavecchia commented 3 months ago

Fixed in: https://github.com/bmaltais/kohya_ss/issues/2582