bmaltais / kohya_ss

Fresh install on Linux doesn't work #810

Closed. kuriot closed this issue 1 year ago.

kuriot commented 1 year ago

Hello.

Steps to install:

git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss
git checkout v21.5.11
./setup.sh

Output when I do accelerate config:

$ . venv/bin/activate
$ accelerate config

2023-05-16 08:38:40.243176: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-16 08:38:40.382942: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-05-16 08:38:41.001028: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rocm/lib:/opt/rocm/lib64:/opt/rocm/profiler/lib:/opt/rocm/profiler/lib64:/opt/rocm/opencl/lib:/opt/rocm/hip/lib:/opt/rocm/opencl/lib64:/opt/rocm/hip/lib64
2023-05-16 08:38:41.001136: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rocm/lib:/opt/rocm/lib64:/opt/rocm/profiler/lib:/opt/rocm/profiler/lib64:/opt/rocm/opencl/lib:/opt/rocm/hip/lib:/opt/rocm/opencl/lib64:/opt/rocm/hip/lib64
2023-05-16 08:38:41.001150: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
In which compute environment are you running?
This machine
Which type of machine are you using?
No distributed training
Do you want to run your training on CPU only (even if a GPU / Apple Silicon device is available)? [yes/NO]:
Do you wish to optimize your script with torch dynamo?[yes/NO]:
Do you want to use DeepSpeed? [yes/NO]:
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:
Do you wish to use FP16 or BF16 (mixed precision)?
fp16
accelerate configuration saved at /home/user/.cache/huggingface/accelerate/default_config.yaml
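
The answers above are written to that YAML file; a non-interactive equivalent would look roughly like the sketch below (keys assumed from Accelerate defaults of that era, not the exact file produced by this run):

cat > ~/.cache/huggingface/accelerate/default_config.yaml <<'EOF'
compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
use_cpu: false
EOF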

When I launch training:

Validating that requirements are satisfied.
All requirements satisfied.
headless: False
Load CSS...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Folder 10_test: 6 images found
Folder 10_test: 60 steps
max_train_steps = 60
stop_text_encoder_training = 0
lr_warmup_steps = 6
accelerate launch --num_cpu_threads_per_process=2 "train_network.py" --enable_bucket --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" --train_data_dir="/tmp/test/img" --resolution=512,512 --output_dir="/tmp/test/model" --logging_dir="/tmp/test/log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=5e-05 --unet_lr=0.0001 --network_dim=8 --output_name="test" --lr_scheduler_num_cycles="1" --learning_rate="0.0001" --lr_scheduler="cosine" --lr_warmup_steps="6" --train_batch_size="1" --max_train_steps="60" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --cache_latents --optimizer_type="AdamW8bit" --max_data_loader_n_workers="0" --bucket_reso_steps=64 --xformers --bucket_no_upscale
2023-05-16 08:42:00.313872: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-16 08:42:00.450527: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-05-16 08:42:00.995113: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rocm/lib:/opt/rocm/lib64:/opt/rocm/profiler/lib:/opt/rocm/profiler/lib64:/opt/rocm/opencl/lib:/opt/rocm/hip/lib:/opt/rocm/opencl/lib64:/opt/rocm/hip/lib64
2023-05-16 08:42:00.995198: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rocm/lib:/opt/rocm/lib64:/opt/rocm/profiler/lib:/opt/rocm/profiler/lib64:/opt/rocm/opencl/lib:/opt/rocm/hip/lib:/opt/rocm/opencl/lib64:/opt/rocm/hip/lib64
2023-05-16 08:42:00.995209: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-05-16 08:42:02.857392: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-16 08:42:02.996250: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-05-16 08:42:03.554194: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rocm/lib:/opt/rocm/lib64:/opt/rocm/profiler/lib:/opt/rocm/profiler/lib64:/opt/rocm/opencl/lib:/opt/rocm/hip/lib:/opt/rocm/opencl/lib64:/opt/rocm/hip/lib64
2023-05-16 08:42:03.554290: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rocm/lib:/opt/rocm/lib64:/opt/rocm/profiler/lib:/opt/rocm/profiler/lib64:/opt/rocm/opencl/lib:/opt/rocm/hip/lib:/opt/rocm/opencl/lib64:/opt/rocm/hip/lib64
2023-05-16 08:42:03.554306: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
prepare tokenizer
Use DreamBooth method.
prepare images.
found directory /tmp/test/img/10_test contains 6 image files
60 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした

[Dataset 0]
  batch_size: 1
  resolution: (512, 512)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 1024
  bucket_reso_steps: 64
  bucket_no_upscale: True

  [Subset 0 of Dataset 0]
    image_dir: "/tmp/test/img/10_test"
    image_count: 6
    num_repeats: 10
    shuffle_caption: False
    keep_tokens: 0
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: test
    caption_extension: .caption

[Dataset 0]
loading image sizes.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 1081.47it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (384, 512), count: 20
bucket 1: resolution (384, 640), count: 10
bucket 2: resolution (448, 448), count: 10
bucket 3: resolution (448, 512), count: 10
bucket 4: resolution (512, 384), count: 10
mean ar error (without repeats): 0.013716917997829814
prepare accelerator
/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/accelerate/accelerator.py:249: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
  warnings.warn(
Using accelerator 0.15.0 or above.
loading model for process 0/1
load Diffusers pretrained models: runwayml/stable-diffusion-v1-5
Fetching 15 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 105208.29it/s]
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
Replace CrossAttention.forward to use xformers
[Dataset 0]
caching latents.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:02<00:00,  2.35it/s]
import network module: networks.lora
create LoRA network. base dim (rank): 8, alpha: 1.0
create LoRA for Text Encoder: 72 modules.
create LoRA for U-Net: 192 modules.
enable LoRA for text encoder
enable LoRA for U-Net
prepare optimizer, data loader etc.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/opt/rocm/lib64'), PosixPath('/opt/rocm/profiler/lib64'), PosixPath('/home/user/Neural/kohya_ss/venv/lib/python3.10/site-packages/cv2/../../lib64'), PosixPath('/opt/rocm/profiler/lib'), PosixPath('/opt/rocm/opencl/lib64'), PosixPath('/opt/rocm/hip/lib64')}
  warn(msg)
/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: /home/user/Neural/kohya_ss/venv/lib/python3.10/site-packages/cv2/../../lib64:/opt/rocm/lib:/opt/rocm/lib64:/opt/rocm/profiler/lib:/opt/rocm/profiler/lib64:/opt/rocm/opencl/lib:/opt/rocm/hip/lib:/opt/rocm/opencl/lib64:/opt/rocm/hip/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('local/unix'), PosixPath('@/tmp/.ICE-unix/2620,unix/unix')}
  warn(msg)
/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/org/freedesktop/DisplayManager/Session1')}
  warn(msg)
/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/etc/gtk/gtkrc'), PosixPath('/home/user/.gtkrc')}
  warn(msg)
/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/Sessions/1')}
  warn(msg)
/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/org/freedesktop/DisplayManager/Seat0')}
  warn(msg)
/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/home/user/.cache/dotnet_bundle_extract')}
  warn(msg)
/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('https'), PosixPath('//debuginfod.fedoraproject.org')}
  warn(msg)
/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/share/modulefiles/Linux'), PosixPath('/usr/share/modulefiles/Core')}
  warn(msg)
/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/Windows/1')}
  warn(msg)
/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('() {  eval "$($LMOD_DIR/ml_cmd "$@")"\n}')}
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib64')}
  warn(msg)
ERROR: /home/user/Neural/kohya_ss/venv/bin/python: undefined symbol: cudaRuntimeGetVersion
CUDA SETUP: libcudart.so path is None
CUDA SETUP: Is seems that your cuda installation is not in your path. See https://github.com/TimDettmers/bitsandbytes/issues/85 for more information.
CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!
/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
  warn(msg)
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 00
CUDA SETUP: Loading binary /home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
use 8-bit AdamW optimizer | {}
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 60
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 60
  num epochs / epoch数: 1
  batch size per device / バッチサイズ: 1
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 60
steps:   0%|                                                                                                                                              | 0/60 [00:00<?, ?it/s]
epoch 1/1
Traceback (most recent call last):
  File "/home/user/Neural/kohya_ss/train_network.py", line 783, in <module>
    train(args)
  File "/home/user/Neural/kohya_ss/train_network.py", line 634, in train
    optimizer.step()
  File "/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/accelerate/optimizer.py", line 134, in step
    self.scaler.step(self.optimizer, closure)
  File "/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 338, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 285, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/torch/optim/optimizer.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 263, in step
    self.update_step(group, p, gindex, pindex)
  File "/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/optim/optimizer.py", line 504, in update_step
    F.optimizer_update_8bit_blockwise(
  File "/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/bitsandbytes/functional.py", line 975, in optimizer_update_8bit_blockwise
    str2optimizer8bit_blockwise[optimizer_name][0](
NameError: name 'str2optimizer8bit_blockwise' is not defined
steps:   0%|                                                                                                                                              | 0/60 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/user/Neural/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/accelerate/commands/launch.py", line 923, in launch_command
    simple_launcher(args)
  File "/home/user/Neural/kohya_ss/venv/lib64/python3.10/site-packages/accelerate/commands/launch.py", line 579, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/user/Neural/kohya_ss/venv/bin/python', 'train_network.py', '--enable_bucket', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--train_data_dir=/tmp/test/img', '--resolution=512,512', '--output_dir=/tmp/test/model', '--logging_dir=/tmp/test/log', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-05', '--unet_lr=0.0001', '--network_dim=8', '--output_name=test', '--lr_scheduler_num_cycles=1', '--learning_rate=0.0001', '--lr_scheduler=cosine', '--lr_warmup_steps=6', '--train_batch_size=1', '--max_train_steps=60', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--cache_latents', '--optimizer_type=AdamW8bit', '--max_data_loader_n_workers=0', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 1.
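
The key lines in this log are bitsandbytes loading libbitsandbytes_cpu.so ("compiled without GPU support") and the failed libcudart.so lookup: with only ROCm directories on LD_LIBRARY_PATH and no CUDA runtime visible, the CPU-only library does not provide the 8-bit optimizer functions, so the lookup table behind the NameError on str2optimizer8bit_blockwise is never created. A minimal way to confirm this from the same venv (a sketch; these commands are not part of the original report):

. venv/bin/activate
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"  # expect a CUDA version and True on a working setup
python -m bitsandbytes                                                           # bitsandbytes' own self-check, as its banner suggests
ldconfig -p | grep libcudart                                                     # is any libcudart.so visible to the dynamic loader?
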
kuriot commented 1 year ago

I forgot to write that GPU is Nvidia RTX 3070.

15ky3 commented 1 year ago

Hi, I also had issues getting this running on a vast.ai machine (it's also Linux). I spent a few hours getting it to run and wrote a script for it. Maybe it helps you.

You can find it here

Please report whether it works for you. My problem is that when I create a LoRA on the Linux machines, I get desaturated images when using my trained LoRAs. Could you please report if you've had similar problems?

Thanks 😊

kuriot commented 1 year ago

Your script helped me, thanks, although I didn't use all of it. Setting MKL_THREADING_LAYER fixed a strange problem about an Intel CPU for me (I use AMD).

I also decided to check what setup.sh does and do everything my own way, with conda.

That is my installation process:

rm -rf ./venv
conda create -n kohya python=3.10.9
conda activate kohya
conda install pytorch==1.13.1 torchvision==0.14.1 xformers -c pytorch -c nvidia -c xformers
pip install triton
conda install -c conda-forge cudatoolkit=11.8.0
python3 -m pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.12.*
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
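
Before wiring up the launch script, a quick sanity check that the environment actually sees the GPU can save time (an optional check, not part of the steps above):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"        # expect 1.13.1 and True
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"   # should list the GPU if CUDA/cuDNN are visible
echo "$LD_LIBRARY_PATH"                                                               # should now include $CONDA_PREFIX/lib and the cuDNN lib dir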

And launch script:

#!/usr/bin/env bash

source ~/.miniconda3/etc/profile.d/conda.sh
conda activate kohya

export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/:$CUDNN_PATH/lib:$LD_LIBRARY_PATH
export MKL_THREADING_LAYER=1

SCRIPT_DIR=$(cd -- "$(dirname -- "$0")" && pwd)

cd "$SCRIPT_DIR"

python "$SCRIPT_DIR/kohya_gui.py" "$@"

Now I don't get any errors and training works. On my Nvidia RTX 3070 115W laptop video card I get 2.7 tokens per second with a maximum dataset image resolution of 768x768. I don't know whether that is good or not.

I use Torch v1.13.1 because I had problems with 2.0.0 and 2.0.1 in Stable Diffusion, so I'm sticking with 1.13.1 on Linux for now.

15ky3 commented 1 year ago

Hm, I will try this out; maybe it helps with my desaturated images.

So you don’t install it with the setup.sh script? Only with the steps above?

kuriot commented 1 year ago

I completely removed kohya_ss to check whether I needed to run setup.sh, and no, I didn't need to.

git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss

conda create -n kohya python=3.10.9
conda activate kohya
conda install pytorch==1.13.1 torchvision==0.14.1 xformers -c pytorch -c nvidia -c xformers
pip install triton
conda install -c conda-forge cudatoolkit=11.8.0
python3 -m pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.12.*
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

Then I created a start.sh file:

#!/usr/bin/env bash

source ~/.miniconda3/etc/profile.d/conda.sh
conda activate kohya

export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/:$CUDNN_PATH/lib:$LD_LIBRARY_PATH
export MKL_THREADING_LAYER=1

SCRIPT_DIR=$(cd -- "$(dirname -- "$0")" && pwd)

cd "$SCRIPT_DIR"

python "$SCRIPT_DIR/kohya_gui.py" "$@"

chmod +x start.sh
./start.sh

Everything works.

15ky3 commented 1 year ago

Thanks, will try it out tomorrow 👍

kuriot commented 1 year ago

I've updated the install commands to fix some TensorRT-related errors in the logs (the libnvinfer warnings above). Also, captioning didn't work until these fixes.

git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss

conda create -n kohya python=3.10.9
conda activate kohya
conda install pytorch==1.13.1 torchvision==0.14.1 xformers -c pytorch -c nvidia -c xformers
conda install -c conda-forge cudatoolkit=11.8.0
python3 -m pip install 'nvidia-cudnn-cu11>=8.6,<9' tensorflow==2.11.* tensorrt==8.6.1 triton
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
ln -sr $CONDA_PREFIX/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.8 $CONDA_PREFIX/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.7
ln -sr $CONDA_PREFIX/lib/python3.10/site-packages/tensorrt_libs/libnvinfer_plugin.so.8 $CONDA_PREFIX/lib/python3.10/site-packages/tensorrt_libs/libnvinfer_plugin.so.7
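
To check that the symlinks resolve and the TensorRT install is importable (an optional check, not part of the original commands):

ls -l $CONDA_PREFIX/lib/python3.10/site-packages/tensorrt_libs/libnvinfer*.so.7
python -c "import tensorrt; print(tensorrt.__version__)"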

And start.sh:

#!/usr/bin/env bash

source ~/.miniconda3/etc/profile.d/conda.sh
conda activate kohya

export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/python3.10/site-packages/tensorrt_libs:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib:$LD_LIBRARY_PATH
export MKL_THREADING_LAYER=1

SCRIPT_DIR=$(cd -- "$(dirname -- "$0")" && pwd)

cd "$SCRIPT_DIR"

python "$SCRIPT_DIR/kohya_gui.py" "$@"
15ky3 commented 1 year ago

Thanks, I built a script in my repo. Will see if it works for me :)

alcaitiff commented 1 year ago

For me it returned this error when running start.sh:

Traceback (most recent call last):
  File "/AI/kohya/kohya_gui.py", line 4, in <module>
    from dreambooth_gui import dreambooth_tab
  File "/AI/kohya/dreambooth_gui.py", line 13, in <module>
    from library.common_gui import (
  File "/AI/kohya/library/common_gui.py", line 2, in <module>
    from easygui import msgbox
ModuleNotFoundError: No module named 'easygui'

alcaitiff commented 1 year ago

I installed easygui with pip install easygui, and after that the GUI started.

But when I tried to train I got this error: _ctypes.cpython-39-x86_64-linux-gnu.so: undefined symbol: ffi_closure_alloc, version LIBFFI_CLOSURE_7.0

15ky3 commented 1 year ago

Which system are you running this on?

kuriot commented 1 year ago

Ah, sorry, it's my mistake. I think I installed requirements.txt into the conda environment at some point and forgot about it. I just created a clean environment and found that I did need to install some requirements after the commands in the previous messages. So I removed everything already installed in the environment from requirements.txt, then ran pip install -r requirements.txt, and it worked.

Here's my requirements.txt from which I removed what was already installed manually:

accelerate==0.18.0
albumentations==1.3.0
altair==4.2.2
dadaptation==1.5
diffusers[torch]==0.10.2
easygui==0.98.3
einops==0.6.0
ftfy==6.1.1
gradio==3.28.1
lion-pytorch==0.0.6
opencv-python==4.7.0.68
pytorch-lightning==1.9.0
safetensors==0.2.6
toml==0.10.2
voluptuous==0.13.1
wandb==0.15.0
fairscale==0.4.13
requests==2.28.2
timm==0.6.12
huggingface-hub==0.13.3
lycoris_lora==0.1.4
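
To see which of the original pins were already satisfied by the conda and pip steps above before trimming the file, a check along these lines works (my own example, not from the thread):

pip freeze | grep -iE 'torch|xformers|tensorflow|triton'
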
alcaitiff commented 1 year ago

After installing all requirements i got this error:

WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 1.13.1 with CUDA 1106 (you have 1.13.1)
    Python 3.10.11 (you have 3.10.11)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available. Set XFORMERS_MORE_DETAILS=1 for more details
2023-05-16 14:43:05.161259: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-16 14:43:05.282488: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-05-16 14:43:06.021567: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory;

15ky3 commented 1 year ago

Yes, it's working without desaturated images. Big thanks @kuriot!

I installed it successfully and wrote a Bash script for all the steps, but the script doesn't work; you need to copy and paste the commands manually, and then it works. Maybe it helps someone run this under Linux. I'll try to figure out what's wrong with the script, but when you execute the steps by hand it works. Today I'm too drunk xD

You can find it here

kuriot commented 1 year ago

Before answering your previous message I decided to do a clean install, and it didn't work. :) I've spent the last couple of hours getting it installed again. Literally these commands, in this sequence, work for me. Anyway, for anyone who stumbles upon this issue, it may be a starting point.

git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss

conda create -y -n kohya python=3.10.9
conda activate kohya

conda install -y pytorch==1.13.1 torchvision==0.14.1 xformers -c pytorch -c nvidia -c xformers

conda install -y -c conda-forge cudatoolkit=11.8.0

python3 -m pip install 'nvidia-cudnn-cu11>=8.6,<9' triton

python3 -m pip install --extra-index-url https://pypi.nvidia.com tensorrt-libs

python3 -m pip install -r requirements.txt

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
ln -sr $CONDA_PREFIX/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.8 $CONDA_PREFIX/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.7
ln -sr $CONDA_PREFIX/lib/python3.10/site-packages/tensorrt_libs/libnvinfer_plugin.so.8 $CONDA_PREFIX/lib/python3.10/site-packages/tensorrt_libs/libnvinfer_plugin.so.7

python3 -m pip cache purge
conda clean -afy

conda deactivate
aycaecemgul commented 1 year ago

I used python3 -m pip install -r requirements_unix.txt instead and ran it with python3 kohya_gui.py --share, and it worked! Thanks.

future141 commented 1 year ago

Hi kuriot, I wonder if you still have this problem with the newest release. I have the same problem now; I wonder if we could have a discussion about it. Regards,

kuriot commented 1 year ago

@future141 I stick to an older version. Also, there's a problem with gradio that is fixed by installing it manually, and in the branch I use there are problems with quotes in the requirements_linux.txt file.

So, here are the steps I checked on a clean install, and it works fine:

git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss
git checkout v21.7.16

conda create -y -n kohya python=3.10.9
conda activate kohya

python3 -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

python3 -m pip install 'xformers==0.0.20' 'bitsandbytes==0.35.0' 'accelerate==0.15.0' 'tensorboard==2.12.1' 'tensorflow==2.12.0' -r requirements.txt

conda install -y -c conda-forge cudatoolkit=11.8.0

python3 -m pip install 'nvidia-cudnn-cu11>=8.6,<9'

python3 -m pip install --extra-index-url https://pypi.nvidia.com tensorrt-libs

python3 -m pip install 'gradio==3.36.1'

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
ln -sr $CONDA_PREFIX/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.8 $CONDA_PREFIX/lib/python3.10/site-packages/tensorrt_libs/libnvinfer.so.7
ln -sr $CONDA_PREFIX/lib/python3.10/site-packages/tensorrt_libs/libnvinfer_plugin.so.8 $CONDA_PREFIX/lib/python3.10/site-packages/tensorrt_libs/libnvinfer_plugin.so.7

python3 -m pip cache purge
conda clean -afy

conda deactivate

And the run.sh file is unchanged. Just fix the path to your ~/.miniconda3/etc/profile.d/conda.sh on line 3 of the script. I'll add it here again so you don't have to scroll up:

#!/usr/bin/env bash

source ~/.miniconda3/etc/profile.d/conda.sh
conda activate kohya

export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/python3.10/site-packages/tensorrt_libs:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib:$LD_LIBRARY_PATH
export MKL_THREADING_LAYER=1

SCRIPT_DIR=$(cd -- "$(dirname -- "$0")" && pwd)

cd "$SCRIPT_DIR"

python "$SCRIPT_DIR/kohya_gui.py" "$@"
future141 commented 1 year ago

My friend, I found the problem in my case (a freshly installed Ubuntu). The case is described in https://github.com/bmaltais/kohya_ss/issues/1109. You could possibly try that method; I wonder if it can help.

kuriot commented 1 year ago

@future141 Thanks, I'll take a look. With SDXL 1.0 out I want to try to train it, so it's time to update Kohya. :)