bmaltais / kohya_ss


Does anyone have a last known working commit? #943

Closed Ancanus closed 8 months ago

Ancanus commented 1 year ago

Ever since I pulled I've been unable to get Torch 1 or Torch 2 working:

>>> import torch
>>> torch.cuda.is_available()
False

which results in the "Torch reports CUDA not available" message during gui.bat start-up.
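
A fuller check along the same lines (a sketch; run it with the venv's python.exe) shows whether the installed wheel was built with CUDA support at all:

import torch

print(torch.__version__)          # a "+cpu" suffix here means a CPU-only wheel is installed
print(torch.version.cuda)         # None on CPU-only builds, e.g. "11.8" for cu118 wheels
print(torch.cuda.is_available())
print(torch.cuda.device_count())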

nvidia-smi.txt pipfreeze.txt

bmaltais commented 1 year ago

Did you try deleting the venv and then re-running setup.bat? This usually fixes all those issues.

You can go back to previous releases with:

git checkout releasename

like in

git checkout v21.5.2

Ancanus commented 1 year ago

Unfortunately no amount of venv recreation is solving this one. Will try to isolate last non-repro release.

sukhysall commented 1 year ago

Okay, so I did a git pull at about 4pm GMT. Just tried to train a LoRA and I get the below:

=============================================================

23:51:00-418665 INFO nVidia toolkit detected
23:51:01-313733 INFO Torch 2.0.1+cu118
23:51:01-326239 INFO Torch backend: nVidia CUDA 11.8 cuDNN 8700
23:51:01-327921 INFO Torch detected GPU: NVIDIA GeForce RTX 4090 VRAM 24564 Arch (8, 9) Cores 128
23:51:01-328420 INFO Verifying requirements
23:51:02-669277 INFO headless: False
23:51:02-671777 INFO Load CSS...
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
23:51:10-855045 INFO Loading config...
23:51:13-824118 INFO Start training LoRA Standard ...
23:51:13-825618 INFO Folder 2_rsemnre woman: 20 images found
23:51:13-826118 INFO Folder 2_rsemnre woman: 40 steps
23:51:13-826617 INFO Total steps: 40
23:51:13-827117 INFO Train batch size: 4
23:51:13-827617 INFO Gradient accumulation steps: 1.0
23:51:13-828117 INFO Epoch: 100
23:51:13-828617 INFO Regulatization factor: 1
23:51:13-829117 INFO max_train_steps (40 / 4 / 1.0 * 100 * 1) = 1000
23:51:13-830117 INFO stop_text_encoder_training = 0
23:51:13-830617 INFO lr_warmup_steps = 50
23:51:13-832617 INFO accelerate launch --num_cpu_threads_per_process=2 "train_network.py" --enable_bucket --pretrained_model_name_or_path="C:/stable-diffusion-webui/models/Stable-diffusion/realisticVision_v13.safetensors" --train_data_dir="D:/0.Lora/#Person/rsemnre/img" --resolution=768,768 --output_dir="D:/0.Lora/#Person/rsemnre/model" --logging_dir="D:/0.Lora/#Person/rsemnre/log" --network_alpha="64" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=1e-05 --unet_lr=5e-05 --network_dim=128 --output_name="custom_rsemnre_realv13_20_torch2" --lr_scheduler_num_cycles="100" --learning_rate="3e-06" --lr_scheduler="cosine_with_restarts" --lr_warmup_steps="50" --train_batch_size="4" --max_train_steps="1000" --save_every_n_epochs="10" --mixed_precision="bf16" --save_precision="bf16" --caption_extension=".txt" --cache_latents --optimizer_type="Lion" --max_data_loader_n_workers="2" --max_token_length=225 --clip_skip=2 --bucket_reso_steps=64 --xformers --persistent_data_loader_workers --bucket_no_upscale --noise_offset=0.1 --wandb_api_key="False" --sample_sampler=euler_a --sample_prompts="D:/0.Lora/#Person/rsemnre/model\sample\prompt.txt" --sample_every_n_epochs="10" --sample_every_n_steps="100"
Traceback (most recent call last):
  File "C:\Users\sukhy\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\sukhy\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 4, in <module>
  File "C:\kohya_ss\venv\lib\site-packages\accelerate\__init__.py", line 3, in <module>
    from .accelerator import Accelerator
  File "C:\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 39, in <module>
    from .tracking import LOGGER_TYPE_TO_CLASS, GeneralTracker, filter_trackers
  File "C:\kohya_ss\venv\lib\site-packages\accelerate\tracking.py", line 42, in <module>
    from torch.utils import tensorboard
  File "C:\kohya_ss\venv\lib\site-packages\torch\utils\tensorboard\__init__.py", line 12, in <module>
    from .writer import FileWriter, SummaryWriter  # noqa: F401
  File "C:\kohya_ss\venv\lib\site-packages\torch\utils\tensorboard\writer.py", line 16, in <module>
    from ._embedding import (
  File "C:\kohya_ss\venv\lib\site-packages\torch\utils\tensorboard\_embedding.py", line 9, in <module>
    _HAS_GFILE_JOIN = hasattr(tf.io.gfile, "join")
  File "C:\kohya_ss\venv\lib\site-packages\tensorboard\lazy.py", line 65, in __getattr__
    return getattr(load_once(self), attr_name)
AttributeError: module 'tensorflow' has no attribute 'io'

sukhysall commented 1 year ago

This was working fine prior to doing the git pull at 4pm GMT

bmaltais commented 1 year ago

Delete the venv folder and re-run setup.bat. This error is due to pip not properly installing some of the modules. This will straighten things up.
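
One way to confirm the reinstall actually fixed the modules (a sketch, run with the venv's python, not part of the original reply) is to retry, outside the GUI, the same import that failed in the traceback above:

try:
    from torch.utils import tensorboard  # the import that accelerate's tracking module pulls in
    print("tensorboard import OK")
except Exception as e:
    print("still broken:", e)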

Ancanus commented 1 year ago

Friends, I have exhausted all possible searches. This is what I was able to conclude:

  1. Something very bad changed that is triggered when you git pull.
  2. The root problem is not in this repository.
  3. Whatever it is, it disables the CUDA capabilities of your Torch installation. That means it will manifest both here and in A1111 and SD.Next, as if you didn't have a GPU: FP16 not detected, rendering taking 200:00:00, infinite time caching latents, you name it.
  4. I also experienced the tensorflow.io errors, which did go away with the venv reset, but the one about
    torch.cuda.is_available()

    returning False apparently is here to stay. I'm sorry. Maybe a future version will do away with this misery.

sukhysall commented 1 year ago

> Delete the venv folder and re-run setup.bat. This error is due to pip not properly installing some of the modules. This will straighten things up.

Many thanks for this. I'm 99% of the way there. So I deleted the venv and re-ran setup.bat. I finally got the training working, BUT I do get a message I've never seen before:

"A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named 'triton'"

The tool does appear to be training right now, so I'm unsure of the severity of this missing 'triton' module.

bmaltais commented 1 year ago

> Delete the venv folder and re-run setup.bat. This error is due to pip not properly installing some of the modules. This will straighten things up.
>
> Many thanks for this. I'm 99% of the way there. So I deleted the venv and re-ran setup.bat. I finally got the training working, BUT I do get a message I've never seen before:
>
> "A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named 'triton'"
>
> The tool does appear to be training right now, so I'm unsure of the severity of this missing 'triton' module.

This one is unfortunately linked to torch 2.0.1. No one has compiled a triton version for it yet. It does not cause any model issues but is really annoying. I will research how the Windows triton build was compiled by the original person and see if I can compile my own.
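
If you want to confirm the warning is only about the missing optional package, a quick check (a sketch, run with the venv's python, not part of the original reply) is:

try:
    import triton
    print("triton", triton.__version__)
except ImportError:
    print("triton not installed; per the warning, some xformers optimizations are skipped")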

Elldreth commented 1 year ago

> Friends, I have exhausted all possible searches. This is what I was able to conclude:
>
>   1. Something very bad changed that is triggered when you git pull.
>   2. The root problem is not in this repository.
>   3. Whatever it is, it disables the CUDA capabilities of your Torch installation. That means it will manifest both here and in A1111 and SD.Next, as if you didn't have a GPU: FP16 not detected, rendering taking 200:00:00, infinite time caching latents, you name it.
>   4. I also experienced the tensorflow.io errors, which did go away with the venv reset, but the one about
>
>     torch.cuda.is_available()
>
>     returning False apparently is here to stay. I'm sorry. Maybe a future version will do away with this misery.

I ran into this very thing. I deleted my venv, recreated it, and then re-ran setup.bat. If you don't run setup.bat, it will not install the modules correctly. At least that is my experience.

bmaltais commented 1 year ago

> I ran into this very thing. I deleted my venv, recreated it, and then re-ran setup.bat. If you don't run setup.bat, it will not install the modules correctly. At least that is my experience.

Manually setting things up using only the requirements.txt file will not work, as that file is mostly for Linux and macOS. If you install things manually, you need to install torch and the other pieces (like xformers) first, and then install the modules from the torch-version-specific requirements file... so always use setup.bat.

Ancanus commented 1 year ago

Friends, I was able to fix it. I had to manually install CUDA 11.8 and after that it started working.
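
For anyone applying the same fix, a quick sanity check afterwards (a sketch, run with the venv's python, not part of the original comment) looks like:

import torch

print(torch.__version__, torch.version.cuda)   # e.g. 2.0.1+cu118 and 11.8, per the log above
print(torch.cuda.is_available())               # should now be True
print(torch.cuda.get_device_name(0))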

Djanghost commented 1 year ago

@bmaltais could you please make your repo work again? For a while now, I and many other users haven't been able to use it. Each time I uninstall it, then come back a few days later thinking it will finally work, but nothing.

bmaltais commented 1 year ago

> @bmaltais could you please make your repo work again? For a while now, I and many other users haven't been able to use it. Each time I uninstall it, then come back a few days later thinking it will finally work, but nothing.

Well, it works for many people... so I think the issue might be something isolated to your computer. You can try going back to a version you know was working and see if things start working again. Best is to delete the full venv, then do a

git checkout <release number like v2.21.5>

then run setup.bat and pick torch 1, as it is as close as can be to kohya's supported pip libraries.

And see if things work again. If they do, then you might want to stick with that release for a while. It is possible one of the updates from kohya in his code is the cause. Remember, I don't write the trainer code; I just wrap it in a Gradio GUI. Most of the issues will come from the trainer code upgrades (and sometimes from me, due to Python module upgrades for torch 2).

Zyin055 commented 1 year ago

Had the same AttributeError: module 'tensorflow' has no attribute 'io' error after a git pull and running setup.bat. The solution for me was to delete the venv folder and re-run setup.bat, but now any lora/lyco I train doesn't work when I use it.