Closed Ancanus closed 8 months ago
Did you try deleting the venv and then re-running setup.bat? This usually fixes all those issues.
You can go back to previous releases with:
git checkout releasename
like in
git checkout v21.5.2
Unfortunately no amount of venv recreation is solving this one. Will try to isolate the last non-repro release.
Okay, so I did a git pull at about 4pm GMT. Just tried to train a LoRA and I get the below:
=============================================================
23:51:00-418665 INFO nVidia toolkit detected
23:51:01-313733 INFO Torch 2.0.1+cu118
23:51:01-326239 INFO Torch backend: nVidia CUDA 11.8 cuDNN 8700
23:51:01-327921 INFO Torch detected GPU: NVIDIA GeForce RTX 4090 VRAM 24564 Arch (8, 9) Cores 128
23:51:01-328420 INFO Verifying requirements
23:51:02-669277 INFO headless: False
23:51:02-671777 INFO Load CSS...
Running on local URL: http://127.0.0.1:7860
To create a public link, set share=True in launch().
23:51:10-855045 INFO Loading config...
23:51:13-824118 INFO Start training LoRA Standard ...
23:51:13-825618 INFO Folder 2_rsemnre woman: 20 images found
23:51:13-826118 INFO Folder 2_rsemnre woman: 40 steps
23:51:13-826617 INFO Total steps: 40
23:51:13-827117 INFO Train batch size: 4
23:51:13-827617 INFO Gradient accumulation steps: 1.0
23:51:13-828117 INFO Epoch: 100
23:51:13-828617 INFO Regulatization factor: 1
23:51:13-829117 INFO max_train_steps (40 / 4 / 1.0 * 100 * 1) = 1000
23:51:13-830117 INFO stop_text_encoder_training = 0
23:51:13-830617 INFO lr_warmup_steps = 50
23:51:13-832617 INFO accelerate launch --num_cpu_threads_per_process=2 "train_network.py" --enable_bucket
--pretrained_model_name_or_path="C:/stable-diffusion-webui/models/Stable-diffusion/realisticVision_v13.safetensors" --train_data_dir="D:/0.Lora/#Person/rsemnre/img"
--resolution=768,768 --output_dir="D:/0.Lora/#Person/rsemnre/model" --logging_dir="D:/0.Lora/#Person/rsemnre/log" --network_alpha="64" --save_model_as=safetensors
--network_module=networks.lora --text_encoder_lr=1e-05 --unet_lr=5e-05 --network_dim=128 --output_name="custom_rsemnre_realv13_20_torch2" --lr_scheduler_num_cycles="100"
--learning_rate="3e-06" --lr_scheduler="cosine_with_restarts" --lr_warmup_steps="50" --train_batch_size="4" --max_train_steps="1000" --save_every_n_epochs="10" --mixed_precision="bf16"
--save_precision="bf16" --caption_extension=".txt" --cache_latents --optimizer_type="Lion" --max_data_loader_n_workers="2" --max_token_length=225 --clip_skip=2 --bucket_reso_steps=64
--xformers --persistent_data_loader_workers --bucket_no_upscale --noise_offset=0.1 --wandb_api_key="False" --sample_sampler=euler_a
--sample_prompts="D:/0.Lora/#Person/rsemnre/model\sample\prompt.txt" --sample_every_n_epochs="10" --sample_every_n_steps="100"
Traceback (most recent call last):
File "C:\Users\sukhy\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\sukhy\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 4, in <module>
This was working fine prior to doing the git pull at 4pm GMT
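For anyone reading the log above, the max_train_steps value of 1000 can be reproduced by hand. This is only a sketch of the arithmetic the log prints, not the GUI's actual code: the function name and parameters are made up for illustration, and the assumption that the "2_" folder prefix means 2 repeats per image is inferred from "20 images found" followed by "40 steps".

```python
import math


def max_train_steps(images, repeats, batch_size, grad_accum_steps, epochs, reg_factor=1):
    """Reproduce the arithmetic from the log line (hypothetical helper)."""
    # 20 images x 2 repeats = 40 steps for the folder (assumed meaning of "2_").
    folder_steps = images * repeats
    # 40 / 4 / 1.0 * 100 * 1, rounded up to a whole number of steps.
    return math.ceil(folder_steps / batch_size / grad_accum_steps * epochs * reg_factor)


print(max_train_steps(images=20, repeats=2, batch_size=4, grad_accum_steps=1.0, epochs=100))
```

With the values from this run it prints 1000, matching --max_train_steps="1000" in the accelerate command.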
Delete the venv folder and re-run setup.bat. This error is due to pip not properly installing some of the modules. This will straighten things out.
Many thanks for this. I'm 99% of the way there. So I deleted Venv and re-ran Setup. I finally got the training working BUT I do get a message I've never seen before:
"A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named 'triton'"
The tool does appear to be training right now so unsure on the severity of this missing module named 'triton'
This one is unfortunately linked to torch 2.0.1. No one has compiled a triton version for it yet. It does not cause any model issue but is really annoying. I will research how triton windows was compiled by the original person and see if I could compile my own.
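If you want to confirm which of these modules your venv actually has, a quick check like the one below may help. It is a minimal sketch using only the standard library; has_module is a hypothetical helper name, and it only tells you whether a package can be found, not whether it matches your torch version.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Run this with the venv's Python so it inspects the right environment.
    for name in ("torch", "xformers", "triton"):
        print(f"{name}: {'installed' if has_module(name) else 'missing'}")
```

Seeing "triton: missing" here lines up with the warning quoted above and, per the reply, does not by itself stop training.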
Friends, I have exhausted all possible searches. This is what I was able to conclude:
- Something very bad changed that is triggered when you git pull.
- The root problem is not in this repository.
- Whatever it is, it disables the CUDA capabilities of your Torch installation. That means it will manifest here, in A1111, and in SD.Next as if you didn't have a GPU: FP16 not detected, rendering taking 200:00:00, infinite time caching latents, you name it.
- I also experienced the tensorflow.io errors, which did go away with the venv reset, but this one about torch.cuda.is_available() returning False apparently is here to stay. I'm sorry. Maybe a future version will do away with this misery.
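To see what the installed torch build itself reports, something along these lines can be run inside the activated venv. It is a rough diagnostic sketch, not part of this repository: cuda_report is a hypothetical name, and the only torch calls used are torch.__version__, torch.version.cuda, torch.cuda.is_available() and torch.cuda.get_device_name().

```python
def cuda_report():
    """Collect what the installed torch build says about CUDA (hypothetical helper)."""
    report = {"torch_installed": False}
    try:
        import torch
    except (ImportError, OSError):
        # torch is missing, or its native libraries failed to load.
        return report
    report["torch_installed"] = True
    # e.g. "2.0.1+cu118"; a "+cpu" suffix would indicate a CPU-only wheel.
    report["torch_version"] = torch.__version__
    # CUDA version the wheel was built against; None on CPU-only builds.
    report["built_for_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["gpu"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If built_for_cuda is None, pip most likely installed a CPU-only torch. If it shows a CUDA version but cuda_available is False, the problem is more likely on the driver or toolkit side.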
I ran into this very thing, I deleted my venv, recreated it and then reran the setup.bat. If you don't run the setup.bat it will not install the modules correctly. At least that is my experience.
Manually setting things up using only the requirements.txt file will not work, as that file is mostly for Linux and macOS. If you install things manually you need to install torch and the other pieces (like xformers) first, and then install the modules from the torch-version-specific requirements file... so always use setup.bat.
Friends, I was able to fix it. I had to manually install CUDA 11.8 and after that it started working.
@bmaltais could you please make your repo work again? It's been a while, and I and many other users can't use it. Each time I uninstall it and come back a few days later thinking it finally works, but nothing.
Well, it works for many people... so I think the issue might be something isolated to your computer. You can try going back to a version you know was working and see if things start working again. Best is to delete the full venv, then do a
git checkout <release number like v2.21.5>
then run setup.bat and use torch 1 as it is as close as can be to kohya's supported pip libraries.
And see if things work again. If they do, then you might want to stick with that release for a while. It is possible that one of the updates from kohya in his code is the cause. Remember, I don't write the trainer code, I just wrap it in a gradio GUI. Most of the issues will come from trainer code upgrades (and sometimes from me, due to Python module upgrades for torch 2).
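The rollback steps above could be scripted roughly as follows. This is a sketch under assumptions, not a supported tool: rollback is a hypothetical helper, the C:/kohya_ss path is taken from the traceback earlier in the thread, and setup.bat will still ask its usual questions when it runs. By default it only returns the plan and changes nothing.

```python
import shutil
import subprocess
from pathlib import Path


def rollback(repo_dir, release, dry_run=True):
    """Delete the venv, check out a release tag, and re-run setup.bat (hypothetical helper)."""
    repo = Path(repo_dir)
    venv = repo / "venv"
    plan = [
        f"delete {venv}",
        f"git checkout {release}",
        "setup.bat",
    ]
    if dry_run:
        # Nothing is touched; the caller just gets the list of steps.
        return plan
    shutil.rmtree(venv, ignore_errors=True)
    subprocess.run(["git", "checkout", release], cwd=repo, check=True)
    # Windows only; setup.bat prompts for input, so run this from a console.
    subprocess.run(["cmd", "/c", "setup.bat"], cwd=repo, check=True)
    return plan


if __name__ == "__main__":
    for step in rollback("C:/kohya_ss", "v21.5.2"):
        print(step)
```

Pass dry_run=False only once the printed plan looks right for your install.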
Had the same "AttributeError: module 'tensorflow' has no attribute 'io'" error after git pull and running setup.bat. Solution for me was to delete the venv folder and re-run setup.bat, but now any lora/lyco I train doesn't work when using it.
Ever since I pulled I've been unable to get Torch 1 or Torch 2 working:
which results in "Torch reports CUDA not available" during gui.bat start up.
nvidia-smi.txt pipfreeze.txt