bmaltais / kohya_ss


WD14 Tagger Not Recognizing CUDA After Update? #2529

Closed: Rcorthy closed this issue 5 months ago

Rcorthy commented 5 months ago

I recently updated to the latest version and have been unable to use the WD14 tagger ever since. I am getting different errors when I try with and without onnx.

When I try with onnx, the error seems to be telling me that onnx can't recognize a CUDA-capable device, but I checked and made certain that my computer is CUDA-capable and has the latest CUDA installed. I've tried manually identifying my GPU in the setup and using a fresh install. The error text for this case is the first one attached.

When I try without onnx, I seem to be getting something about Keras 3 only supporting certain file types. I was kind of able to decipher the other error text, but I haven't the slightest idea what this one's about. The error text for this case is the second one attached.

It may be worth noting that on other GUIs like Automatic1111 and ComfyUI, I can get the tagging models to work just fine, but I really prefer kohya. Any guidance would be greatly appreciated.

CASE 1: WITH ONNX

```
EP Error D:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:121 onnxruntime::CudaCall D:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:114 onnxruntime::CudaCall CUDA failure 100: no CUDA-capable device is detected ; GPU=-727787712 ; hostname=RHIT-PW01EGB5 ; file=D:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_execution_provider.cc ; line=245 ; expr=cudaSetDevice(info.device_id); when using ['CUDAExecutionProvider']
Falling back to ['CUDAExecutionProvider', 'CPUExecutionProvider'] and retrying.
Traceback (most recent call last):
  File "C:\Users\billipem\AIUI\kohya_ss\venv\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\Users\billipem\AIUI\kohya_ss\venv\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
RuntimeError: D:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:121 onnxruntime::CudaCall D:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:114 onnxruntime::CudaCall CUDA failure 100: no CUDA-capable device is detected ; GPU=-727787712 ; hostname=RHIT-PW01EGB5 ; file=D:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_execution_provider.cc ; line=245 ; expr=cudaSetDevice(info.device_id);

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\billipem\AIUI\kohya_ss\sd-scripts\finetune\tag_images_by_wd14_tagger.py", line 514, in <module>
    main(args)
  File "C:\Users\billipem\AIUI\kohya_ss\sd-scripts\finetune\tag_images_by_wd14_tagger.py", line 154, in main
    ort_sess = ort.InferenceSession(
  File "C:\Users\billipem\AIUI\kohya_ss\venv\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 432, in __init__
    raise fallback_error from e
  File "C:\Users\billipem\AIUI\kohya_ss\venv\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 427, in __init__
    self._create_inference_session(self._fallback_providers, None)
  File "C:\Users\billipem\AIUI\kohya_ss\venv\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
RuntimeError: D:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:121 onnxruntime::CudaCall D:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:114 onnxruntime::CudaCall CUDA failure 100: no CUDA-capable device is detected ; GPU=-727787712 ; hostname=RHIT-PW01EGB5 ; file=D:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_execution_provider.cc ; line=245 ; expr=cudaSetDevice(info.device_id);

Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\billipem\AIUI\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "C:\Users\billipem\AIUI\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\billipem\AIUI\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\Users\billipem\AIUI\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\billipem\AIUI\kohya_ss\venv\Scripts\python.exe', 'C:/Users/billipem/AIUI/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py', '--batch_size', '1', '--caption_extension', '.txt', '--caption_separator', ', ', '--debug', '--frequency_tags', '--general_threshold', '0.3', '--max_data_loader_n_workers', '2', '--onnx', '--remove_underscore', '--repo_id', 'SmilingWolf/wd-v1-4-convnextv2-tagger-v2', '--thresh', '0.4', 'C:/Users/billipem/AIUI/Training_Data/BSQ/New']' returned non-zero exit status 1.
23:33:22-474728 INFO     ...captioning done
```

CASE 2: WITHOUT ONNX

```
Traceback (most recent call last):
  File "C:\Users\billipem\AIUI\kohya_ss\sd-scripts\finetune\tag_images_by_wd14_tagger.py", line 514, in <module>
    main(args)
  File "C:\Users\billipem\AIUI\kohya_ss\sd-scripts\finetune\tag_images_by_wd14_tagger.py", line 165, in main
    model = load_model(f"{model_location}")
  File "C:\Users\billipem\AIUI\kohya_ss\venv\lib\site-packages\keras\src\saving\saving_api.py", line 193, in load_model
    raise ValueError(
ValueError: File format not supported: filepath=wd14_tagger_model\SmilingWolf_wd-v1-4-convnextv2-tagger-v2. Keras 3 only supports V3 `.keras` files and legacy H5 format files (`.h5` extension). Note that the legacy SavedModel format is not supported by `load_model()` in Keras 3. In order to reload a TensorFlow SavedModel as an inference-only layer in Keras 3, use `keras.layers.TFSMLayer(wd14_tagger_model\SmilingWolf_wd-v1-4-convnextv2-tagger-v2, call_endpoint='serving_default')` (note that your `call_endpoint` might have a different name).

Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\billipem\AIUI\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "C:\Users\billipem\AIUI\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\billipem\AIUI\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\Users\billipem\AIUI\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\billipem\AIUI\kohya_ss\venv\Scripts\python.exe', 'C:/Users/billipem/AIUI/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py', '--batch_size', '1', '--caption_extension', '.txt', '--caption_separator', ', ', '--debug', '--frequency_tags', '--general_threshold', '0.3', '--max_data_loader_n_workers', '2', '--remove_underscore', '--repo_id', 'SmilingWolf/wd-v1-4-convnextv2-tagger-v2', '--thresh', '0.4', 'C:/Users/billipem/AIUI/Training_Data/BSQ/New']' returned non-zero exit status 1.
23:44:54-058312 INFO     ...captioning done
```
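For what it's worth, the ValueError itself spells out a possible workaround: wrapping the legacy SavedModel folder in keras.layers.TFSMLayer. A minimal sketch of what that suggestion would look like (untested on my end, and the call_endpoint name is just the default the error mentions):

```python
# Sketch of the workaround the ValueError suggests: load the legacy
# TensorFlow SavedModel as an inference-only layer under Keras 3.
import keras

tagger = keras.layers.TFSMLayer(
    r"wd14_tagger_model\SmilingWolf_wd-v1-4-convnextv2-tagger-v2",
    call_endpoint="serving_default",  # the error notes this name may differ
)
```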

b-fission commented 5 months ago

> ... but I checked and made certain that my computer is CUDA-capable and has the latest CUDA installed. I've tried manually identifying my GPU in the setup and using a fresh install.

If you're using the latest CUDA toolkit, version 12, that will not work. The components that kohya_ss installs by default currently depend on CUDA toolkit version 11.8.
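A quick way to confirm what the venv actually sees is to run something like this from inside the kohya_ss venv (a minimal sketch; your version strings will vary):

```python
# Check which CUDA build torch and onnxruntime are using inside the venv.
import torch
import onnxruntime as ort

print(torch.__version__)              # e.g. "2.1.2+cu118" means a CUDA 11.8 build
print(torch.version.cuda)             # CUDA version torch was compiled against
print(torch.cuda.is_available())      # False here would explain the tagger failing
print(ort.get_available_providers())  # should include "CUDAExecutionProvider"
```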

Rcorthy commented 5 months ago

> > ... but I checked and made certain that my computer is CUDA-capable and has the latest CUDA installed. I've tried manually identifying my GPU in the setup and using a fresh install.
>
> If you're using the latest CUDA toolkit, version 12, that will not work. The components that kohya_ss installs by default currently depend on CUDA toolkit version 11.8.

I tried with 11.8 before trying the newest version, and I went ahead and tried it again just in case. I'm getting what appear to be the same errors. Could it be that my system is not powerful enough anymore? I'm running an Nvidia Quadro T1200 with 4GB of VRAM.

b-fission commented 5 months ago

I tried it on an older laptop with an Nvidia GTX 1060. That one is older than a Quadro T1200, and WD14 was able to run using onnx perfectly fine. Didn't even need to install the CUDA Toolkit.

Now I'm thinking that if your system was able to run it before, it should still be capable of running it. What's your Nvidia driver version? Is it reasonably recent too?
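If it helps, the installed driver version can be read straight from nvidia-smi; a small sketch (assumes nvidia-smi is on your PATH; the query flags are standard nvidia-smi options):

```python
# Print the Nvidia driver version and GPU name via nvidia-smi.
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # one "driver_version, name" line per GPU
```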

Rcorthy commented 5 months ago

> I tried it on an older laptop with an Nvidia GTX 1060. That one is older than a Quadro T1200, and WD14 was able to run using onnx perfectly fine. Didn't even need to install the CUDA Toolkit.
>
> Now I'm thinking that if your system was able to run it before, it should still be capable of running it. What's your Nvidia driver version? Is it reasonably recent too?

My driver is the most recent one for my card (version 555.85). If I remember right, updating my driver was one of the things I tried when I was first troubleshooting. I just double-checked my CUDA, though, and it's still on version 12.5. I'll have to try reverting to 11.8 again and see if it actually works this time.

Rcorthy commented 5 months ago

Tried again with CUDA 11.8 actually installed this time. Seems to be the same issue still.

b-fission commented 5 months ago

Does the gui startup log indicate what GPU was detected on your machine?

Mine looks like this:


```
INFO     Kohya_ss GUI version: v24.1.4
INFO     Submodule initialized and updated.
INFO     nVidia toolkit detected
INFO     Torch 2.1.2+cu118
INFO     Torch backend: nVidia CUDA 11.8 cuDNN 8700
INFO     Torch detected GPU: NVIDIA GeForce RTX 4090 VRAM 24215 Arch (8, 9) Cores 128
INFO     Python version is 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
INFO     Verifying modules installation status from /home/user/kohya_ss/requirements_linux.txt...
INFO     Verifying modules installation status from requirements.txt...
```

Rcorthy commented 5 months ago

> Does the gui startup log indicate what GPU was detected on your machine?
>
> Mine looks like this:
>
> ```
> INFO     Kohya_ss GUI version: v24.1.4
> INFO     Submodule initialized and updated.
> INFO     nVidia toolkit detected
> INFO     Torch 2.1.2+cu118
> INFO     Torch backend: nVidia CUDA 11.8 cuDNN 8700
> INFO     Torch detected GPU: NVIDIA GeForce RTX 4090 VRAM 24215 Arch (8, 9) Cores 128
> INFO     Python version is 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
> INFO     Verifying modules installation status from /home/user/kohya_ss/requirements_linux.txt...
> INFO     Verifying modules installation status from requirements.txt...
> ```

Looks like it is:

```
21:05:09-724719 INFO     Kohya_ss GUI version: v24.1.4
21:05:12-751116 INFO     Submodule initialized and updated.
21:05:12-766119 INFO     nVidia toolkit detected
21:05:18-048407 INFO     Torch 2.1.2+cu118
21:05:18-129426 INFO     Torch backend: nVidia CUDA 11.8 cuDNN 8700
21:05:18-133428 INFO     Torch detected GPU: NVIDIA T1200 Laptop GPU VRAM 4096 Arch (7, 5) Cores 16
21:05:18-141428 INFO     Python version is 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
21:05:18-144425 INFO     Verifying modules installation status from requirements_pytorch_windows.txt...
21:05:18-146425 INFO     Verifying modules installation status from requirements_windows.txt...
21:05:18-149435 INFO     Verifying modules installation status from requirements.txt...
21:05:35-483629 INFO     headless: False
21:05:35-534630 INFO     Using shell=True when running external commands...
```

b-fission commented 5 months ago

The log seems normal.

Next idea: can you look for the folder at `C:\Users\yourname\.cache\huggingface\accelerate` and see if there is a file called `default_config.yaml` in there? If that yaml file is there, just delete it (or move/rename/etc).
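If you'd rather script it than dig through Explorer, here's a small sketch that moves the file aside instead of deleting it (same path assumptions as above):

```python
# Move accelerate's default_config.yaml out of the way, keeping a backup.
from pathlib import Path

cfg = Path.home() / ".cache" / "huggingface" / "accelerate" / "default_config.yaml"
if cfg.exists():
    cfg.rename(cfg.with_name("default_config.yaml.bak"))  # rename, don't delete
    print(f"moved {cfg}")
else:
    print("no default_config.yaml found")
```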

Rcorthy commented 5 months ago

> The log seems normal.
>
> Next idea: can you look for the folder at `C:\Users\yourname\.cache\huggingface\accelerate` and see if there is a file called `default_config.yaml` in there? If that yaml file is there, just delete it (or move/rename/etc).

Deleted the file and tried again without onnx. I got the exact same error message I was getting before when not using onnx.

b-fission commented 5 months ago

What's the result for running it with onnx?

Rcorthy commented 5 months ago

Good sir, you are a saint. Running with onnx and deleting the .yaml worked. I really appreciate your help.

b-fission commented 5 months ago

Excellent