bmaltais / kohya_ss

Apache License 2.0
9.51k stars 1.23k forks

Subfolders name issue in Linux pod #406

Closed Norian11 closed 1 year ago

Norian11 commented 1 year ago

I'm in a Linux pod and launched the UI, but when I click Train it always gives me an error saying the dataset folder name is wrong, even though everything looks normal. Does anyone know why this happens? My dataset subfolder is 15_sks, but it produces the following error:

File "/workspace/kohya_ss/lora_gui.py", line 407, in train_model
    repeats = int(folder.split('_')[0])
ValueError: invalid literal for int() with base 10: '.ipynb'

I think it should be okay to remove that line and write "repeats = 5" instead. I'm not sure whether that would cause other errors, but I will try.

By the way, I was looking in the code for the part that defines the instance token and class token based on the subfolder name, and I couldn't find it. Does the name not matter anymore, or does someone know where I can find that part of the code?

bmaltais commented 1 year ago

Can you paste the full command being run? This looks like the subfolder name is not being read properly...

Try adding print(folder) right before line 407 to print out what it is trying to parse... might help troubleshoot...
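A minimal sketch of that debugging step, assuming the GUI parses the leading number of each subfolder name as the repeat count. `parse_repeats` is a hypothetical helper written for illustration; it is not the function in lora_gui.py.

```python
def parse_repeats(folder: str) -> int:
    """Return the leading repeat count from a folder name such as '15_sks'."""
    # repr() makes hidden or oddly named entries easy to spot in the log.
    print(f"parsing folder name: {folder!r}")
    prefix = folder.split("_")[0]
    try:
        return int(prefix)
    except ValueError as exc:
        # Re-raise with the full folder name so the culprit is obvious.
        raise ValueError(
            f"folder {folder!r} does not start with '<repeats>_'"
        ) from exc
```

Calling `parse_repeats("15_sks")` returns 15, while a name without a numeric prefix raises a ValueError that names the offending folder.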

Norian11 commented 1 year ago

OK, I will do that. I just need two hours to get to my computer, and then I will send the full error.

bmaltais commented 1 year ago

Hopefully that print(folder) line will shed some light on why it is failing to extract the repeat value from the subfolder name.

Norian11 commented 1 year ago

Hi, a question: with print(folder), do you mean that I have to put the path of a specific folder there, or just leave it like that?

Here is the longer traceback of that issue:

Folder 15_sks: 30 steps
Traceback (most recent call last):
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/routes.py", line 384, in run_predict
    output = await app.get_blocks().process_api(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1024, in process_api
    result = await self.call_function(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/blocks.py", line 836, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/workspace/kohya_ss/dreambooth_gui.py", line 336, in train_model
    repeats = int(folder.split('_')[0])
ValueError: invalid literal for int() with base 10: '.ipynb'

Norian11 commented 1 year ago

Well, I tried just writing "repeats = 5", but it generates a ton of other errors. I suspect this can't be run on services like vast.ai that use a Jupyter UI. I'm going to try again with a Linux desktop template that has a normal desktop interface.

Folder 15_sks: 2 images found
Folder 15_sks: 10 steps
Folder .ipynb_checkpoints: 0 images found
Folder .ipynb_checkpoints: 0 steps
max_train_steps = 10
stop_text_encoder_training = 0
lr_warmup_steps = 1
accelerate launch --num_cpu_threads_per_process=2 "train_network.py" --enable_bucket --pretrained_model_name_or_path="/workspace/Lora/pretrained/Deliberate.safetensors" --train_data_dir="/workspace/Lora/data" --resolution=512,512 --output_dir="/workspace/Lora/output" --logging_dir="/workspace/Lora/log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=5e-5 --unet_lr=0.0001 --network_dim=8 --output_name="last" --lr_scheduler_num_cycles="1" --learning_rate="0.0001" --lr_scheduler="cosine" --lr_warmup_steps="1" --train_batch_size="1" --max_train_steps="10" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --cache_latents --optimizer_type="AdamW" --bucket_reso_steps=64 --xformers --bucket_no_upscale
2023-03-20 15:58:05.138133: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-20 15:58:05.294123: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-03-20 15:58:05.895761: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-03-20 15:58:05.895830: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-03-20 15:58:05.895841: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-03-20 15:58:07.883114: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-20 15:58:08.044271: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-03-20 15:58:08.627422: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-03-20 15:58:08.627489: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-03-20 15:58:08.627501: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /workspace/kohya_ss/venv/lib/python3.10/site-packages/xformers/_C.so)
WARNING:root:WARNING: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /workspace/kohya_ss/venv/lib/python3.10/site-packages/xformers/_C.so)
Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop
Traceback (most recent call last):
  File "/workspace/kohya_ss/train_network.py", line 16, in <module>
    import library.train_util as train_util
  File "/workspace/kohya_ss/library/train_util.py", line 39, in <module>
    import albumentations as albu
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/albumentations/__init__.py", line 5, in <module>
    from .augmentations import *
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/albumentations/augmentations/__init__.py", line 2, in <module>
    from .blur.functional import *
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/albumentations/augmentations/blur/__init__.py", line 1, in <module>
    from .functional import *
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/albumentations/augmentations/blur/functional.py", line 5, in <module>
    import cv2
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/cv2/__init__.py", line 181, in <module>
    bootstrap()
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/cv2/__init__.py", line 153, in bootstrap
    native_module = importlib.import_module("cv2")
  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/workspace/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/workspace/kohya_ss/venv/bin/python3', 'train_network.py', '--enable_bucket', '--pretrained_model_name_or_path=/workspace/Lora/pretrained/Deliberate.safetensors', '--train_data_dir=/workspace/Lora/data', '--resolution=512,512', '--output_dir=/workspace/Lora/output', '--logging_dir=/workspace/Lora/log', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-5', '--unet_lr=0.0001', '--network_dim=8', '--output_name=last', '--lr_scheduler_num_cycles=1', '--learning_rate=0.0001', '--lr_scheduler=cosine', '--lr_warmup_steps=1', '--train_batch_size=1', '--max_train_steps=10', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--cache_latents', '--optimizer_type=AdamW', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 1.
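The step counts in the two logs look consistent with a per-folder rule of steps = images × repeats: 2 images at 15 repeats would give the 30 steps of the first run, and 2 images with repeats hard-coded to 5 give the 10 steps here. That rule is inferred from the numbers above, not taken from the GUI source, and it assumes the first run saw the same 2 images; `folder_steps` is a hypothetical name.

```python
def folder_steps(num_images: int, repeats: int) -> int:
    """Steps contributed by one dataset folder (assumed rule: images x repeats)."""
    return num_images * repeats


print(folder_steps(2, 15))  # 30, matching "Folder 15_sks: 30 steps"
print(folder_steps(2, 5))   # 10, matching the run with repeats hard-coded to 5
```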

adrianlungu commented 1 year ago

Just ran into this issue as well; as it turns out, there was a folder called .ipynb_checkpoints in the image folder.

Using ls -la you can check if there are any hidden folders.

This is what was making repeats = int(folder.split('_')[0]) fail: for .ipynb_checkpoints the part before the first underscore is '.ipynb', which is not a number.

Maybe it's a good idea to ignore hidden folders? Some systems generate hidden folders for various reasons, and those should probably be skipped when deducing the number of repeats.
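One way the suggestion could look, as a sketch only: skip any entry whose name starts with a dot before parsing the repeat count. `iter_dataset_folders` is a hypothetical helper, not the code that was eventually merged, and silently skipping folders without a numeric prefix is a design choice made here for illustration.

```python
import os


def iter_dataset_folders(train_data_dir: str):
    """Yield (folder_name, repeats) for visible '<repeats>_<name>' subfolders."""
    for entry in sorted(os.scandir(train_data_dir), key=lambda e: e.name):
        if not entry.is_dir():
            continue  # plain files are not dataset folders
        if entry.name.startswith("."):
            continue  # hidden folders such as .ipynb_checkpoints
        prefix = entry.name.split("_")[0]
        if not prefix.isdecimal():
            continue  # no numeric repeat prefix, so not a dataset folder
        yield entry.name, int(prefix)
```

With a data directory containing 15_sks and .ipynb_checkpoints, only ("15_sks", 15) is yielded.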

adrianlungu commented 1 year ago

@Norian11 as for your second error, which I also ran into afterwards: it seems some dependencies are missing in some Docker containers, as suggested by the line ImportError: libGL.so.1: cannot open shared object file: No such file or directory.

I ran apt-get update && apt-get install ffmpeg libsm6 libxext6 -y inside the container via ssh and then got past that error.
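A quick way to check for this before launching training, sketched with the standard-library ctypes module. `can_load_shared_library` is a hypothetical helper; it only reports whether the dynamic loader can open the library and does not install anything.

```python
import ctypes


def can_load_shared_library(name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        return False
    return True


if not can_load_shared_library("libGL.so.1"):
    print("libGL.so.1 could not be loaded; importing cv2 will likely fail")
```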

bmaltais commented 1 year ago

> Just ran into this issue as well; as it turns out, there was a folder called .ipynb_checkpoints in the image folder.
>
> Using ls -la you can check if there are any hidden folders.
>
> This is what was making repeats = int(folder.split('_')[0]) fail, as it does not yield what it's expecting.
>
> Maybe it's a good idea to ignore hidden folders? Some systems generate hidden folders for various reasons, and those should probably be skipped when deducing the number of repeats.

This is a good idea... I will add a check for hidden folders and ignore them... I bet someone will eventually complain, but I think more users will benefit from it ;-)

bmaltais commented 1 year ago

I have pushed the fix to the dev branch.

adrianlungu commented 1 year ago

There is still an issue on my instance that I haven't been able to get past yet, but I'm going to open another issue since it's unrelated to this one.

Norian11 commented 1 year ago

I haven't tried it yet. I will launch it today, but from the comments it seems solved. Thank you so much, guys, for the solution. Now we can finally run it without paying that much for Colab. Thanks for your work!

ohminy commented 1 year ago

> Just ran into this issue as well; as it turns out, there was a folder called .ipynb_checkpoints in the image folder. Using ls -la you can check if there are any hidden folders. This is what was making repeats = int(folder.split('_')[0]) fail, as it does not yield what it's expecting. Maybe it's a good idea to ignore hidden folders? Some systems generate hidden folders for various reasons, and those should probably be skipped when deducing the number of repeats.
>
> This is a good idea... I will add a check for hidden folders and ignore them... I bet someone will eventually complain, but I think more users will benefit from it ;-)

How can I ignore hidden folders? Did you find the right way?

adrianlungu commented 1 year ago

@ohminy this was already updated by @bmaltais over here: https://github.com/bmaltais/kohya_ss/pull/424

In the meantime, I personally just deleted the hidden folders.
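For anyone on a version without that fix, the manual cleanup can be scripted. This is a sketch under the assumption that the only unwanted hidden folders are Jupyter's .ipynb_checkpoints; `remove_ipynb_checkpoints` is a hypothetical helper, and it deletes directories permanently, so point it at the right path.

```python
import os
import shutil


def remove_ipynb_checkpoints(train_data_dir: str) -> list[str]:
    """Delete every .ipynb_checkpoints folder under train_data_dir.

    Returns the list of paths that were removed.
    """
    removed = []
    for root, dirs, _files in os.walk(train_data_dir):
        if ".ipynb_checkpoints" in dirs:
            target = os.path.join(root, ".ipynb_checkpoints")
            shutil.rmtree(target)
            # Stop os.walk from descending into the folder we just deleted.
            dirs.remove(".ipynb_checkpoints")
            removed.append(target)
    return removed
```

For the setup in this thread, the call would be `remove_ipynb_checkpoints("/workspace/Lora/data")`.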