Linaqruf / kohya-trainer

Adapted from https://note.com/kohya_ss/n/nbf7ce8d80f29 for easier cloning
Apache License 2.0

Tensorcore errors and it will not run #37

Closed: DarkAlchy closed this issue 1 year ago

DarkAlchy commented 1 year ago

2023-01-26 19:54:55.306109: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-01-26 19:54:56.013115: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-01-26 19:54:56.013226: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-01-26 19:54:56.013246: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
found 0 images.
no metadata / メタデータファイルがありません: /content/drive/MyDrive/fine_tune/meta_clean.json
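For anyone hitting the same "found 0 images / no metadata" message, a quick sanity check of the metadata path can be run in a Colab cell. The path mirrors the one in the log above; the assumption that the file is a JSON object with one entry per training image is mine, not confirmed against the repo:

```python
# Hedged sanity check: confirm the metadata file the trainer is looking for
# exists and is non-empty. The path mirrors the log above; the "JSON object
# keyed by image" assumption is illustrative, not taken from the repo.
import json
from pathlib import Path

meta_path = Path("/content/drive/MyDrive/fine_tune/meta_clean.json")

if not meta_path.exists():
    print(f"{meta_path} is missing - run the captioning/metadata steps first")
else:
    entries = json.loads(meta_path.read_text(encoding="utf-8"))
    print(f"{meta_path} contains {len(entries)} entries")
```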

/content/drive/MyDrive/kohya-trainer/finetune
2023-01-26 20:01:45.449924: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-01-26 20:01:46.157549: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-01-26 20:01:46.157670: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-01-26 20:01:46.157697: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
load images from /content/drive/MyDrive/Aliens
found 4 images.
loading BLIP caption: https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth
Downloading (…)solve/main/vocab.txt: 100% 232k/232k [00:00<00:00, 263kB/s]
Downloading (…)okenizer_config.json: 100% 28.0/28.0 [00:00<00:00, 10.9kB/s]
Downloading (…)lve/main/config.json: 100% 570/570 [00:00<00:00, 212kB/s]
100% 1.66G/1.66G [01:14<00:00, 24.0MB/s]
load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth
BLIP loaded
0% 0/4 [00:00<?, ?it/s] convert image mode RGBA to RGB: /content/drive/MyDrive/Aliens/maxresdefault.png
25% 1/4 [00:00<00:02, 1.03it/s] convert image mode RGBA to RGB: /content/drive/MyDrive/Aliens/newsgeek.png
50% 2/4 [00:01<00:01, 1.03it/s] convert image mode RGBA to RGB: /content/drive/MyDrive/Aliens/FGddysgUYAMQ-sJ.png
100% 4/4 [00:03<00:00, 1.06it/s]
done!
2023-01-26 20:03:42.032881: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-01-26 20:03:42.958994: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-01-26 20:03:42.959110: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-01-26 20:03:42.959128: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
downloading wd14 tagger model from hf_hub
/usr/local/lib/python3.8/dist-packages/huggingface_hub/file_download.py:1020: FutureWarning: The force_filename parameter is deprecated as a new caching system, which keeps the filenames as they are on the Hub, is now in place.
  warnings.warn(
Downloading (…)"keras_metadata.pb";: 100% 328k/328k [00:00<00:00, 25.9MB/s]
Downloading (…)"saved_model.pb";: 100% 3.81M/3.81M [00:00<00:00, 134MB/s]
Downloading (…)in/selected_tags.csv: 100% 174k/174k [00:00<00:00, 259kB/s]
Downloading (…)ata-00000-of-00001";: 100% 365M/365M [00:01<00:00, 280MB/s]
Downloading (…)"variables.index";: 100% 13.8k/13.8k [00:00<00:00, 6.07MB/s]
found 4 images.
loading model and labels
2023-01-26 20:03:53.629195: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
WARNING:tensorflow:No training configuration found in save file, so the model was not compiled. Compile it manually.
WARNING:tensorflow:No training configuration found in save file, so the model was not compiled. Compile it manually.
100% 4/4 [00:00<00:00, 23.32it/s]
done!

Went ahead and tried to train, and got these (screenshot attached): [image]

It looks like something is wrong and it may be using the CPU, because this is taking longer than even a full Dreambooth training.

[image]

Linaqruf commented 1 year ago

No, it runs fine; there is nothing wrong with the TensorFlow notifications. There is a bug in Colab where it can't find the TensorRT path, but it doesn't affect training. The same goes for the bitsandbytes notification; it finds the path by itself automatically. "Detected Cuda Version 11.2" means you're running on the GPU, not the CPU.
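If you want to double-check from the runtime itself that the GPU is actually being used, a quick sketch you can run in a Colab cell (it assumes PyTorch is installed, as it is in this trainer's environment):

```python
# Quick check that the Colab runtime sees a CUDA GPU.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("CUDA runtime version:", torch.version.cuda)
```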

And why is it longer? Because you set train_batch_size to 4 and max_train_steps to 5000, which is the equivalent of 20k train steps. You can try lowering your train steps or your train batch size.

Also, try not to set max train steps too high if you only have 4 images; your model will overfit easily (the sketch below walks through the arithmetic).
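A rough sketch of that arithmetic, using the numbers from this thread (4 images, train_batch_size=4, max_train_steps=5000); the variable names are illustrative, not the trainer's:

```python
# Back-of-the-envelope view of how batch size and step count multiply,
# using the numbers mentioned in this thread. Names are illustrative only.
num_images = 4
train_batch_size = 4
max_train_steps = 5000

samples_seen = train_batch_size * max_train_steps    # 20,000 samples processed
passes_over_dataset = samples_seen / num_images      # 5,000 passes over 4 images

print(f"samples seen: {samples_seen}, passes over the dataset: {passes_over_dataset:.0f}")
```

With only 4 images, that is thousands of passes over the same pictures, which is why overfitting shows up so quickly.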

DarkAlchy commented 1 year ago

I tried 1 and 1000, but that didn't work either, and then my daily time limit on Colab expired. I will try this again tomorrow, but so far no luck with Colab.

BTW, I am used to TI/HN and DB, and I don't have to give them filewords, but with this I am forced to (the DB extension doesn't force that either). My first show-stopping error was because I hadn't run BLIP, so there were no filewords (captions). I hardly ever use captioning, and I try not to.

DarkAlchy commented 1 year ago

No, there is an issue after all.

modules.devices.NansException: A tensor with all NaNs was produced in Unet. This could be either because there's not enough precision to represent the picture, or because your video card does not support half type. Try setting the "Upcast cross attention layer to float32" option in Settings > Stable Diffusion or using the --no-half commandline argument to fix this. Use --disable-nan-check commandline argument to disable this check.
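The "not enough precision / half type" part of that message refers to float16 overflowing to inf (and then NaN) where float32 would not. A tiny illustration of that behaviour, unrelated to the repo's code:

```python
# float16 tops out around 65504, so values that fit comfortably in float32
# overflow to inf in half precision; inf then propagates to NaN.
import torch

x32 = torch.tensor([70000.0], dtype=torch.float32)
x16 = x32.to(torch.float16)

print(x32)        # tensor([70000.])
print(x16)        # tensor([inf], dtype=torch.float16)
print(x16 - x16)  # inf - inf -> tensor([nan], dtype=torch.float16)
```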

If that isn't bad enough, here is what I saw:

CUDA SETUP: Loading binary /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda112.so...
use 8-bit Adam optimizer
override steps. steps for 100 epochs is / 指定エポックまでのステップ数: 200
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 32
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 2
num epochs / epoch数: 100
batch size per device / バッチサイズ: 16
total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 16
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 200
steps: 0% 0/200 [00:00<?, ?it/s]
epoch 1/100 steps: 1% 2/200 [00:22<36:48, 11.16s/it, loss=0.406]
epoch 2/100 steps: 2% 4/200 [00:37<30:55, 9.47s/it, loss=0.401]
epoch 3/100 steps: 3% 6/200 [00:53<28:50, 8.92s/it, loss=0.389]
epoch 4/100 steps: 4% 8/200 [01:09<27:40, 8.65s/it, loss=0.39]
epoch 5/100 steps: 5% 10/200 [01:24<26:54, 8.50s/it, loss=0.389]
[... epochs 6-80 trimmed for readability: ~8.2 s/it throughout, loss fluctuating roughly between 0.36 and 0.46 ...]
epoch 81/100 steps: 81% 162/200 [22:09<05:11, 8.21s/it, loss=0.515]
epoch 82/100 steps: 82% 164/200 [22:25<04:55, 8.21s/it, loss=0.483]
epoch 83/100 steps: 83% 166/200 [22:42<04:38, 8.21s/it, loss=0.513]
epoch 84/100 steps: 84% 168/200 [22:58<04:22, 8.20s/it, loss=0.6]
epoch 85/100 steps: 85% 170/200 [23:14<04:06, 8.20s/it, loss=0.566]
epoch 86/100 steps: 86% 172/200 [23:30<03:49, 8.20s/it, loss=0.574]
epoch 87/100 steps: 87% 174/200 [23:47<03:33, 8.20s/it, loss=0.584]
epoch 88/100 steps: 88% 176/200 [24:03<03:16, 8.20s/it, loss=0.612]
epoch 89/100 steps: 89% 178/200 [24:19<03:00, 8.20s/it, loss=0.657]
epoch 90/100 steps: 90% 180/200 [24:35<02:43, 8.20s/it, loss=0.676]
epoch 91/100 steps: 91% 182/200 [24:52<02:27, 8.20s/it, loss=0.878]
epoch 92/100 steps: 92% 184/200 [25:08<02:11, 8.20s/it, loss=0.776]
epoch 93/100 steps: 93% 186/200 [25:24<01:54, 8.20s/it, loss=0.701]
epoch 94/100 steps: 94% 188/200 [25:40<01:38, 8.20s/it, loss=0.719]
epoch 95/100 steps: 95% 190/200 [25:56<01:21, 8.19s/it, loss=0.727]
epoch 96/100 steps: 96% 192/200 [26:12<01:05, 8.19s/it, loss=0.682]
epoch 97/100 steps: 97% 194/200 [26:29<00:49, 8.19s/it, loss=0.834]
epoch 98/100 steps: 98% 196/200 [26:44<00:32, 8.19s/it, loss=nan]
epoch 99/100 steps: 99% 198/200 [26:59<00:16, 8.18s/it, loss=nan]
epoch 100/100 steps: 100% 200/200 [27:15<00:00, 8.18s/it, loss=nan]
save trained model to /content/drive/MyDrive/fine_tune/output/Cy0n.safetensors
model saved.
steps: 100% 200/200 [27:16<00:00, 8.18s/it, loss=nan]
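For what it's worth, the header numbers in that log are internally consistent; a sketch of where the 200 steps come from, with illustrative variable names:

```python
# Derive the step count printed by the trainer from the other log values.
import math

num_train_images_times_repeats = 32  # "num train images * repeats" in the log
batch_size_per_device = 16           # "batch size per device" in the log
num_epochs = 100

batches_per_epoch = math.ceil(num_train_images_times_repeats / batch_size_per_device)  # 2
total_optimization_steps = batches_per_epoch * num_epochs                               # 200

print(batches_per_epoch, total_optimization_steps)
```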

45 minutes of fighting this and preparing it, plus the actual training, down the drain (to go along with the 5 hours from the other day). What is going on? Why did the loss keep going up like that until it hit NaN?
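For anyone hitting the same thing: a loss that drifts upward and then goes to NaN with 8-bit Adam and fp16 usually points at too high a learning rate for the batch size, or an fp16 overflow, rather than a data problem. One way to catch it early instead of discovering NaN at the end is a guard in the training loop; this is a generic sketch, not code from kohya-trainer:

```python
# Generic NaN/Inf guard for a PyTorch training loop: fail fast as soon as the
# loss stops being finite instead of silently saving a broken model.
# Sketch only, not part of kohya-trainer.
import torch

def check_finite(loss: torch.Tensor, step: int) -> None:
    if not torch.isfinite(loss).all():
        raise RuntimeError(
            f"non-finite loss {float(loss)} at step {step}; "
            "lower the learning rate, reduce the batch size, "
            "or disable fp16/8-bit Adam and retry"
        )

# usage inside a training loop (sketch):
#   loss = compute_loss(batch)
#   check_finite(loss, step)
```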

Linaqruf commented 1 year ago

batch size 16

Colab Pro? I think you need to compile new xformers, because mine only supports the T4, which is what free Colab is limited to. Or you can find precompiled ones at https://github.com/TheLastBen/fast-stable-diffusion/tree/main/precompiled

DarkAlchy commented 1 year ago

> batch size 16
>
> Colab Pro? I think you need to compile new xformers, because mine only supports the T4, which is what free Colab is limited to. Or you can find precompiled ones at https://github.com/TheLastBen/fast-stable-diffusion/tree/main/precompiled

No way am I paying for Colab, so it's the free tier. Where do I put the xformers, since it gets re-downloaded each time?
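Not an official answer, but one common pattern for the "it re-downloads every session" problem on free Colab is to keep the prebuilt wheel on Google Drive and install it from there at the start of each session. The Drive path and wheel filename below are hypothetical placeholders, not names from this repo:

```python
# Sketch: install a precompiled xformers wheel cached on Google Drive instead
# of re-downloading it every session. The path and filename are placeholders -
# use whichever wheel matches your runtime's GPU, Python, and torch versions.
import subprocess
import sys
from pathlib import Path

wheel = Path("/content/drive/MyDrive/wheels/xformers-0.0.16-cp38-cp38-linux_x86_64.whl")  # hypothetical

if wheel.exists():
    subprocess.run([sys.executable, "-m", "pip", "install", str(wheel)], check=True)
else:
    print(f"wheel not found at {wheel}; download it once and copy it to Drive")
```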