bmaltais / kohya_ss

Apache License 2.0

Attempting to register factory for plugin cuDNN/cuFFT/cuBLAS on Linux install #2263

Open dr460neye opened 7 months ago

dr460neye commented 7 months ago

Hi there,

I have now tried every feasible way to install the WebUI on a Linux server with multiple GPUs.

I identified some smaller issues:

When I run the usual commands to check the CUDA version and verify the installation, only the tensorflow and torch commands fail. This problem was fixed for Ubuntu in TensorFlow 2.16.

It seems the problem was identified for WSL users, but it still appears on other Ubuntu installations.

Server:

Errors:

accelerate launch --mixed_precision="fp16" --num_processes=1 --num_machines=1 --num_cpu_threads_per_process=2 "/home/excel/kohya_ss/sd-scripts/train_network.py" --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --huber_c="0.1" --huber_schedule="snr" --learning_rate="0.0001" --logging_dir="/home/excel/kohya_ss/logs" --loss_type="l2" --lr_scheduler="cosine" --lr_scheduler_num_cycles="1" --lr_warmup_steps="57" --max_data_loader_n_workers="0" --max_grad_norm="1" --resolution="512,512" --max_train_steps="570" --min_timestep=0 --mixed_precision="fp16" --network_alpha="1" --network_dim=8 --network_module=networks.lora --optimizer_type="AdamW8bit" --output_dir="/home/excel/kohya_ss/outputs/hedgeforest" --output_name="hedgeforest" --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="fp16" --text_encoder_lr=0.0001 --train_batch_size="1" --train_data_dir="/home/excel/bildersets/hedgehogs/images" --unet_lr=0.0001 --xformers

2024-04-11 16:08:53.335832: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-11 16:08:53.376158: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-11 16:08:53.376187: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-11 16:08:53.377607: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-11 16:08:53.384410: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-11 16:08:53.384622: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-11 16:08:54.317341: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

So I kindly request an upgrade to tensorflow 2.16, and that the "cuda" options for the pip package installation be added as the default in a requirements_nvidia.txt.

bmaltais commented 7 months ago

Not sure I totally get what changes you are asking for... there is no requirements_nvidia.txt file... so what actual requirements do you want to see in it?

bmaltais commented 7 months ago

Which requirements file needs the upgrade to tensorflow 2.16? How do you install the GUI? Do you use special parameters to specify the requirements file? I don't use Linux, so I am not familiar with how the current solution needs to be changed and updated to work properly on your platform...

dr460neye commented 7 months ago

I switched to: tensorboard==2.16.2 and tensorflow[and-cuda]==2.16.1. The tensorflow[and-cuda] package ensures that cuDNN etc. are installed. Version 2.16 is used to ensure that the error on Ubuntu WSL and Server is fixed.

As not every GPU is an NVIDIA one, I suggested that we add a requirements_nvidia.txt, which contains the cuda package by default, while other setups keep using the normal requirements_linux.txt.
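
As a rough sketch of how such a split could be wired up, a setup step might pick the requirements file depending on whether an NVIDIA driver tool is on the PATH. Note that requirements_nvidia.txt is only the file proposed here and does not exist in the repository; the function name pick_requirements_file and the nvidia-smi check are assumptions for illustration, not how setup.sh behaves today.

```python
import shutil


def pick_requirements_file(has_nvidia=None):
    """Return the name of the requirements file to install from.

    requirements_nvidia.txt is the file proposed in this thread (hypothetical);
    requirements_linux.txt is the existing default mentioned above.
    """
    if has_nvidia is None:
        # Assumption: nvidia-smi on the PATH is a good-enough hint
        # that an NVIDIA GPU and driver are present.
        has_nvidia = shutil.which("nvidia-smi") is not None
    return "requirements_nvidia.txt" if has_nvidia else "requirements_linux.txt"


if __name__ == "__main__":
    print(pick_requirements_file())
```

The optional has_nvidia argument is there so the choice can be forced by hand on machines where the detection guesses wrong.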

bmaltais commented 7 months ago

But how will you call this? Is setup.sh going to handle this as is? I think the best approach would be for you to create a pull request proposing all the code changes needed to make this work properly. That way I can merge it and others will be able to use it…

ja1496 commented 7 months ago

I run it by cloning kohya_ss on an Ubuntu system. Before running setup.sh, modify the requirements in both requirements_linux.txt and requirements_linux_docker.txt: tensorboard==2.16.2, tensorflow==2.16.1, and torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121. This solved the above issues. However, I also suspect the problem may have been caused by Secure Boot being enabled in the BIOS, which kept the NVIDIA driver on Ubuntu from working. Either way, I can now run normally with multiple GPUs and do distributed training with DeepSpeed.

ja1496 commented 7 months ago

Python 3.10.11, Ubuntu 22.04, NVIDIA driver 545

sirius422 commented 5 months ago

> I switched to: tensorboard==2.16.2 and tensorflow[and-cuda]==2.16.1. The tensorflow[and-cuda] package ensures that cuDNN etc. are installed. Version 2.16 is used to ensure that the error on Ubuntu WSL and Server is fixed.
>
> As not every GPU is an NVIDIA one, I suggested that we add a requirements_nvidia.txt, which contains the cuda package by default, while other setups keep using the normal requirements_linux.txt.

Installing tensorflow[and-cuda] and adding the little script from the TensorFlow issue to gui.sh does solve the problem.

Now I get [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')] when running python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))", and no errors occur.

As for TensorRT, you may check this: download and extract the tar file, set up LD_LIBRARY_PATH in gui.sh, and use a symlink if needed.

b-fission commented 5 months ago

kohya uses pytorch for GPU training, so any messages from tensorflow saying "unable to register ____ factory" or "could not find cuda drivers" can be ignored.

There's no practical use for installing a cuda-enabled build of tensorflow. It's only brought in as a dependency for tensorboard.
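
Since training runs on PyTorch, the check that matters is whether PyTorch itself sees the GPUs. A minimal sketch, assuming it is run inside the same virtual environment the GUI uses (the helper name torch_gpu_report is made up for illustration):

```python
import importlib.util


def torch_gpu_report():
    """Return a list of lines describing the GPUs PyTorch can see."""
    if importlib.util.find_spec("torch") is None:
        return ["torch is not installed in this environment"]

    import torch

    if not torch.cuda.is_available():
        return ["torch is installed, but CUDA is not available"]

    # One line per visible CUDA device.
    return [
        f"GPU {i}: {torch.cuda.get_device_name(i)}"
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    print("\n".join(torch_gpu_report()))
```

If this lists the expected devices, the tensorflow messages in the log should not affect training.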