Open dr460neye opened 7 months ago
Not sure I totally get what changes you are asking for... there is no requirements_nvidia.txt file... so what actual requirements do you want to see in it?
Which requirements file needs the upgrade to tensorflow 2.16? How do you install the GUI? Do you use special parameters to specify the requirements file? I don't use Linux, so I am not familiar with how the current solution needs to be changed to work properly on your platform...
I switched to: tensorboard==2.16.2 tensorflow[and-cuda]==2.16.1. The tensorflow[and-cuda] package ensures that cuDNN etc. are installed. Version 2.16 is used because it fixes the error on Ubuntu WSL and Server.
As not every GPU is an NVIDIA one, I suggested that we add a requirements_nvidia.txt, which contains the CUDA package by default, while other setups keep using the normal requirements_linux.txt.
But how will you call this? Is setup.sh going to handle this as is? I think the best approach would be for you to create a pull request proposing all the code changes needed to make this work properly. That way I can merge it and others will be able to use it…
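One way the file selection asked about here could work, as a minimal sketch only. It assumes that finding `nvidia-smi` on the PATH is a good enough signal for an NVIDIA setup; `pick_requirements_file` is a hypothetical helper, and requirements_nvidia.txt is the proposed file, which does not exist in the repository yet.

```python
import shutil


def pick_requirements_file(has_nvidia=None):
    """Return the name of the requirements file to install from.

    has_nvidia: pass True/False to force a choice; None means auto-detect.
    """
    if has_nvidia is None:
        # Assumption: nvidia-smi on PATH is used as a proxy for a working NVIDIA driver.
        has_nvidia = shutil.which("nvidia-smi") is not None
    # requirements_nvidia.txt is the file proposed in this thread (hypothetical for now).
    return "requirements_nvidia.txt" if has_nvidia else "requirements_linux.txt"


if __name__ == "__main__":
    print(pick_requirements_file())
```

A setup script could then pass the returned name to `pip install -r`. Whether setup.sh should do this detection itself or take an explicit flag is a design choice for the pull request.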
I run it by pulling kohya_ss onto the Ubuntu system. Before running setup.sh, please modify the requirements in both requirements_linux.txt and requirements_linux_docker.txt: tensorboard==2.16.2, tensorflow==2.16.1, and torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121. That solved the above issues. However, I also suspect that having Secure Boot enabled in the BIOS kept the NVIDIA driver from working on Ubuntu, which caused the above problem. Now I can train normally with multiple GPUs or run distributed training with DeepSpeed.
Python 3.10.11 Ubuntu 22.04 nvidia driver 545
Installing tensorflow[and-cuda] and adding this little script in the Tensorflow issue into gui.sh does solve the problem. Now I get [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')] when running python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))", and no errors occur.
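The same check can be run for both frameworks at once. This is a minimal sketch, assuming only that torch and tensorflow may or may not be installed in the active environment; `gpu_report` is a hypothetical helper name.

```python
import importlib


def gpu_report():
    """Return the number of GPUs each framework can see, or None if it cannot be imported."""
    report = {}
    try:
        torch = importlib.import_module("torch")
        report["torch"] = torch.cuda.device_count() if torch.cuda.is_available() else 0
    except Exception:
        # Not installed, or the install is broken (e.g. missing shared libraries).
        report["torch"] = None
    try:
        tf = importlib.import_module("tensorflow")
        report["tensorflow"] = len(tf.config.list_physical_devices("GPU"))
    except Exception:
        report["tensorflow"] = None
    return report


if __name__ == "__main__":
    for name, count in gpu_report().items():
        print(f"{name}: {'not importable' if count is None else count}")
```

If torch reports the GPUs but tensorflow reports 0, training should still be able to use them, since only the tensorflow side is affected.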
As for TensorRT, you may check this: download and extract the tar file, then set up LD_LIBRARY_PATH in gui.sh, using a symlink if needed.
kohya uses pytorch for GPU training, so any messages from tensorflow saying "unable to register ____ factory" or "could not find cuda drivers" can be ignored.
There's no practical use for installing a cuda-enabled build of tensorflow. It's only brought in as a dependency for tensorboard.
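If the messages are only noise, they can usually be filtered instead of fixed. This is a sketch, assuming the launcher is started from Python; `quiet_tf_env` is a hypothetical helper, while `TF_CPP_MIN_LOG_LEVEL` is TensorFlow's own environment variable for its C++ log threshold. Whether it hides every one of these messages may depend on the TensorFlow version.

```python
import os


def quiet_tf_env(base_env=None, level="3"):
    """Return a copy of the environment with TensorFlow's C++ logging turned down.

    level: "0" shows everything, "1" hides INFO, "2" also hides WARNING,
    "3" also hides ERROR.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Must be set before tensorflow is imported in the child process.
    env["TF_CPP_MIN_LOG_LEVEL"] = level
    return env


# Example use (not executed here):
#   import subprocess
#   subprocess.run(["accelerate", "launch", ...], env=quiet_tf_env())
```

This changes only what is printed; it does not affect whether PyTorch can use the GPUs.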
Hi there,
I have now tried every feasible way to install the WebUI on a Linux server with multiple GPUs.
There are some smaller issues identified:
When I use common commands for CUDA version checkup and installation verification, only tensorflow and torch commands fail. This problem was fixed for Ubuntu in Version 2.16.
It seems that it was detected for WSL users, but still appears on other Ubuntu installations.
Server:
Errors:
accelerate launch --mixed_precision="fp16" --num_processes=1 --num_machines=1 --num_cpu_threads_per_process=2 "/home/excel/kohya_ss/sd-scripts/train_network.py" --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --huber_c="0.1" --huber_schedule="snr" --learning_rate="0.0001" --logging_dir="/home/excel/kohya_ss/logs" --loss_type="l2" --lr_scheduler="cosine" --lr_scheduler_num_cycles="1" --lr_warmup_steps="57" --max_data_loader_n_workers="0" --max_grad_norm="1" --resolution="512,512" --max_train_steps="570" --min_timestep=0 --mixed_precision="fp16" --network_alpha="1" --network_dim=8 --network_module=networks.lora --optimizer_type="AdamW8bit" --output_dir="/home/excel/kohya_ss/outputs/hedgeforest" --output_name="hedgeforest" --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="fp16" --text_encoder_lr=0.0001 --train_batch_size="1" --train_data_dir="/home/excel/bildersets/hedgehogs/images" --unet_lr=0.0001 --xformers
2024-04-11 16:08:53.335832: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-11 16:08:53.376158: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-11 16:08:53.376187: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-11 16:08:53.377607: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-11 16:08:53.384410: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-11 16:08:53.384622: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-11 16:08:54.317341: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
So I kindly request an upgrade to tensorflow 2.16, and that the "cuda" option for the pip package installation be added as the default in a requirements_nvidia.txt.