bmaltais / kohya_ss


Linux and AMD Radeon RX 7900XTX: RuntimeError: "addmm_impl_cpu_" not implemented for 'Half' | Is it possible to enable the stable-diffusion-webui equivalent to --precision full --no-half? #1793

Closed danielaixer closed 10 months ago

danielaixer commented 10 months ago

I'm on Ubuntu 22.04 with a 7900XTX GPU, ROCm 5.6 and Mesa drivers. I can generate images using the GPU via stable-diffusion-webui.

I have installed kohya_ss with these commands:

git clone https://github.com/bmaltais/kohya_ss.git 
cd kohya_ss
python -m venv venv
source venv/bin/activate
pip install --use-pep517 --upgrade -r requirements.txt
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.6
accelerate config
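
As a sanity check (my own addition, not part of the setup instructions), it can be worth confirming from inside the venv that the ROCm build of PyTorch actually sees the card before launching the GUI:

export HSA_OVERRIDE_GFX_VERSION=11.0.0
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# should print something like 2.x.x+rocm5.6 True if the GPU is detected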

And I start the GUI with:

export HSA_OVERRIDE_GFX_VERSION=11.0.0
source venv/bin/activate
python kohya_gui.py --server_port 7863 --listen 0.0.0.0

I'm trying to train a LoRA model using the AdamW optimizer and with CrossAttention set to none. These parameters help me avoid bitsandbytes and xFormers errors, but just when it seems to be working and reaches the optimization steps, I get this error:

  File "/home/username/kohya_ss/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

And at the end of the terminal this:

subprocess.CalledProcessError: Command '['/home/username/kohya_ss/kohya_ss/venv/bin/python', './train_network.py', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--train_data_dir=/home/username/kohya_ss/kohya_ss/datasets/Something', '--resolution=512,512', '--output_dir=/home/username/kohya_ss/kohya_ss/models/Lora/Custom', '--network_alpha=48', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-05', '--unet_lr=0.0001', '--network_dim=96', '--output_name=Something2', '--lr_scheduler_num_cycles=1', '--no_half_vae', '--learning_rate=0.0001', '--lr_scheduler=cosine', '--lr_warmup_steps=20', '--train_batch_size=4', '--max_train_steps=200', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--optimizer_type=AdamW', '--max_grad_norm=1', '--max_data_loader_n_workers=0', '--bucket_reso_steps=64', '--save_every_n_steps=500', '--bucket_no_upscale', '--noise_offset=0.0', '--sample_sampler=euler', '--sample_prompts=/home/username/kohya_ss/kohya_ss/models/Lora/Custom/sample/prompt.txt', '--sample_every_n_steps=25']' returned non-zero exit status 1.
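
For what it's worth, the error message itself just means a half-precision matmul got dispatched to the CPU, where these PyTorch builds have no fp16 addmm kernel. A minimal sketch (mine, not taken from the training code) reproduces the same message:

import torch

# fp16 Linear layer and fp16 input left on the CPU: the forward pass calls
# F.linear -> addmm, which has no half-precision CPU kernel on these builds.
layer = torch.nn.Linear(4, 4).half()
x = torch.randn(1, 4, dtype=torch.float16)
try:
    layer(x)
except RuntimeError as e:
    print(e)  # "addmm_impl_cpu_" not implemented for 'Half'

So somewhere in the run, fp16 tensors are ending up on the CPU instead of the GPU.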

Based on similar errors mentioning 'Half', I'm pretty sure we need the equivalent of using --precision full --no-half when launching AUTOMATIC1111/stable-diffusion-webui.
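
train_network.py doesn't take --precision or --no-half; as far as I can tell the closest equivalents are its own precision flags (my reading, not an exact mapping):

--mixed_precision=no   # roughly "--precision full": keep the training math in fp32
--no_half_vae          # already in my command; only covers the VAE, unlike "--no-half"
--save_precision=fp16  # only affects the saved file, not the training math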

The method shown in https://github.com/bmaltais/kohya_ss/issues/1484 doesn't improve the situation for me, including installing the PyTorch ROCm 5.7 nightly: pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/rocm5.7

Edit: When running "accelerate config", choosing "no" for the question "Do you wish to use FP16 or BF16 (mixed precision)?" didn't help.
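
In case it helps anyone compare, the answers from "accelerate config" end up in ~/.cache/huggingface/accelerate/default_config.yaml; the relevant part of mine looks roughly like this (other keys omitted):

compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
mixed_precision: 'no'
use_cpu: false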

Edit: Setting "Mixed precision" to "no" seems to be working; I will update once I confirm I can do a complete LoRA training.

danielaixer commented 10 months ago

Okay, confirmed, "Mixed precision" set to "no" works. Regarding "accelerate config", I think it doesn't really matter which mixed precision you choose.
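
Concretely, with "Mixed precision" set to "no", the generated train_network.py command is the same as the one I posted above except (as far as I can tell) for this flag:

--mixed_precision=no   # instead of --mixed_precision=fp16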

Also, do NOT use AdamW8bit as the optimizer (bitsandbytes issue); use AdamW instead, and set "CrossAttention" to "none" (xFormers issue).

However, I still can't generate sample images or captions with kohya_ss, but those issues are secondary.