ShivamShrirao / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch
https://huggingface.co/docs/diffusers
Apache License 2.0
1.89k stars 507 forks source link

CUDA Error #6

Open greendesertsnow opened 2 years ago

greendesertsnow commented 2 years ago

Describe the bug

Start Training step produces a CUBLAS_STATUS_INTERNAL_ERROR

Reproduction

Do the regular steps with --max_train_steps=1500

Logs

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:99: UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...
  f'{candidate_env_vars["LD_LIBRARY_PATH"]} did not contain '
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:21: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('"/usr/local/bin/dap_multiplexer","enableLsp"'), PosixPath('true}'), PosixPath('{"kernelManagerProxyPort"'), PosixPath('"172.28.0.3","jupyterArgs"'), PosixPath('6000,"kernelManagerProxyHost"'), PosixPath('["--ip=172.28.0.2"],"debugAdapterMultiplexerPath"')}
  "WARNING: The following directories listed in your path were found to "
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:21: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('module'), PosixPath('//ipykernel.pylab.backend_inline')}
  "WARNING: The following directories listed in your path were found to "
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:21: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
  "WARNING: The following directories listed in your path were found to "
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 111
CUDA SETUP: Loading binary /usr/local/lib/python3.7/dist-packages/bitsandbytes/libbitsandbytes_cuda111.so...
Caching latents: 100% 26/26 [00:02<00:00,  9.91it/s]
Steps:   0% 0/1500 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train_dreambooth.py", line 658, in <module>
    main()
  File "train_dreambooth.py", line 600, in main
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/utils/operations.py", line 507, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.7/dist-packages/torch/amp/autocast_mode.py", line 12, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/unet_2d_condition.py", line 309, in forward
    upsample_size=upsample_size,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/unet_blocks.py", line 1151, in forward
    hidden_states = attn(hidden_states, context=encoder_hidden_states)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py", line 154, in forward
    hidden_states = block(hidden_states, context=context)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py", line 205, in forward
    hidden_states = self.ff(self.norm3(hidden_states)) + hidden_states
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py", line 335, in forward
    return self.net(hidden_states)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
Steps:   0% 0/1500 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--use_auth_token', '--instance_data_dir=/content/data/fatihInput', '--class_data_dir=/content/data/person', '--output_dir=/content/drive/MyDrive/stable_diffusion_weights/fatihOutput', '--with_prior_preservation', '--prior_loss_weight=1.0', '--instance_prompt=fatih', '--class_prompt=person', '--seed=1337', '--resolution=512', '--center_crop', '--train_batch_size=1', '--mixed_precision=fp16', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=12', '--sample_batch_size=4', '--max_train_steps=1500']' returned non-zero exit status 1.

System Info

A100 Google Colab Pro

ShivamShrirao commented 2 years ago

Hi, I have had same error on some systems, I couldn't figure out why yet. But installing an older version of xformers with pip install git+https://github.com/facebookresearch/xformers@51dd119#egg=xformers fixed it for me.

dawah-wadah commented 2 years ago

Same error, I will be trying to compile from the older version, and see if I can't put up a PR with the new wheel up soon

GraysonSaysHi commented 2 years ago

Hi, I have had same error on some systems, I couldn't figure out why yet. But installing an older version of xformers with pip install git+https://github.com/facebookresearch/xformers@51dd119#egg=xformers fixed

Did it take a while to download?

geocine commented 2 years ago

Could be a good reference to help resolve the issue?

https://github.com/TheLastBen/fast-stable-diffusion/issues/27