RuntimeError: CUDA error: invalid argument

Describe the bug

When I run accelerate launch train_dreambooth.py, training gets past the "Caching latents" step, but immediately afterwards, with the step counter still at 0%, it fails with the CUDA error below, and I don't know why. The traceback ends in xformers' efficient_attention_backward_cutlass during the backward pass. The GPU does not appear to be out of memory according to nvidia-smi, and my PyTorch (1.12.1), cudatoolkit (11.6.0), and xformers (0.0.14.dev0) installs are up to date.

Here is the error:

(diffusers) D:\Videos\AI\Stable Diffusion\dreambooth-xformers>accelerate launch train_dreambooth.py --pretrained_model_name_or_path=./models/sd15 --instance_data_dir=data/%name%/images --output_dir=data/%name%/model --instance_prompt="%instance_prompt%" --resolution=512 --train_batch_size=1 --gradient_accumulation_steps=1 --gradient_checkpointing --learning_rate=1e-6 --lr_scheduler="constant" --lr_warmup_steps=0 --max_train_steps=%steps% --save_interval=100 --save_sample_prompt="%instance_prompt%"
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_cpu_threads_per_process` was set to `8` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Caching latents: 100%|███████████████████████████████████████████████████████████████████| 5/5 [00:08<00:00,  1.68s/it]
Steps:   0%|                                                                                   | 0/800 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train_dreambooth.py", line 811, in <module>
    main(args)
  File "train_dreambooth.py", line 775, in main
    accelerator.backward(loss)
  File "C:\Users\moish\.conda\envs\diffusers\lib\site-packages\accelerate\accelerator.py", line 884, in backward
    loss.backward(**kwargs)
  File "C:\Users\moish\.conda\envs\diffusers\lib\site-packages\torch\_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "C:\Users\moish\.conda\envs\diffusers\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "C:\Users\moish\.conda\envs\diffusers\lib\site-packages\torch\autograd\function.py", line 253, in apply
    return user_fn(self, *args)
  File "C:\Users\moish\.conda\envs\diffusers\lib\site-packages\torch\utils\checkpoint.py", line 146, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "C:\Users\moish\.conda\envs\diffusers\lib\site-packages\torch\autograd\__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "C:\Users\moish\.conda\envs\diffusers\lib\site-packages\torch\autograd\function.py", line 253, in apply
    return user_fn(self, *args)
  File "C:\Users\moish\.conda\envs\diffusers\lib\site-packages\xformers\ops.py", line 369, in backward
    ) = torch.ops.xformers.efficient_attention_backward_cutlass(
  File "C:\Users\moish\.conda\envs\diffusers\lib\site-packages\torch\_ops.py", line 143, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: invalid argument
Steps:   0%|                                                                                   | 0/800 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\moish\.conda\envs\diffusers\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\moish\.conda\envs\diffusers\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\moish\.conda\envs\diffusers\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\moish\.conda\envs\diffusers\lib\site-packages\accelerate\commands\accelerate_cli.py", line 43, in main
    args.func(args)
  File "C:\Users\moish\.conda\envs\diffusers\lib\site-packages\accelerate\commands\launch.py", line 837, in launch_command
    simple_launcher(args)
  File "C:\Users\moish\.conda\envs\diffusers\lib\site-packages\accelerate\commands\launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\moish\\.conda\\envs\\diffusers\\python.exe', 'train_dreambooth.py', '--pretrained_model_name_or_path=./models/sd15', '--instance_data_dir=data/cutedog/images', '--output_dir=data/cutedog/model', '--instance_prompt=a photo of cutedog dog', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--learning_rate=1e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=800', '--save_interval=100', '--save_sample_prompt=a photo of cutedog dog']' returned non-zero exit status 1

Reproduction

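Run accelerate launch train_dreambooth.py with the arguments shown in the command above. The error is raised on the first training step, right after latent caching finishes.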
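
CUDA errors are reported asynchronously, so the frame shown in the traceback may not be the exact call that failed. A minimal sketch for preparing a rerun with synchronous kernel launches, assuming the standard CUDA_LAUNCH_BLOCKING environment variable is honored by this setup; build_debug_run is a hypothetical helper name, not part of accelerate:

```python
import os
import sys


def build_debug_run(script_args, base_env=None):
    """Return (cmd, env) for rerunning the training script with synchronous CUDA.

    build_debug_run is a hypothetical helper for illustration only.
    """
    # Copy the environment so the current process is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # Synchronous kernel launches make the Python traceback point at the
    # call that actually failed, at the cost of slower execution.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    cmd = ["accelerate", "launch", "train_dreambooth.py", *script_args]
    return cmd, env


if __name__ == "__main__":
    # Pass the same training flags used in the failing command.
    cmd, env = build_debug_run(sys.argv[1:])
    print("CUDA_LAUNCH_BLOCKING =", env["CUDA_LAUNCH_BLOCKING"])
    print(" ".join(cmd))
    # To launch, hand both values to subprocess.run(cmd, env=env).
```

The helper only builds the command and environment; it does not start training by itself, so it is safe to run while checking the flags.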

Logs

No response

System Info

Windows 11 Pro, RTX 3080 Mobile 16 GB, Python 3.8.13

conda env export

name: diffusers
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - blas=2.116=mkl
  - blas-devel=3.9.0=16_win64_mkl
  - brotlipy=0.7.0=py38h294d835_1004
  - ca-certificates=2022.9.24=h5b45459_0
  - certifi=2022.9.24=pyhd8ed1ab_0
  - cffi=1.15.1=py38hd8c33c5_0
  - charset-normalizer=2.1.1=pyhd8ed1ab_0
  - cryptography=37.0.4=py38hb7941b4_0
  - cudatoolkit=11.6.0=hc0ea762_10
  - freetype=2.12.1=h546665d_0
  - idna=3.4=pyhd8ed1ab_0
  - intel-openmp=2022.1.0=h57928b3_3787
  - jpeg=9e=h8ffe710_2
  - lcms2=2.12=h2a16943_0
  - lerc=3.0=h0e60522_0
  - libblas=3.9.0=16_win64_mkl
  - libcblas=3.9.0=16_win64_mkl
  - libdeflate=1.12=h8ffe710_0
  - liblapack=3.9.0=16_win64_mkl
  - liblapacke=3.9.0=16_win64_mkl
  - libpng=1.6.37=h1d00b33_4
  - libtiff=4.4.0=h2ed3b44_1
  - libuv=1.44.2=h8ffe710_0
  - libwebp-base=1.2.4=h8ffe710_0
  - libxcb=1.13=hcd874cb_1004
  - libzlib=1.2.12=h8ffe710_2
  - lz4-c=1.9.3=h8ffe710_1
  - m2w64-gcc-libgfortran=5.3.0=6
  - m2w64-gcc-libs=5.3.0=7
  - m2w64-gcc-libs-core=5.3.0=7
  - m2w64-gmp=6.1.0=2
  - m2w64-libwinpthread-git=5.0.0.4634.697f757=2
  - mkl=2022.1.0=h6a75c08_874
  - mkl-devel=2022.1.0=h57928b3_875
  - mkl-include=2022.1.0=h6a75c08_874
  - msys2-conda-epoch=20160418=1
  - numpy=1.23.2=py38h223ccf5_0
  - openjpeg=2.5.0=hc9384bd_1
  - openssl=1.1.1q=h8ffe710_0
  - pillow=9.2.0=py38h37aa274_2
  - pip=22.2.2=py38haa95532_0
  - pthread-stubs=0.4=hcd874cb_1001
  - pycparser=2.21=pyhd8ed1ab_0
  - pyopenssl=22.0.0=pyhd8ed1ab_1
  - pysocks=1.7.1=pyh0701188_6
  - python=3.8.13=h6244533_1
  - python_abi=3.8=2_cp38
  - pytorch=1.12.1=py3.8_cuda11.6_cudnn8_0
  - pytorch-mutex=1.0=cuda
  - requests=2.28.1=pyhd8ed1ab_1
  - sqlite=3.39.3=h2bbff1b_0
  - tbb=2021.5.0=h2d74725_1
  - tk=8.6.12=h8ffe710_0
  - torchvision=0.13.1=py38_cu116
  - typing_extensions=4.4.0=pyha770c72_0
  - urllib3=1.26.11=pyhd8ed1ab_0
  - vc=14.2=h21ff451_1
  - vs2015_runtime=14.27.29016=h5e58377_2
  - wheel=0.37.1=pyhd3eb1b0_0
  - win_inet_pton=1.1.0=pyhd8ed1ab_6
  - wincertstore=0.2=py38haa95532_2
  - xorg-libxau=1.0.9=hcd874cb_0
  - xorg-libxdmcp=1.1.3=hcd874cb_0
  - xz=5.2.6=h8d14728_0
  - zstd=1.5.2=h6255e5f_4
  - pip:
    - absl-py==1.3.0
    - accelerate==0.12.0
    - aiohttp==3.8.3
    - aiosignal==1.2.0
    - antlr4-python3-runtime==4.9.3
    - async-timeout==4.0.2
    - attrs==22.1.0
    - bitsandbytes==0.35.4
    - cachetools==5.2.0
    - colorama==0.4.6
    - diffusers==0.7.0.dev0
    - filelock==3.8.0
    - fire==0.4.0
    - frozenlist==1.3.1
    - fsspec==2022.10.0
    - ftfy==6.1.1
    - google-auth==2.14.0
    - google-auth-oauthlib==0.4.6
    - grpcio==1.50.0
    - huggingface-hub==0.10.1
    - importlib-metadata==5.0.0
    - jinja2==3.1.2
    - lightning-lite==1.8.0.post1
    - lightning-utilities==0.3.0
    - markdown==3.4.1
    - markupsafe==2.1.1
    - modelcards==0.1.6
    - multidict==6.0.2
    - mypy-extensions==0.4.3
    - oauthlib==3.2.2
    - omegaconf==2.2.3
    - packaging==21.3
    - protobuf==3.19.6
    - psutil==5.9.3
    - pyasn1==0.4.8
    - pyasn1-modules==0.2.8
    - pyparsing==3.0.9
    - pyre-extensions==0.0.23
    - pytorch-lightning==1.8.0.post1
    - pyyaml==6.0
    - regex==2022.10.31
    - requests-oauthlib==1.3.1
    - rsa==4.9
    - setuptools==59.5.0
    - six==1.16.0
    - tensorboard==2.10.1
    - tensorboard-data-server==0.6.1
    - tensorboard-plugin-wit==1.8.1
    - termcolor==2.1.0
    - tokenizers==0.12.1
    - torchmetrics==0.10.2
    - tqdm==4.64.1
    - transformers==4.21.0
    - typing-inspect==0.8.0
    - wcwidth==0.2.5
    - werkzeug==2.2.2
    - xformers==0.0.14.dev0
    - yarl==1.8.1
    - zipp==3.10.0

diffusers-cli env

- `diffusers` version: 0.7.0.dev0
- Platform: Windows-10-10.0.22000-SP0
- Python version: 3.8.13
- PyTorch version (GPU?): 1.12.1 (True)
- Huggingface_hub version: 0.10.1
- Transformers version: 4.21.0
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
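
The reports above do not show the CUDA build PyTorch was compiled against or the GPU's compute capability, both of which may matter for the xformers kernel named in the traceback. A minimal sketch for collecting them, assuming the packages expose `__version__` in the usual way; collect_env and collect_gpu are hypothetical helper names:

```python
import importlib
import platform


def collect_env(packages=("torch", "xformers", "diffusers", "accelerate")):
    """Collect package versions; a package that cannot be imported maps to None."""
    info = {"python": platform.python_version(), "platform": platform.platform()}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Covers a missing package as well as a failed native library load.
            info[name] = None
        else:
            info[name] = getattr(module, "__version__", "unknown")
    return info


def collect_gpu():
    """Report PyTorch's CUDA build and, if a GPU is visible, its capability."""
    try:
        import torch
    except ImportError:
        return {"torch": None}
    gpu = {
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if gpu["cuda_available"]:
        gpu["device"] = torch.cuda.get_device_name(0)
        gpu["compute_capability"] = torch.cuda.get_device_capability(0)
    return gpu


if __name__ == "__main__":
    for key, value in {**collect_env(), **collect_gpu()}.items():
        print(f"{key}: {value}")
```

Run it inside the same conda environment as the training script so the versions match what the failing run used.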

nvidia-smi

NVIDIA-SMI 522.30   Driver Version: 522.30   CUDA Version: 11.8
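
The CUDA version in the nvidia-smi header is the newest runtime the driver supports, not the runtime that is installed. A small sketch comparing it with the cudatoolkit version from the conda export above; parse_smi_header and runtime_supported are hypothetical helper names, and the header string is copied from this report:

```python
import re


def parse_smi_header(header):
    """Extract the driver version and driver-supported CUDA version."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", header)
    cuda = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", header)
    if not driver or not cuda:
        raise ValueError("unrecognized nvidia-smi header")
    return driver.group(1), (int(cuda.group(1)), int(cuda.group(2)))


def runtime_supported(toolkit, driver_cuda):
    """True if the installed CUDA runtime is no newer than the driver supports."""
    return toolkit <= driver_cuda


header = "NVIDIA-SMI 522.30 Driver Version: 522.30 CUDA Version: 11.8"
driver, driver_cuda = parse_smi_header(header)
toolkit = (11, 6)  # from cudatoolkit=11.6.0 in the conda export
print(driver, driver_cuda, runtime_supported(toolkit, driver_cuda))
# prints: 522.30 (11, 8) True
```

The check passes here, so a driver that is too old for the 11.6 runtime is probably not the cause of the error.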

ShivamShrirao / diffusers