ShivamShrirao / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch
https://huggingface.co/docs/diffusers
Apache License 2.0

Attempting to unscale FP16 gradients #131

Open cian0 opened 1 year ago

cian0 commented 1 year ago

Describe the bug

The script fails before the first training step completes, raising the error in the title: `ValueError: Attempting to unscale FP16 gradients.`

Reproduction

No response

Logs

Steps:   0%|                                        | 0/800 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/sdw/examples/dreambooth/train_dreambooth.py", line 812, in <module>
    main(args)
  File "/workspace/sdw/examples/dreambooth/train_dreambooth.py", line 784, in main
    optimizer.step()
  File "/opt/conda/lib/python3.9/site-packages/accelerate/optimizer.py", line 134, in step
    self.scaler.step(self.optimizer, closure)
  File "/opt/conda/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 337, in step
    self.unscale_(optimizer)
  File "/opt/conda/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 282, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/opt/conda/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 210, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
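
For anyone reading this traceback: the check in `_unscale_grads_` raises when a parameter handed to the optimizer has a float16 gradient, which typically means the trainable weights themselves are stored in half precision rather than float32. The sketch below is one way to confirm that before calling `optimizer.step()`. The helper name `fp16_trainable_params` is made up for illustration and is not part of diffusers, accelerate, or PyTorch, and the `Linear` layer is only a stand-in for whatever model the script is training.

```python
from typing import List

import torch


def fp16_trainable_params(model: torch.nn.Module) -> List[str]:
    """Return the names of parameters that require grad and are stored in float16.

    Hypothetical diagnostic helper: any name returned here would produce a
    float16 gradient, which GradScaler refuses to unscale.
    """
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.dtype == torch.float16
    ]


if __name__ == "__main__":
    # Stand-in module cast to half precision; in a real run this would be
    # the model passed to the optimizer (an assumption, not taken from the script).
    model = torch.nn.Linear(4, 4).half()
    print(fp16_trainable_params(model))
```

If the list is non-empty for the model being trained, keeping those weights in float32 and leaving the downcasting to mixed precision is the usual way to avoid this error, though whether that applies to this particular script is not confirmed in this thread.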

System Info

my pip list: absl-py 1.3.0 accelerate 0.14.0 aiohttp 3.8.3 aiosignal 1.2.0 anyio 3.6.2 argon2-cffi 21.3.0 argon2-cffi-bindings 21.2.0 asttokens 2.0.5 astunparse 1.6.3 async-timeout 4.0.2 attrs 22.1.0 awscli 1.27.8 Babel 2.11.0 backcall 0.2.0 bash_kernel 0.8.0 bcrypt 4.0.1 beautifulsoup4 4.11.1 bitsandbytes 0.35.4 bleach 5.0.1 botocore 1.29.8 brotlipy 0.7.0 cachetools 5.2.0 certifi 2022.9.24 cffi 1.15.0 chardet 4.0.0 charset-normalizer 2.0.4 click 8.1.3 cmake 3.24.3 colorama 0.4.4 conda 22.9.0 conda-build 3.22.0 conda-content-trust 0+unknown conda-package-handling 1.8.1 contourpy 1.0.6 cryptography 36.0.0 cycler 0.11.0 debugpy 1.6.3 decorator 5.1.1 defusedxml 0.7.1 diffusers 0.8.0.dev0 docutils 0.16 entrypoints 0.4 exceptiongroup 1.0.0 executing 0.8.3 expecttest 0.1.4 fastapi 0.86.0 fastjsonschema 2.16.2 ffmpy 0.3.0 filelock 3.6.0 fonttools 4.38.0 frozenlist 1.3.1 fsspec 2022.10.0 ftfy 6.1.1 future 0.18.2 glob2 0.7 google-auth 2.14.1 google-auth-oauthlib 0.4.6 gradio 3.9 grpcio 1.50.0 h11 0.12.0 httpcore 0.15.0 httpx 0.23.0 huggingface-hub 0.10.1 hypothesis 6.56.4 idna 3.3 importlib-metadata 5.0.0 iniconfig 1.1.1 ipykernel 6.17.1 ipython 8.4.0 ipython-genutils 0.2.0 ipywidgets 8.0.2 jedi 0.18.1 Jinja2 3.1.2 jmespath 1.0.1 json5 0.9.10 jsonschema 4.17.0 jupyter 1.0.0 jupyter-archive 3.3.2 jupyter_client 7.4.5 jupyter-console 6.4.4 jupyter_core 5.0.0 jupyter-http-over-ws 0.0.8 jupyter-server 1.23.2 jupyterlab 3.5.0 jupyterlab-pygments 0.2.2 jupyterlab_server 2.16.3 jupyterlab-widgets 3.0.3 kiwisolver 1.4.4 libarchive-c 2.9 linkify-it-py 1.0.3 Markdown 3.4.1 markdown-it-py 2.1.0 MarkupSafe 2.1.1 matplotlib 3.6.2 matplotlib-inline 0.1.6 mdit-py-plugins 0.3.1 mdurl 0.1.2 mistune 2.0.4 mkl-fft 1.3.1 mkl-random 1.2.2 mkl-service 2.4.0 modelcards 0.1.6 mpmath 1.2.1 multidict 6.0.2 mypy-extensions 0.4.3 natsort 8.2.0 nbclassic 0.4.8 nbclient 0.7.0 nbconvert 7.2.5 nbformat 5.7.0 nbzip 0.1.0 nest-asyncio 1.5.6 notebook 6.5.2 notebook_shim 0.2.2 numpy 1.22.3 oauthlib 3.2.2 
orjson 3.8.1 packaging 21.3 pandas 1.5.1 pandocfilters 1.5.0 paramiko 2.12.0 parso 0.8.3 pexpect 4.8.0 pickleshare 0.7.5 Pillow 9.0.1 pip 21.2.4 pkginfo 1.8.3 platformdirs 2.5.4 pluggy 1.0.0 prometheus-client 0.15.0 prompt-toolkit 3.0.20 protobuf 3.20.3 psutil 5.8.0 ptyprocess 0.7.0 pure-eval 0.2.2 pyasn1 0.4.8 pyasn1-modules 0.2.8 pycosat 0.6.3 pycparser 2.21 pycryptodome 3.15.0 pydantic 1.10.2 pydub 0.25.1 Pygments 2.11.2 PyNaCl 1.5.0 pyOpenSSL 22.0.0 pyparsing 3.0.9 pyre-extensions 0.0.23 pyrsistent 0.19.2 PySocks 1.7.1 pytest 7.2.0 python-dateutil 2.8.2 python-multipart 0.0.5 pytz 2022.1 PyYAML 5.4.1 pyzmq 24.0.1 qtconsole 5.4.0 QtPy 2.3.0 regex 2022.10.31 requests 2.27.1 requests-oauthlib 1.3.1 rfc3986 1.5.0 rsa 4.7.2 ruamel-yaml-conda 0.15.100 s3transfer 0.6.0 Send2Trash 1.8.0 setuptools 61.2.0 six 1.16.0 sniffio 1.3.0 sortedcontainers 2.4.0 soupsieve 2.3.2.post1 stack-data 0.2.0 starlette 0.20.4 sympy 1.11.1 tensorboard 2.11.0 tensorboard-data-server 0.6.1 tensorboard-plugin-wit 1.8.1 terminado 0.17.0 tinycss2 1.2.1 tokenizers 0.13.1 toml 0.10.2 tomli 2.0.1 toolz 0.11.2 torch 1.13.0 torchtext 0.14.0 torchvision 0.14.0 tornado 6.2 tqdm 4.63.0 traitlets 5.5.0 transformers 4.24.0 triton 2.0.0.dev20221105 types-dataclasses 0.6.6 typing_extensions 4.4.0 typing-inspect 0.8.0 uc-micro-py 1.0.1 urllib3 1.26.8 uvicorn 0.19.0 wcwidth 0.2.5 webencodings 0.5.1 websocket-client 1.4.2 websockets 10.4 Werkzeug 2.2.2 wheel 0.37.1 widgetsnbextension 4.0.3 xformers 0.0.14.dev0 yarl 1.8.1 zipp 3.10.0

I tried this on vast.ai with these machines:

RTX 3090, CUDA 11.4

A6000, CUDA 11.7

gadicc commented 1 year ago

According to the user report at https://github.com/huggingface/diffusers/issues/1246, this is a recently introduced bug in diffusers.