Open schopra8 opened 4 months ago
There were similar issues reported a few years back -- https://github.com/Lightning-AI/pytorch-lightning/issues/7792
And they seem to have been solved -- https://github.com/Lightning-AI/pytorch-lightning/pull/7975
So I'm not sure whether the bug was reintroduced in subsequent years or whether I'm missing something in my example code.
Bug description
I'm using 2 optimizers and trying to train with AMP (FP16). I can take steps with my first optimizer. When I take my first step with the second optimizer I get the following error:
I can train this correctly in FP32 -- so it seems to be an issue with AMP.
What version are you seeing the problem on?
version 2.3.3
How to reproduce the bug
Error messages and logs
Environment
Current environment
```
* CUDA:
    - GPU:
        - NVIDIA A100-SXM4-80GB
    - available: True
    - version: 12.1
* Lightning:
    - lightning-utilities: 0.11.5
    - pytorch-lightning: 2.3.3
    - torch: 2.3.1
    - torchmetrics: 1.4.0.post0
    - torchvision: 0.18.1
* Packages:
    - aiohttp: 3.9.5
    - aiosignal: 1.3.1
    - annotated-types: 0.7.0
    - antlr4-python3-runtime: 4.9.3
    - anyio: 4.4.0
    - argon2-cffi: 23.1.0
    - argon2-cffi-bindings: 21.2.0
    - arrow: 1.3.0
    - asttokens: 2.4.1
    - async-lru: 2.0.4
    - async-timeout: 4.0.3
    - attrs: 23.2.0
    - autocommand: 2.2.2
    - babel: 2.15.0
    - backports.tarfile: 1.2.0
    - beautifulsoup4: 4.12.3
    - bitsandbytes: 0.43.1
    - bleach: 6.1.0
    - boto3: 1.34.144
    - botocore: 1.34.144
    - braceexpand: 0.1.7
    - certifi: 2024.7.4
    - nvidia-curand-cu12: 10.3.2.106
    - nvidia-cusolver-cu12: 11.4.5.107
    - nvidia-cusparse-cu12: 12.1.0.106
    - nvidia-nccl-cu12: 2.20.5
    - nvidia-nvjitlink-cu12: 12.5.82
    - nvidia-nvtx-cu12: 12.1.105
    - omegaconf: 2.3.0
    - opencv-python: 4.10.0.84
    - ordered-set: 4.1.0
    - overrides: 7.7.0
    - packaging: 24.1
    - pandocfilters: 1.5.1
    - parso: 0.8.4
    - pexpect: 4.9.0
    - pillow: 10.4.0
    - pip: 24.1
    - platformdirs: 4.2.2
    - pre-commit: 3.7.1
    - proglog: 0.1.10
    - prometheus-client: 0.20.0
    - prompt-toolkit: 3.0.47
    - protobuf: 5.27.2
    - psutil: 6.0.0
    - ptyprocess: 0.7.0
    - pure-eval: 0.2.2
    - pycparser: 2.22
    - pydantic: 2.8.2
    - pydantic-core: 2.20.1
    - pydantic-settings: 2.3.4
    - pygments: 2.18.0
    - python-dateutil: 2.9.0.post0
    - python-dotenv: 1.0.1
    - python-json-logger: 2.0.7
    - pytorch-lightning: 2.3.3
    - pyyaml: 6.0.1
    - pyzmq: 26.0.3
    - referencing: 0.35.1
    - requests: 2.32.3
    - rfc3339-validator: 0.1.4
    - rfc3986-validator: 0.1.1
    - rpds-py: 0.19.0
    - s3transfer: 0.10.2
    - send2trash: 1.8.3
    - sentry-sdk: 2.10.0
    - setproctitle: 1.3.3
    - setuptools: 71.0.2
    - six: 1.16.0
    - smmap: 5.0.1
    - sniffio: 1.3.1
    - soupsieve: 2.5
    - stack-data: 0.6.3
    - sympy: 1.13.0
    - terminado: 0.18.1
    - tinycss2: 1.3.0
    - tomli: 2.0.1
    - torch: 2.3.1
    - torchmetrics: 1.4.0.post0
    - torchvision: 0.18.1
    - tornado: 6.4.1
    - tqdm: 4.66.4
    - traitlets: 5.14.3
    - triton: 2.3.1
    - typeguard: 4.3.0
    - types-python-dateutil: 2.9.0.20240316
    - typing-extensions: 4.12.2
    - uri-template: 1.3.0
    - urllib3: 2.2.2
    - virtualenv: 20.26.3
    - wandb: 0.17.4
    - wcwidth: 0.2.13
    - webcolors: 24.6.0
    - webdataset: 0.2.86
    - webencodings: 0.5.1
    - websocket-client: 1.8.0
    - wheel: 0.43.0
    - yarl: 1.9.4
    - zipp: 3.19.2
* System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor:
    - python: 3.10.14
    - release: 5.10.0-31-cloud-amd64
    - version: #1 SMP Debian 5.10.221-1 (2024-07-14)
```
More info
No response