Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

When interrupting a run with Ctrl+C, sometimes the WandbLogger does not upload a checkpoint artifact #20425

Open · edmcman opened this issue 9 hours ago

edmcman commented 9 hours ago

Bug description

When interrupting a run with Ctrl+C, the WandbLogger sometimes does not upload a checkpoint artifact.

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

Epoch 20:  28%|██▏     | 6502/23178 [29:11<1:14:53,  3.71it/s, v_num=gwj7, train_loss=nan.0]^C
Detected KeyboardInterrupt, attempting graceful shutdown ...
wandb: 🚀 View run train-release-0.1 at: https://wandb.ai/eschwartz/dire/runs/uvexgwj7
Epoch 20:  28%|██▏     | 6502/23178 [29:16<1:15:04,  3.70it/s, v_num=gwj7, train_loss=nan.0]

Environment

Current environment

* CUDA:
    - GPU:
        - NVIDIA GeForce RTX 4070 Laptop GPU
    - available: True
    - version: 12.1
* Lightning:
    - lightning-utilities: 0.11.7
    - pytorch-lightning: 2.4.0
    - torch: 2.3.0
    - torchmetrics: 1.6.0
* Packages:
    - absl-py: 2.1.0
    - aiohappyeyeballs: 2.4.3
    - aiohttp: 3.10.10
    - aiosignal: 1.3.1
    - appdirs: 1.4.4
    - asttokens: 2.4.1
    - async-timeout: 4.0.3
    - attrs: 23.2.0
    - braceexpand: 0.1.7
    - certifi: 2024.2.2
    - charset-normalizer: 3.3.2
    - click: 8.1.7
    - decorator: 5.1.1
    - docker-pycreds: 0.4.0
    - docopt: 0.6.2
    - editdistance: 0.5.3
    - et-xmlfile: 1.1.0
    - exceptiongroup: 1.2.2
    - executing: 2.1.0
    - filelock: 3.13.4
    - frozenlist: 1.5.0
    - fsspec: 2024.3.1
    - future: 1.0.0
    - gitdb: 4.0.11
    - gitpython: 3.1.43
    - grpcio: 1.62.2
    - hjson: 3.1.0
    - idna: 3.7
    - ipdb: 0.13.13
    - ipython: 8.27.0
    - jedi: 0.19.1
    - jep: 4.2.0
    - jinja2: 3.1.3
    - jsonlines: 4.0.0
    - jsonnet: 0.16.0
    - lightning-utilities: 0.11.7
    - markdown: 3.6
    - markdown-it-py: 2.2.0
    - markupsafe: 2.1.5
    - matplotlib-inline: 0.1.7
    - mdurl: 0.1.2
    - mpmath: 1.3.0
    - msgpack: 1.0.8
    - multidict: 6.1.0
    - networkx: 3.3
    - numpy: 1.26.4
    - nvidia-cublas-cu12: 12.1.3.1
    - nvidia-cuda-cupti-cu12: 12.1.105
    - nvidia-cuda-nvrtc-cu12: 12.1.105
    - nvidia-cuda-runtime-cu12: 12.1.105
    - nvidia-cudnn-cu12: 8.9.2.26
    - nvidia-cufft-cu12: 11.0.2.54
    - nvidia-curand-cu12: 10.3.2.106
    - nvidia-cusolver-cu12: 11.4.5.107
    - nvidia-cusparse-cu12: 12.1.0.106
    - nvidia-nccl-cu12: 2.20.5
    - nvidia-nvjitlink-cu12: 12.4.127
    - nvidia-nvtx-cu12: 12.1.105
    - objectio: 0.2.29
    - openpyxl: 3.1.2
    - packaging: 24.1
    - pandas: 2.2.2
    - parso: 0.8.4
    - pexpect: 4.9.0
    - pillow: 10.3.0
    - pip: 22.0.2
    - platformdirs: 4.3.6
    - prompt-toolkit: 3.0.47
    - propcache: 0.2.0
    - protobuf: 4.25.3
    - psutil: 5.9.8
    - ptyprocess: 0.7.0
    - pure-eval: 0.2.3
    - pyelftools: 0.31
    - pygments: 2.6.1
    - python-dateutil: 2.9.0.post0
    - pytorch-lightning: 2.4.0
    - pytz: 2024.1
    - pyyaml: 6.0.1
    - requests: 2.31.0
    - rich: 13.2.0
    - sentencepiece: 0.1.99
    - sentry-sdk: 2.0.1
    - setproctitle: 1.3.3
    - setuptools: 59.6.0
    - shellingham: 1.5.4
    - simplejson: 3.19.2
    - six: 1.16.0
    - smmap: 5.0.1
    - stack-data: 0.6.3
    - sympy: 1.12
    - tensorboard: 2.16.2
    - tensorboard-data-server: 0.7.2
    - tomli: 2.0.1
    - torch: 2.3.0
    - torchmetrics: 1.6.0
    - tqdm: 4.66.2
    - traitlets: 5.14.3
    - triton: 2.3.0
    - typer: 0.12.3
    - typing-extensions: 4.11.0
    - tzdata: 2024.1
    - ujson: 3.2.0
    - urllib3: 2.2.1
    - wandb: 0.18.6
    - wcwidth: 0.2.13
    - webdataset: 0.2.100
    - werkzeug: 3.0.2
    - yarl: 1.16.0
* System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor: x86_64
    - python: 3.10.12
    - release: 6.8.0-48-generic
    - version: #48~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 7 11:24:13 UTC 2

More info

No response

edmcman commented 4 hours ago

So I just did the same thing and it did upload a checkpoint artifact. I'm going to close this for now under the assumption that I accidentally hit Ctrl+C twice or something like that.

edmcman commented 3 hours ago

Just happened again. I definitely did not press it twice.
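
To rule out the double-press theory from the earlier comment, one general pattern is to ignore SIGINT while the cleanup step runs, so a second Ctrl+C cannot cut it short. The sketch below uses only the Python standard library. `ignore_sigint`, `upload_checkpoint`, and `run` are hypothetical names made up for illustration; none of this is Lightning's or wandb's actual shutdown code, and it is not claimed to explain the failure reported above.

```python
import signal
from contextlib import contextmanager


@contextmanager
def ignore_sigint():
    """Temporarily ignore SIGINT (must be called from the main thread)."""
    previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
    try:
        yield
    finally:
        # Restore whatever handler was installed before entering the block.
        signal.signal(signal.SIGINT, previous)


def upload_checkpoint(uploaded):
    # Hypothetical stand-in for the real upload; it only records that it ran.
    uploaded.append("checkpoint")


def run(train_step, uploaded):
    """Run a training step and always attempt the upload, even after Ctrl+C."""
    try:
        train_step()
    except KeyboardInterrupt:
        print("Interrupted, finishing upload ...")
    finally:
        # A second Ctrl+C during this block is ignored rather than raised.
        with ignore_sigint():
            upload_checkpoint(uploaded)
```

If the artifact is still missing with a guard like this in place, a second interrupt is probably not the cause.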