Open TheAeryan opened 4 months ago
I have also encountered this problem. In my case, it was caused by an increase in dataset size.

Pretrain: each epoch consists of 100 iterations.
Finetune: each epoch consists of 120 iterations.

After pretraining for n epochs, fine-tuning commences at `training_step` epoch_n it_100. PL logs the 'xxx_epoch' metrics between the invocation of `callbacks.ModelCheckpoint` and `training_step` epoch_n+1 it_0, i.e. only after the checkpoint callback has already run.
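The arithmetic behind this comment can be sketched in plain Python. This is only a toy model of a restored within-epoch batch counter, not Lightning code; `resume_state` and its arguments are hypothetical names introduced for illustration, and the numbers are the ones quoted above.

```python
def resume_state(restored_batch_idx: int, batches_per_epoch: int) -> dict:
    """Toy model: where does a loop land if it restores a within-epoch batch counter?"""
    # Batches still to run before the current epoch is considered finished.
    remaining = max(batches_per_epoch - restored_batch_idx, 0)
    return {"mid_epoch": remaining > 0, "remaining_batches": remaining}


# Same dataset size as pretraining: the restored counter sits on the epoch boundary.
same_size = resume_state(restored_batch_idx=100, batches_per_epoch=100)

# Larger fine-tuning dataset: the same counter now points into the middle of an epoch.
larger = resume_state(restored_batch_idx=100, batches_per_epoch=120)

print(same_size)
print(larger)
```

Under this assumption, a checkpoint written at the end of a 100-iteration epoch looks like a mid-epoch checkpoint once the epoch length grows to 120, which matches the "epoch_n it_100" starting point reported above.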
Bug description
I have a model with several `ModelCheckpoint` callbacks. When loading it from a checkpoint using `trainer.fit(model, datamodule=dm, ckpt_path=training_ckpt_path)`, I get the following error:

The issue seems to be that the `v_nll_unsupervised` metric was not logged with the `log(...)` method, so the `ModelCheckpoint` callback can't find it. However, although I don't log this metric at every validation step, it is logged at least once every validation epoch. Since I use `on_step=False, on_epoch=True` when logging metrics, I would expect the whole validation epoch to end before the `ModelCheckpoint` callback tries to access this metric, in which case it would exist and no error would be raised. Nonetheless, it seems this metric is being accessed just after the first validation iteration.

I thought that maybe this was due to the sanity checking process when training starts. However, setting `num_sanity_val_steps=0` or `num_sanity_val_steps=-1` in the `Trainer` did not solve anything.

What version are you seeing the problem on?
v2.1
How to reproduce the bug
No response
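No reproduction was provided. As a rough illustration of the failure mode described in the bug description, here is a minimal pure-Python sketch. It is a toy model and not Lightning's implementation: `check_monitor` and `run_validation` are hypothetical helpers, and the mean aggregation is an assumption meant to mimic epoch-level logging. Only the metric name `v_nll_unsupervised` comes from the report.

```python
from typing import Dict, List, Optional

MONITOR = "v_nll_unsupervised"


def check_monitor(metrics: Dict[str, float], monitor: str) -> float:
    """Hypothetical stand-in for a checkpoint callback looking up its monitored key."""
    if monitor not in metrics:
        raise KeyError(f"monitored key {monitor!r} not found; available: {sorted(metrics)}")
    return metrics[monitor]


def run_validation(step_values: List[Optional[float]], check_after_step: Optional[int]) -> float:
    """step_values[i] is the value logged at step i, or None if that step logged nothing.

    If check_after_step is set, the lookup happens right after that step;
    otherwise it happens once the whole validation epoch has finished.
    """
    logged: List[float] = []
    for i, value in enumerate(step_values):
        if value is not None:
            logged.append(value)
        if check_after_step == i:
            # Early lookup: only values logged so far are visible.
            metrics = {MONITOR: sum(logged) / len(logged)} if logged else {}
            return check_monitor(metrics, MONITOR)
    # Epoch-end lookup: the metric exists if it was logged at least once.
    metrics = {MONITOR: sum(logged) / len(logged)} if logged else {}
    return check_monitor(metrics, MONITOR)


steps = [None, 2.0, 4.0]  # the first validation step does not log the metric

print(run_validation(steps, check_after_step=None))  # lookup at epoch end succeeds
try:
    run_validation(steps, check_after_step=0)  # lookup after the first step
except KeyError as err:
    print("early lookup failed:", err)
```

In this sketch the epoch-end lookup matches the expected behavior, while a lookup right after the first validation iteration fails in the way the report describes.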
Error messages and logs
Environment
Current environment
* CUDA:
    - GPU:
        - Tesla V100-PCIE-16GB
        - Tesla V100-PCIE-16GB
    - available: True
    - version: 11.7
* Lightning:
    - lightning-cloud: 0.5.37
    - lightning-utilities: 0.8.0
    - pytorch-lightning: 2.1.0
    - pytorch-ranger: 0.1.1
    - torch: 2.0.1
    - torch-optimizer: 0.3.0
    - torch-scatter: 2.1.1
    - torchmetrics: 0.11.4
* Packages:
    - absl-py: 1.4.0
    - aiohttp: 3.8.4
    - aiosignal: 1.3.1
    - ansicolors: 1.1.8
    - antlr4-python3-runtime: 4.7.2
    - anyio: 3.7.1
    - arrow: 1.2.3
    - async-timeout: 4.0.2
    - attrs: 23.1.0
    - backoff: 2.2.1
    - beautifulsoup4: 4.12.2
    - blessed: 1.20.0
    - boto: 2.49.0
    - cachetools: 5.3.1
    - certifi: 2023.5.7
    - charset-normalizer: 3.1.0
    - click: 8.1.3
    - cmake: 3.26.4
    - contourpy: 1.1.0
    - croniter: 1.4.1
    - cycler: 0.11.0
    - dateutils: 0.6.12
    - deepdiff: 6.3.1
    - exceptiongroup: 1.1.2
    - fastapi: 0.100.0
    - filelock: 3.12.2
    - fonttools: 4.40.0
    - frozenlist: 1.3.3
    - fsspec: 2023.6.0
    - google-auth: 2.20.0
    - google-auth-oauthlib: 1.0.0
    - gprof2dot: 2022.7.29
    - graphviz: 0.20.1
    - grpcio: 1.51.3
    - h11: 0.14.0
    - idna: 3.4
    - importlib-metadata: 6.7.0
    - importlib-resources: 5.12.0
    - inquirer: 3.1.3
    - itsdangerous: 2.1.2
    - jinja2: 3.1.2
    - joblib: 1.2.0
    - jsonschema: 4.17.3
    - kiwisolver: 1.4.4
    - lifted-pddl: 1.2.2
    - lightning-cloud: 0.5.37
    - lightning-utilities: 0.8.0
    - lit: 16.0.6
    - markdown: 3.4.3
    - markdown-it-py: 3.0.0
    - markupsafe: 2.1.3
    - matplotlib: 3.7.1
    - mdurl: 0.1.2
    - mpmath: 1.3.0
    - msgpack: 1.0.5
    - multidict: 6.0.4
    - multipledispatch: 0.6.0
    - mypy: 1.3.0
    - mypy-extensions: 1.0.0
    - networkx: 3.1
    - numpy: 1.25.0
    - nvidia-cublas-cu11: 11.10.3.66
    - nvidia-cuda-cupti-cu11: 11.7.101
    - nvidia-cuda-nvrtc-cu11: 11.7.99
    - nvidia-cuda-runtime-cu11: 11.7.99
    - nvidia-cudnn-cu11: 8.5.0.96
    - nvidia-cufft-cu11: 10.9.0.58
    - nvidia-curand-cu11: 10.2.10.91
    - nvidia-cusolver-cu11: 11.4.0.1
    - nvidia-cusparse-cu11: 11.7.4.91
    - nvidia-nccl-cu11: 2.14.3
    - nvidia-nvtx-cu11: 11.7.91
    - oauthlib: 3.2.2
    - ordered-set: 4.1.0
    - packaging: 23.1
    - pandas: 2.0.2
    - pddl-generators: 1.0
    - pillow: 9.5.0
    - pip: 23.1.2
    - protobuf: 4.23.3
    - psutil: 5.9.5
    - pyarrow: 12.0.1
    - pyasn1: 0.5.0
    - pyasn1-modules: 0.3.0
    - pydantic: 1.10.11
    - pygments: 2.15.1
    - pyjwt: 2.7.0
    - pynvml: 11.5.0
    - pyparsing: 3.1.0
    - pyperplan: 2.1
    - pyrsistent: 0.19.3
    - python-dateutil: 2.8.2
    - python-editor: 1.0.4
    - python-multipart: 0.0.6
    - pytorch-lightning: 2.1.0
    - pytorch-ranger: 0.1.1
    - pytz: 2023.3
    - pyyaml: 6.0
    - ray: 2.5.0
    - readchar: 4.0.5
    - requests: 2.31.0
    - requests-oauthlib: 1.3.1
    - rich: 13.4.2
    - rsa: 4.9
    - scikit-learn: 1.2.2
    - scipy: 1.10.1
    - seaborn: 0.12.2
    - setuptools: 67.7.2
    - six: 1.16.0
    - snakeviz: 2.2.0
    - sniffio: 1.3.0
    - soupsieve: 2.4.1
    - stable-trunc-gaussian: 1.3.9
    - starlette: 0.27.0
    - starsessions: 1.3.0
    - strips-hgn: 1.0
    - sympy: 1.12
    - tarski: 0.8.2
    - tensorboard: 2.16.2
    - tensorboard-data-server: 0.7.1
    - tensorboardx: 2.6.1
    - threadpoolctl: 3.1.0
    - tomli: 2.0.1
    - torch: 2.0.1
    - torch-optimizer: 0.3.0
    - torch-scatter: 2.1.1
    - torchmetrics: 0.11.4
    - tornado: 6.3.3
    - tqdm: 4.65.0
    - traitlets: 5.9.0
    - triton: 2.0.0
    - typing-extensions: 4.6.3
    - tzdata: 2023.3
    - urllib3: 1.26.16
    - uvicorn: 0.23.0
    - wcwidth: 0.2.6
    - websocket-client: 1.6.1
    - websockets: 11.0.3
    - werkzeug: 2.3.6
    - wheel: 0.40.0
    - yarl: 1.9.2
    - z3: 0.2.0
    - zipp: 3.15.0
* System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor: x86_64
    - python: 3.9.16
    - release: 5.4.0-174-generic
    - version: #193-Ubuntu SMP Thu Mar 7 14:29:28 UTC 2024
More info
No response
cc @carmocca @awaelchli