Closed BoringDonut closed 1 year ago
I think this is a duplicate of https://github.com/Lightning-AI/lightning/issues/18394. Can you confirm?
Yes, indeed. Sorry for that
Duplicate, and fixed by https://github.com/Lightning-AI/lightning/pull/18854
Bug description
Using `BatchSizeFinder` seems to limit the number of validation batches to `BatchSizeFinder._steps_per_trial`. As a result, the validation set shrinks to a few dozen samples and the reported metrics are not representative.
It seems this can be fixed by calling `_reset_dataloaders` one additional time.
What version are you seeing the problem on?
v1.8, v2.0
How to reproduce the bug
Error messages and logs
Here is a log that shows the number of validated samples for each epoch. Validation dataset size: 123, number of epochs: 2, batch size: 2 (see code above).
As you can see, the first and last runs validated all 123 samples in both epochs, while the second run (with the default `BatchSizeFinder`) validated only 6 samples in each epoch. Here, 6 = steps_per_trial * BATCH_SIZE = 3 * 2.
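For reference, the arithmetic behind these numbers can be sketched in plain Python. This is only an illustration and does not use Lightning: `validated_samples` and its parameters are hypothetical names, and it assumes the validation loop simply stops after a fixed number of batches.

```python
from typing import Optional


def validated_samples(dataset_size: int, batch_size: int,
                      limit_batches: Optional[int] = None) -> int:
    """Samples seen in one validation epoch (hypothetical helper, not a Lightning API)."""
    # Without a limit, every sample is validated.
    if limit_batches is None:
        return dataset_size
    # With a limit, at most `limit_batches` batches of `batch_size` samples are seen.
    return min(dataset_size, limit_batches * batch_size)


VAL_DS_SIZE = 123
BATCH_SIZE = 2
STEPS_PER_TRIAL = 3  # value reported in this issue

print(validated_samples(VAL_DS_SIZE, BATCH_SIZE))                   # no limit left over
print(validated_samples(VAL_DS_SIZE, BATCH_SIZE, STEPS_PER_TRIAL))  # limit left over
```

Under that assumption, a leftover limit of 3 batches would account for the 6 samples seen in the second run, while the unlimited case matches the 123 samples seen in the other runs.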
Environment
Current environment
* CUDA:
  - GPU:
    - NVIDIA GeForce RTX 3050 Laptop GPU
  - available: True
  - version: 12.1
* Lightning:
  - lightning: 2.1.0
  - lightning-utilities: 0.9.0
  - pytorch-lightning: 2.1.0
  - torch: 2.1.0
  - torchmetrics: 1.2.0
* Packages:
  - aiohttp: 3.8.6
  - aiosignal: 1.3.1
  - async-timeout: 4.0.3
  - attrs: 23.1.0
  - certifi: 2023.7.22
  - charset-normalizer: 3.3.0
  - filelock: 3.12.4
  - frozenlist: 1.4.0
  - fsspec: 2023.9.2
  - idna: 3.4
  - jinja2: 3.1.2
  - lightning: 2.1.0
  - lightning-utilities: 0.9.0
  - markupsafe: 2.1.3
  - mpmath: 1.3.0
  - multidict: 6.0.4
  - networkx: 3.1
  - numpy: 1.24.4
  - nvidia-cublas-cu12: 12.1.3.1
  - nvidia-cuda-cupti-cu12: 12.1.105
  - nvidia-cuda-nvrtc-cu12: 12.1.105
  - nvidia-cuda-runtime-cu12: 12.1.105
  - nvidia-cudnn-cu12: 8.9.2.26
  - nvidia-cufft-cu12: 11.0.2.54
  - nvidia-curand-cu12: 10.3.2.106
  - nvidia-cusolver-cu12: 11.4.5.107
  - nvidia-cusparse-cu12: 12.1.0.106
  - nvidia-nccl-cu12: 2.18.1
  - nvidia-nvjitlink-cu12: 12.2.140
  - nvidia-nvtx-cu12: 12.1.105
  - packaging: 23.2
  - pip: 23.2.1
  - pytorch-lightning: 2.1.0
  - pyyaml: 6.0.1
  - requests: 2.31.0
  - setuptools: 68.1.2
  - sympy: 1.12
  - torch: 2.1.0
  - torchmetrics: 1.2.0
  - tqdm: 4.66.1
  - triton: 2.1.0
  - typing-extensions: 4.8.0
  - urllib3: 2.0.7
  - wheel: 0.41.2
  - yarl: 1.9.2
* System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.8.18
  - release: 5.15.0-83-generic
  - version: #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023
More info
@tanaymeh could you maybe add a related fix to #18826? It seems to touch the same parts of the code and should only require a few additional lines.