Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Data-loading sequence is confusing #20358

Open workhours opened 3 days ago

workhours commented 3 days ago

Bug description

My understanding of the data-consumption sequence in Lightning is:

1. Sanity check: `val_dataloader` is called.
2. Training: `train_dataloader` is called.
3. Validation: `val_dataloader` is called.

From that sequence I would conclude that an epoch's cycle starts at `val_dataloader` and ends at `train_dataloader`, and that the validation in step 3 reuses the val data from the first `val_dataloader` call. But if you check `trainer.current_epoch`: suppose `current_epoch` is 1 during the sanity-check `val_dataloader`; it has already increased to 2 by the time `train_dataloader` is called. In that case it seems the epoch cycle starts at `train_dataloader` and ends at `val_dataloader`. This makes it confusing to write the code inside `val_dataloader` when loading data dynamically. With an unbounded number of epochs there is no problem, but at the last epoch (and I cannot know at that moment that it is the last one), should I ignore the fact that the val data is `None`, or should I try to load it as if a new cycle were starting? A minimal script to observe the actual call order is sketched below.
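For reference, here is a minimal, self-contained sketch (mine, not from the issue template) that prints which dataloader hook fires and what `trainer.current_epoch` is at that moment. `reload_dataloaders_every_n_epochs=1` is set so the hooks run every epoch; on older installs the import is `pytorch_lightning` instead of `lightning.pytorch`:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl  # older installs: import pytorch_lightning as pl


class ProbeModel(pl.LightningModule):
    """Prints which dataloader hook fires and the current_epoch at that moment."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def _loader(self):
        return DataLoader(TensorDataset(torch.randn(8, 4), torch.randn(8, 1)),
                          batch_size=4)

    def train_dataloader(self):
        print(f"train_dataloader called, current_epoch={self.trainer.current_epoch}")
        return self._loader()

    def val_dataloader(self):
        print(f"val_dataloader called, current_epoch={self.trainer.current_epoch}")
        return self._loader()

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    # Reload dataloaders every epoch so the hook calls are visible
    # across epoch boundaries, not just on the first epoch.
    trainer = pl.Trainer(max_epochs=2, reload_dataloaders_every_n_epochs=1,
                         enable_progress_bar=False, logger=False)
    trainer.fit(ProbeModel())
```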

I think the sanity-check logic and the validation logic should be merged into one data setup that is used twice for different purposes. Calling `val_dataloader` twice but `train_dataloader` only once also makes it difficult to manage data loading. (One partial workaround is sketched below.)
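If the extra sanity-check call is the main problem, one workaround is to disable the sanity pass with the Trainer's `num_sanity_val_steps` argument, so the validation loop becomes the only consumer of `val_dataloader`. A minimal sketch:

```python
import lightning.pytorch as pl

# With num_sanity_val_steps=0 the pre-training sanity validation pass is
# skipped entirely, so val_dataloader is only called by the real
# validation loop during fit.
trainer = pl.Trainer(max_epochs=3, num_sanity_val_steps=0)
```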

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment

```
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
```

More info

No response

workhours commented 3 days ago

Sorry for submitting multiple times; a damned firewall kept getting in the way. By the way, `fit_loop.on_run_start` and `on_advance_start` both call `setup_data`, so it runs twice. Why? Is `on_advance_start` the real start of training?
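As far as I can tell from the 2.x source, the second `setup_data` call returns early unless a dataloader reload is actually due, and the knob that makes it do real work each epoch is `reload_dataloaders_every_n_epochs`. Here is a sketch of using that flag for per-epoch dynamic loading; `fetch_chunk_for_epoch` is a hypothetical helper standing in for real staging logic:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl


def fetch_chunk_for_epoch(epoch: int) -> TensorDataset:
    # Hypothetical stand-in for whatever download/staging logic a project uses.
    return TensorDataset(torch.randn(16, 4), torch.randn(16, 1))


class StreamingModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def train_dataloader(self):
        # With reload_dataloaders_every_n_epochs=1 the trainer invokes this
        # hook again at each epoch start, so requesting the next chunk here
        # is one pattern for dynamic per-epoch loading.
        return DataLoader(fetch_chunk_for_epoch(self.trainer.current_epoch),
                          batch_size=4)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


trainer = pl.Trainer(max_epochs=3, reload_dataloaders_every_n_epochs=1)
trainer.fit(StreamingModel())
```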

workhours commented 3 days ago

The simple scenario: if the user wants to feed data for the next epoch, just give them one callable interface, called once for all types of data (val, train, test, predict, ...). That sends a clear message: if it is called again, it must be a request for the next epoch's data. As it stands, the framework is so flexible that it is difficult to write the data-moving logic across `val_dataloader`, `train_dataloader`, etc. Alternatively, the framework should provide a clear notification that the current epoch has ended and no more data will be requested. The closest existing building blocks I can find are sketched below.
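For what it's worth, the nearest existing pieces seem to be a `LightningDataModule` (one object owning data for all stages) and the `on_train_epoch_end` callback hook (a signal that the epoch has finished). A sketch under those assumptions; `stage_next_chunk` is a hypothetical helper:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl


def stage_next_chunk():
    # Hypothetical staging helper standing in for real data movement.
    return {
        "train": TensorDataset(torch.randn(16, 4), torch.randn(16, 1)),
        "val": TensorDataset(torch.randn(8, 4), torch.randn(8, 1)),
    }


class OnePlaceDataModule(pl.LightningDataModule):
    """Sketch: a single object that owns data movement for every stage."""

    def __init__(self):
        super().__init__()
        self.data = None

    def setup(self, stage=None):
        # Runs per stage (fit/validate/test/predict); load or move data here
        # instead of inside scattered *_dataloader hooks.
        self.data = stage_next_chunk()

    def train_dataloader(self):
        return DataLoader(self.data["train"], batch_size=4)

    def val_dataloader(self):
        return DataLoader(self.data["val"], batch_size=4)


class EpochEndNotifier(pl.Callback):
    # Fires at the end of each training epoch: one usable
    # "this epoch is finished" signal for staging the next chunk.
    def on_train_epoch_end(self, trainer, pl_module):
        print(f"epoch {trainer.current_epoch} finished")
```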