Lightning-AI / pytorch-lightning

Pretrain and finetune ANY AI model of ANY size on multiple GPUs and TPUs with zero code changes.
https://lightning.ai
Apache License 2.0
28.23k stars 3.38k forks

How to do a full validation loop at iteration 0? #4170

Closed jonaskohler closed 3 years ago

jonaskohler commented 4 years ago

Hi all

I have two issues with logging in Lightning. First, I would like to run a full validation loop at the beginning of training to get the validation loss at random initialization.

Second, I set log_every_n_steps=100 for the pl.Trainer and would have expected the trainer to also log at step 0, but the first log only appears at step 100.

Thus, I cannot see how the network behaves at initialization on either the train or the validation set. I tried to find a fix in the documentation but didn't succeed.

Any help is highly appreciated :)

github-actions[bot] commented 4 years ago

Hi! Thanks for your contribution, great first issue!

ydcjeff commented 4 years ago

First, if you would like to run validation before training starts, you can use num_sanity_val_steps: https://pytorch-lightning.readthedocs.io/en/latest/trainer.html#num-sanity-val-steps

Second, setting log_every_n_steps=100 will only log every 100 steps. If you would like to log at step 0, you can check batch_idx == 0 in training_step/validation_step and log whatever you want.

rohitgr7 commented 4 years ago

Would it make sense to allow logging at global_step == 0?

williamFalcon commented 4 years ago

Would it make sense to allow logging at global_step == 0?

not sure... what does everyone think?

jonaskohler commented 4 years ago

Thanks for your answers. For me (speaking from a research perspective) it is definitely useful to know what's happening right at initialization. For example when examining different initialization schemes.

williamFalcon commented 4 years ago

yeah, i don’t think i mind logging batch 0.

@ananthsub?

jonaskohler commented 4 years ago

Please allow me a follow-up question: when passing num_sanity_val_steps=-1 to the trainer, validation_epoch_end is called at iteration 0, but for some reason nothing is logged. I tried both self.log(...) and returning {'log': logs} from the hook.

As soon as the first epoch is over, everything logs normally. What do I need to do to make the first log show up on TensorBoard? Thanks in advance.
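As far as I can tell, metrics produced during the sanity check are intentionally not written to the logger, which would explain the missing entries: the hook runs, but its values never reach TensorBoard until a real validation epoch ends. A minimal sketch of the hook itself, assuming each validation_step returned a dict containing "val_loss":

```python
import torch

def validation_epoch_end(self, outputs):
    # average the per-batch losses collected from validation_step;
    # during the sanity check this runs but the logged value is discarded
    avg_loss = torch.stack([o["val_loss"] for o in outputs]).mean()
    self.log("avg_val_loss", avg_loss)
```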

ananthsub commented 4 years ago

Logging at global step 0 sounds good to me. I think it'd also simplify the logging internals from (global_step + 1) % log_every_n_steps == 0 to just global_step % log_every_n_steps == 0.

rohitgr7 commented 4 years ago

(global step) % log_every_n_steps.

global_step is 0-indexed, so with log_every_n_steps = 10 this would log at the 1st, 11th, and 21st training_step (global steps 0, 10, 20).
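The difference between the two modulus conventions can be checked with plain Python (log_every_n_steps = 10 is just an example value here):

```python
LOG_EVERY_N_STEPS = 10

def logs_current(global_step: int) -> bool:
    # current behaviour: the first log happens at global_step 9 (the 10th batch)
    return (global_step + 1) % LOG_EVERY_N_STEPS == 0

def logs_proposed(global_step: int) -> bool:
    # proposed behaviour: the very first batch (global_step 0) is logged too
    return global_step % LOG_EVERY_N_STEPS == 0

current = [s for s in range(25) if logs_current(s)]    # [9, 19]
proposed = [s for s in range(25) if logs_proposed(s)]  # [0, 10, 20]
```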

eladar commented 3 years ago

Hi, I have a somewhat similar question: is there any way to log the first training epoch (or the first several iterations) without taking any optimizer steps? I want a baseline for the freshly initialized model so I can compare loss values on both the training and validation samples.

Thanks in advance, -ea

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!