Bug description

I have a callback that is supposed to be called at the end of both the training and the validation epoch. However, the validation part of it is never invoked. The callback renders videos and logs them to wandb, and it is supposed to run every n epochs after the train/val epoch ends. The videos it is supposed to create are never saved to disk, so this is not a wandb bug but a PL one. It happens when using more than one GPU with DDP.

Error messages and logs

There are no error messages; the code is just never invoked. The videos are not created, and none of the print statements indicate that this part of the callback is ever reached.
@athn-nik Is the validation loop running for you? One common reason validation may not run is limit_val_batches=0. Can you check or share your Trainer settings, ideally the full script?
@awaelchli thanks for picking this up. limit_val_batches is always > 0. Here are the args:
auto_select_gpus: true
strategy: null # 'ddp' for multi gpu
benchmark: False
max_epochs: 1001
accelerator: gpu
devices: 1
log_every_n_steps: 1
deterministic: False
detect_anomaly: False
enable_progress_bar: True
check_val_every_n_epoch: 25
limit_train_batches: 1.0
limit_val_batches: 1.0
num_sanity_val_steps: 2
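(For context, these flags map one-to-one onto Trainer keyword arguments. A minimal sketch of how such a YAML config could be turned into a Trainer, assuming OmegaConf is used for loading; the loading mechanism and the file name are assumptions, not part of this report:)

from omegaconf import OmegaConf
from pytorch_lightning import Trainer

# Load the YAML above (hypothetical file name) and unpack it into the Trainer.
cfg = OmegaConf.load("trainer.yaml")
trainer = Trainer(**OmegaConf.to_container(cfg, resolve=True))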
I have a dataloader, and all the other validation-relevant metrics and losses are computed and logged to wandb and stdout. Here is the LightningModule code:
import numpy as np
import torch
from hydra.utils import instantiate  # assumption: `instantiate` comes from Hydra
from pytorch_lightning import LightningModule

class BaseModel(LightningModule):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.save_hyperparameters(logger=False)
# Save visuals, one validation step per validation epoch
self.store_examples = {"train": None,
"val": None}
# Need to define:
# forward
# allsplit_step()
# metrics()
# losses()
def __post_init__(self):
trainable, nontrainable = 0, 0
for p in self.parameters():
if p.requires_grad:
trainable += np.prod(p.size())
else:
nontrainable += np.prod(p.size())
self.hparams.n_params_trainable = trainable
self.hparams.n_params_nontrainable = nontrainable
def training_step(self, batch, batch_idx):
return self.allsplit_step("train", batch, batch_idx)
def validation_step(self, batch, batch_idx):
return self.allsplit_step("val", batch, batch_idx)
def test_step(self, batch, batch_idx):
return self.allsplit_step("test", batch, batch_idx)
def allsplit_epoch_end(self, split: str, outputs):
loss_tracker = self.tracker[split]
loss_dict = loss_tracker.compute()
loss_tracker.reset()
dico = {loss_tracker.loss2logname(loss, split): value.item()
for loss, value in loss_dict.items()}
# workaround for LR, assuming 1 optimizer, 1 scheduler, very weak
curr_lr = self.trainer.optimizers[0].param_groups[0]['lr']
dico.update({'Learning Rate': curr_lr})
dico.update({"epoch": float(self.trainer.current_epoch),
"step": float(self.trainer.current_epoch)})
if split == "val":
metrics_dict = self.metrics.compute()
dico.update({f"Metrics/{metric}": value for metric, value in metrics_dict.items() if '_mean_' in metric})
self.log_dict(dico)
def training_epoch_end(self, outputs):
return self.allsplit_epoch_end("train", outputs)
def validation_epoch_end(self, outputs):
return self.allsplit_epoch_end("val", outputs)
def test_epoch_end(self, outputs):
return self.allsplit_epoch_end("test", outputs)
def configure_optimizers(self):
optim_dict = {}
optimizer = instantiate(self.hparams.optim, params=self.parameters())
optim_dict['optimizer'] = optimizer
if self.hparams.lr_scheduler == 'reduceonplateau':
optim_dict['lr_scheduler'] = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, threshold=1e-3)
optim_dict['monitor'] = 'losses/total/train'
elif self.hparams.lr_scheduler == 'steplr':
optim_dict['lr_scheduler'] = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100)
return optim_dict
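(self.tracker and self.metrics are defined elsewhere in my codebase. Purely as an illustration of the interface the code above relies on — update(), compute(), reset(), loss2logname() — a hypothetical per-split loss tracker could look like this; a sketch, not the actual implementation:)

from collections import defaultdict
import torch

class LossTracker:
    def __init__(self):
        self._sums = defaultdict(float)
        self._counts = defaultdict(int)

    def update(self, loss_dict):
        # Accumulate each named loss over the epoch.
        for name, value in loss_dict.items():
            self._sums[name] += float(value)
            self._counts[name] += 1

    def compute(self):
        # Return per-loss epoch averages as tensors (so .item() works).
        return {name: torch.tensor(self._sums[name] / self._counts[name])
                for name in self._sums}

    def reset(self):
        self._sums.clear()
        self._counts.clear()

    @staticmethod
    def loss2logname(loss, split):
        # Matches the monitor key format used above, e.g. 'losses/total/train'.
        return f"losses/{loss}/{split}"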
The relevant allsplit_step that is called is model-dependent; an example could be:
def allsplit_step(self, split: str, batch, batch_idx):
# Prepare the generated motion features
length = batch["length"]
input_motion_feats = batch["datastruct"]
total_loss, loss_dict = self.losses[split](...)
    if split == 'val':
        # output_features_T is the model output computed above (elided here)
        self.metrics(input_motion_feats.detach().joints,
                     output_features_T.detach().joints,
                     length)
if batch_idx == 0:
nvids = self.hparams.nvids_to_save
if nvids is not None and nvids != 0:
del self.store_examples[split]
lengths = batch['length'][:nvids]
keyids = batch['keyid'][:nvids]
motion_features = batch['datastruct']
def prepare_pos(x):
x = x.detach().joints[:nvids]
x = x.cpu().numpy()
return remove_padding(x, lengths)
def prepare_verts(x):
x = x.detach().vertices[:nvids]
x = x.cpu().numpy()
return remove_padding(x, lengths)
self.store_examples[split] = { "text": batch["text"][:nvids] }
self.store_examples[split].update({
'ref': prepare_pos(input_motion_feats),
'ref_features': motion_features.detach(),
'keyids': keyids
})
self.tracker[split].update(loss_dict)
return total_loss
where store_examples is what the callback uses when the epoch ends.
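(To illustrate that interaction — this is a hypothetical sketch, not the actual callback; render_and_log stands in for the real rendering + wandb logging code:)

from pytorch_lightning import Callback, LightningModule, Trainer

class RenderCallback(Callback):
    def on_validation_epoch_end(self, trainer: Trainer, pl_module: LightningModule) -> None:
        # Early returns like these are the kind of guards that can silently
        # skip rendering (see the discussion further down).
        if not trainer.is_global_zero:
            return
        examples = pl_module.store_examples["val"]
        if examples is None:
            return
        render_and_log(examples, epoch=trainer.current_epoch)  # hypothetical helper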
Are you passing the dataloaders correctly? Like this:
trainer.fit(model, train_dataloader, val_dataloader)
Yes, I do this, and the dataloaders are implemented via a LightningDataModule. The callback is not called: it never enters the on_validation_epoch_end method of the callback. It does when I use a single GPU though, and the same method is also reached normally for the base model attached above.
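(For reference, with a LightningDataModule the equivalent call is to pass the datamodule to fit — a minimal sketch, where MyDataModule is a hypothetical stand-in for the actual class:)

dm = MyDataModule()
trainer.fit(model, datamodule=dm)  # Lightning then calls dm.train_dataloader() / dm.val_dataloader()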
@athn-nik I cannot reproduce this. Here is a runnable example based on your configuration (but I removed all the code that was incomplete for me to run).
import torch
from torch.optim import Adam
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer, Callback
class RenderCallback(Callback):
def on_train_epoch_end(self, trainer: Trainer, pl_module: LightningModule,
**kwargs) -> None:
if trainer.is_global_zero:
print("on train epoch end in callback")
def on_validation_epoch_end(self, trainer: Trainer,
pl_module: LightningModule) -> None:
if trainer.is_global_zero:
# return self.call_renderer("val", trainer, pl_module)
print("on val epoch end in callback")
def on_test_epoch_end(self, trainer: Trainer,
pl_module: LightningModule) -> None:
# return self.call_renderer("test", trainer, pl_module)
print("on test epoch end in callback")
class RandomDataset(Dataset):
def __init__(self, size, length):
self.len = length
self.data = torch.randn(length, size)
def __getitem__(self, index):
return self.data[index]
def __len__(self):
return self.len
class BoringModel(LightningModule):
def __init__(self):
super().__init__()
self.layer = torch.nn.Linear(32, 2)
def forward(self, x):
return self.layer(x)
def training_step(self, batch, batch_idx):
loss = self(batch).sum()
self.log("train_loss", loss)
return {"loss": loss}
def validation_step(self, batch, batch_idx):
loss = self(batch).sum()
self.log("valid_loss", loss)
def test_step(self, batch, batch_idx):
loss = self(batch).sum()
self.log("test_loss", loss)
def configure_optimizers(self):
return Adam(self.parameters())
def training_epoch_end(self, outputs):
print("train epoch end on epoch", self.current_epoch)
def validation_epoch_end(self, outputs):
print("val epoch end on epoch", self.current_epoch)
def test_epoch_end(self, outputs):
print("test epoch end")
def run():
train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
test_data = DataLoader(RandomDataset(32, 64), batch_size=2)
model = BoringModel()
trainer = Trainer(
# auto_select_gpus=True
strategy="ddp",
benchmark=False,
max_epochs=1001,
accelerator="cpu",
devices=2,
log_every_n_steps=1,
deterministic=False,
detect_anomaly=False,
enable_progress_bar=False,
check_val_every_n_epoch=25,
limit_train_batches=1.0,
limit_val_batches=1.0,
num_sanity_val_steps=2,
callbacks=RenderCallback(),
)
trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
trainer.test(model, dataloaders=test_data)
if __name__ == "__main__":
run()
As you can see from the logs below, the validation epoch-end hooks are called every 25 epochs, as specified in the Trainer (each message appears twice because DDP runs two processes):
....
train epoch end on epoch 23
train epoch end on epoch 23
on train epoch end in callback
val epoch end on epoch 24 <--- here
val epoch end on epoch 24 <--- here
on val epoch end in callback <--- here
train epoch end on epoch 24
train epoch end on epoch 24
on train epoch end in callback
train epoch end on epoch 25
train epoch end on epoch 25
on train epoch end in callback
....
Please note that in your render method you have early returns based on some conditions. Can you check again that your observations are correct and that you weren't just misled by some missing logs?
Thanks a lot for your help! It seems like there was a mismatch in my epoch-index checks.
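(To make that off-by-one concrete — a hypothetical reconstruction of the mismatch, not the actual code: with check_val_every_n_epoch=25, validation runs when (current_epoch + 1) % 25 == 0, i.e. at zero-indexed epochs 24, 49, 74, ..., so a render guard keyed on current_epoch % 25 == 0 never fires during a validation epoch:)

def should_render(current_epoch: int, every_n: int = 25) -> bool:
    # Hypothetical guard: true at epochs 0, 25, 50, ...
    return current_epoch % every_n == 0

val_epochs = [e for e in range(100) if (e + 1) % 25 == 0]    # [24, 49, 74, 99]
render_epochs = [e for e in range(100) if should_render(e)]  # [0, 25, 50, 75]
assert not set(val_epochs) & set(render_epochs)              # they never coincide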