Closed thesofakillers closed 1 year ago
Hi @thesofakillers That's because the callback only implements the training hooks right now. Adding support for multiple stages would be welcome!
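To illustrate why a callback that only implements training hooks stays silent during test/eval, here is a minimal stdlib-only mock of hook dispatch. The class and hook names are simplified stand-ins, not the actual Lightning internals:

```python
# Simplified sketch (NOT real Lightning code): a callback that defines
# only a training hook logs nothing when the eval loop runs.

class DeviceStatsLike:
    """Logs stats, but only from the training hook."""
    def __init__(self):
        self.logged = []

    def on_train_batch_start(self, batch_idx):
        self.logged.append(("train", batch_idx))

    # No on_test_batch_start / on_validation_batch_start defined,
    # so eval loops have nothing to call on this callback.

def run_loop(callback, stage, num_batches):
    hook_name = f"on_{stage}_batch_start"
    for i in range(num_batches):
        hook = getattr(callback, hook_name, None)
        if hook is not None:  # missing hooks are simply skipped
            hook(i)

cb = DeviceStatsLike()
run_loop(cb, "train", 3)  # records 3 entries
run_loop(cb, "test", 3)   # records nothing: hook not implemented
print(cb.logged)          # only "train" entries appear
```

Adding eval support would amount to implementing the corresponding test/validation hooks on the callback.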
Hi, I'd like to work on this. I'm new to this library, but I'm currently reading through everything related: the Trainer run function, the train/eval/predict Loops and EpochLoops, and the logger connector.
Currently, for 'fit' runs, DeviceStatsMonitor only logs every n steps, as defined by the Trainer's 'log_every_n_steps' argument.
How do we decide how often to log for 'test' (i.e. 'eval') runs? With the same 'log_every_n_steps' variable, or something else?
To have a base to start with, here is a fork where I enabled DeviceStatsMonitor logging for eval runs: f37d373e603cb82e34c07b932eda706c43cb1830
Test code used:
import os

import numpy as np
import pytorch_lightning as pl
import torch
import torch.nn as nn
from pytorch_lightning.callbacks import DeviceStatsMonitor
from torch.nn import MSELoss
from torch.optim import Adam
from torch.utils.data import DataLoader, Dataset


class SimpleDataset(Dataset):
    def __init__(self):
        X = np.arange(10000)
        y = X * 2
        X = [[n] for n in X]
        y = [[n] for n in y]
        self.X = torch.Tensor(X)
        self.y = torch.Tensor(y)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return {"X": self.X[idx], "y": self.y[idx]}


class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1, 1)
        self.criterion = MSELoss()

    def forward(self, inputs_id, labels=None):
        outputs = self.fc(inputs_id)
        loss = 0
        if labels is not None:
            loss = self.criterion(outputs, labels)
        return loss, outputs

    def train_dataloader(self):
        dataset = SimpleDataset()
        return DataLoader(dataset, batch_size=1000)

    def test_dataloader(self):
        dataset = SimpleDataset()
        return DataLoader(dataset, batch_size=1000)

    def training_step(self, batch, batch_idx):
        input_ids = batch["X"]
        labels = batch["y"]
        loss, outputs = self(input_ids, labels)
        return {"loss": loss, "outputs": outputs}

    def test_step(self, batch, batch_idx):
        input_ids = batch["X"]
        labels = batch["y"]
        loss, outputs = self(input_ids, labels)
        return {"loss": loss}

    def configure_optimizers(self):
        optimizer = Adam(self.parameters())
        return optimizer


if __name__ == "__main__":
    print('hello' + os.getcwd() + 'hello')
    model = MyModel()
    logger = pl.loggers.CSVLogger(save_dir="example", name="test")
    trainer = pl.Trainer(
        logger=logger,
        max_epochs=5,
        callbacks=[DeviceStatsMonitor(cpu_stats=True)],
        log_every_n_steps=1,
    )
    trainer.fit(model)
    trainer.test(model)
I logged at the same every-n-steps cadence as fit runs. However, the fit epoch loop keeps track of a _batches_that_stepped counter, while the eval epoch loop does not. As far as I can tell (not certain), the eval epoch loop's batch_progress.total.completed variable tracks the same thing.
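To make the counter question concrete, here is a toy model of the two bookkeeping schemes. The names are borrowed from Lightning, but the behaviour is my assumption, not a trace of the real loops:

```python
# Toy comparison (simplified; real Lightning tracks far more state).
# Assumption: in the simple case, both counters advance once per batch,
# so they stay in lockstep. (During fit, _batches_that_stepped advances
# on optimizer steps, so e.g. gradient accumulation could make them
# diverge -- eval has no optimizer stepping, hence the question.)

class FitEpochLoopToy:
    def __init__(self):
        self._batches_that_stepped = 0  # advanced when the optimizer steps

    def run_batch(self):
        self._batches_that_stepped += 1

class EvalEpochLoopToy:
    def __init__(self):
        self.total_completed = 0  # stand-in for batch_progress.total.completed

    def run_batch(self):
        self.total_completed += 1

fit_loop, eval_loop = FitEpochLoopToy(), EvalEpochLoopToy()
for _ in range(5):
    fit_loop.run_batch()
    eval_loop.run_batch()
print(fit_loop._batches_that_stepped, eval_loop.total_completed)  # 5 5
```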
Clarifications/comments/instructions welcome!
Once I know what to do, I will continue, add support for logging in the predict loop, and eventually send a PR.
not stale
Bug description
I would like to use DeviceStatsMonitor during a trainer.test() call. I followed the relevant documentation, which makes no mention of this callback being exclusive to trainer.fit().
Despite following the docs, I get no device stats logs in my TensorBoard.
How to reproduce the bug
Run the following script. You will see that no stats are logged, despite the DeviceStatsMonitor callback being present.
Environment
More info
I have verified this on both GPU and CPU. The example above uses CPU.
cc @borda @awaelchli