Start training using CLI on Slurm cluster #16970

Closed leopold-franz closed 1 year ago

leopold-franz commented 1 year ago

Bug description

Hi, I'm trying to run a simple PyTorch Lightning model training on MNIST data using the Lightning CLI (with a YAML config) as a Slurm job.

How to reproduce the bug

I'm starting the Slurm job using sbatch:

#!/bin/bash -l

#SBATCH --nodes=1             # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=1   # This needs to match Trainer(devices=...)
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=5240
#SBATCH --gpus=1
#SBATCH --time=01:00:00
#SBATCH --mail-type=BEGIN,END

# activate conda env
# source activate $1

# debugging flags (optional)

# on your cluster you might need these:
# set the network interface
# export NCCL_SOCKET_IFNAME=^docker0,lo

# might need the latest CUDA
# module load NCCL/2.4.7-1-cuda.10.0

# run script from above
srun python3 fit --config config.yaml
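
The comments in the script above say that the #SBATCH values need to match the Trainer arguments. A minimal sketch of how that could be checked from inside the job, assuming the allocation is read from the standard Slurm environment variables SLURM_JOB_NUM_NODES and SLURM_NTASKS_PER_NODE; the helper name `slurm_mismatches` is hypothetical and not part of Lightning:

```python
import os
from typing import List, Mapping


def slurm_mismatches(num_nodes: int, devices: int,
                     env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a description of every Slurm setting that disagrees with the Trainer values.

    Hypothetical helper for illustration only, not a Lightning API.
    """
    problems = []
    # --nodes is exposed as SLURM_JOB_NUM_NODES,
    # --ntasks-per-node as SLURM_NTASKS_PER_NODE.
    expected = {
        "SLURM_JOB_NUM_NODES": num_nodes,
        "SLURM_NTASKS_PER_NODE": devices,
    }
    for var, wanted in expected.items():
        actual = env.get(var)
        if actual is None:
            # Variable not set (for example when not running under Slurm): skip it.
            continue
        if int(actual) != wanted:
            problems.append(f"{var}={actual} but the Trainer setting is {wanted}")
    return problems


if __name__ == "__main__":
    # Values taken from the config in this report: one node, one device.
    for problem in slurm_mismatches(num_nodes=1, devices=1):
        print(problem)
```

With the script above (--nodes=1, --ntasks-per-node=1) and devices: 1 in the config, this should report nothing.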

config.yaml file:

seed_everything_default: null
trainer:
  accelerator: gpu
  limit_train_batches: 100
  max_epochs: 500
  devices: 1
  logger: true
  callbacks:
    - class_path: pytorch_lightning.callbacks.ModelCheckpoint
      init_args:
        save_top_k: 1
        monitor: 'val_loss'
        mode: min
        filename: 'vit-best'
    - class_path: pytorch_lightning.callbacks.ModelCheckpoint
      init_args:
        save_last: true
        filename: 'vit-last'
ckpt_path: null
log_dir: /cluster/dir/to/log

from pytorch_lightning.cli import LightningCLI

import os
from torch import optim, nn, utils, Tensor
from torch.utils.data import DataLoader, random_split
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import pytorch_lightning as pl

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

# define the LightningModule
class LitAutoEncoder(pl.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()  # required before assigning submodules
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # it is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard (if installed) by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

class MNISTDataModule(pl.LightningDataModule):
    def __init__(self, data_dir: str = os.getcwd(), batch_size: int = 32):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size

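    def setup(self, stage: str):
        # ToTensor converts the PIL images so the DataLoader can batch them
        self.mnist_test = MNIST(self.data_dir, train=False, transform=ToTensor())
        self.mnist_predict = MNIST(self.data_dir, train=False, transform=ToTensor())
        mnist_full = MNIST(self.data_dir, train=True, transform=ToTensor())
        self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])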

    def train_dataloader(self):
        return DataLoader(self.mnist_train, batch_size=self.batch_size)

    def val_dataloader(self):
        return DataLoader(self.mnist_val, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.mnist_test, batch_size=self.batch_size)

def cli_main():
    cli = LightningCLI(LitAutoEncoder, MNISTDataModule)
    # note: don't call fit!!

if __name__ == "__main__":
    cli_main()

Error messages and logs

slurm-9842342.out (file where stdout is printed)

2023-03-06 17:02:07.694344: I tensorflow/core/util/] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
usage: [-h] [-c CONFIG] [--print_config[=flags]]
                   {fit,validate,test,predict,tune} ...
error: 'Configuration check failed :: No action for destination key "seed_everything_default" to check its value.'
srun: error: eu-g2-16: task 0: Exited with exit code 2


More info

No response

cc @carmocca @mauvilsa

awaelchli commented 1 year ago

Hey, I think the problem is that these keys in the config.yaml are not allowed:

seed_everything_default: null
log_dir: /cluster/dir/to/log

They don't match any option of the CLI or the Trainer.

Perhaps it should be

seed_everything: false
trainer:
    default_root_dir: "/cluster/dir/to/log"
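
To make the idea concrete, here is a rough sketch of the kind of check that rejects the config. It is not how LightningCLI is implemented (the real parser builds its option list from the Trainer, model and datamodule signatures); the set of allowed top-level keys below is an assumption for the `fit` subcommand, and `unknown_top_level_keys` is a hypothetical helper:

```python
# Assumed top-level keys for `fit`; the real list comes from the CLI parser.
ALLOWED_TOP_LEVEL = {"seed_everything", "trainer", "model", "data", "ckpt_path"}


def unknown_top_level_keys(config: dict) -> list:
    """Return the top-level config keys that are not in the assumed allowed set."""
    return sorted(key for key in config if key not in ALLOWED_TOP_LEVEL)


# Top-level keys from the config in the report (trainer contents shortened).
original = {
    "seed_everything_default": None,
    "trainer": {"accelerator": "gpu", "devices": 1},
    "ckpt_path": None,
    "log_dir": "/cluster/dir/to/log",
}

# The same config with the two suggested changes applied.
suggested = {
    "seed_everything": False,
    "trainer": {
        "accelerator": "gpu",
        "devices": 1,
        "default_root_dir": "/cluster/dir/to/log",
    },
    "ckpt_path": None,
}

print(unknown_top_level_keys(original))   # ['log_dir', 'seed_everything_default']
print(unknown_top_level_keys(suggested))  # []
```

The `--print_config` option listed in the usage message can also be used to print the keys the CLI actually accepts, which is a more reliable reference than a hand-written list like the one above.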

awaelchli commented 1 year ago

I tried to help here. Did you find out what the problem was? Please let me know.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

leopold-franz commented 1 year ago

Yes, sorry, I forgot to answer. I somehow messed up a lot of the key settings, so you were right. Thank you for your help.

awaelchli commented 1 year ago

Thanks for confirming that it worked. Happy this was helpful.