Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Start training using CLI on Slurm cluster #16970

Closed leopold-franz closed 1 year ago

leopold-franz commented 1 year ago

Bug description

Hi, I'm trying to run a simple PyTorch Lightning model training on MNIST data using the Lightning CLI (with a YAML config) as a Slurm job.

How to reproduce the bug

I'm starting the Slurm job using: sbatch train_submit.sh

train_submit.sh:

#!/bin/bash -l

# SLURM SUBMIT SCRIPT
#SBATCH --nodes=1             # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=1   # This needs to match Trainer(devices=...)
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=5240
#SBATCH --gpus=1
#SBATCH --time=01:00:00
#SBATCH --mail-type=BEGIN,END

# activate conda env
# source activate $1

# debugging flags (optional)
# export NCCL_DEBUG=INFO
# export PYTHONFAULTHANDLER=1

# on your cluster you might need these:
# set the network interface
# export NCCL_SOCKET_IFNAME=^docker0,lo

# might need the latest CUDA
# module load NCCL/2.4.7-1-cuda.10.0

# run script from above
srun python3 cli_test.py fit --config config.yaml
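
The two "needs to match" comments in the submit script can be checked from inside the job. Below is a minimal sketch, not part of the original report: the helper name `slurm_matches_trainer` is hypothetical, and it assumes the cluster's Slurm exports `SLURM_NNODES` and `SLURM_NTASKS_PER_NODE` to the job (variables that are not set are skipped).

```python
import os


def slurm_matches_trainer(num_nodes, devices, env=None):
    """Compare Slurm's allocation (read from env vars) with the intended
    Trainer(num_nodes=..., devices=...) settings.

    Returns a list of mismatch messages; an empty list means consistent.
    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    # ASSUMPTION: these variables are exported by Slurm inside the job.
    expected = {"SLURM_NNODES": num_nodes, "SLURM_NTASKS_PER_NODE": devices}
    problems = []
    for name, want in expected.items():
        got = env.get(name)
        # Skip variables that are not set instead of treating them as errors.
        if got is not None and got != str(want):
            problems.append(f"{name}={got} but Trainer expects {want}")
    return problems


# Simulated environment for --nodes=1 and --ntasks-per-node=1.
fake_env = {"SLURM_NNODES": "1", "SLURM_NTASKS_PER_NODE": "1"}
print(slurm_matches_trainer(num_nodes=1, devices=1, env=fake_env))  # prints []
```

Passing `env` explicitly keeps the check testable outside a cluster; inside a real job it can be left out so the function reads `os.environ`.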

config.yaml file:

seed_everything_default: null
trainer:
  accelerator: gpu
  limit_train_batches: 100
  max_epochs: 500
  devices: 1
  logger: true
  callbacks:
    - class_path: pytorch_lightning.callbacks.ModelCheckpoint
      init_args:
        save_top_k: 1
        monitor: 'val_loss'
        mode: min
        filename: 'vit-best'
    - class_path: pytorch_lightning.callbacks.ModelCheckpoint
      init_args:
        save_last: true
        filename: 'vit-last'
ckpt_path: null
log_dir: /cluster/dir/to/log

cli_test.py:

# cli_test.py
import os

import pytorch_lightning as pl
from pytorch_lightning.cli import LightningCLI
from torch import nn, optim
# DataLoader and random_split are used by MNISTDataModule below
from torch.utils.data import DataLoader, random_split
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

# define the LightningModule
class LitAutoEncoder(pl.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # it is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard (if installed) by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

class MNISTDataModule(pl.LightningDataModule):
    def __init__(self, data_dir: str = os.getcwd(), batch_size: int = 32):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size

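    def setup(self, stage: str):
        # ToTensor() converts the PIL images to tensors so batches can be collated
        self.mnist_test = MNIST(self.data_dir, train=False, transform=ToTensor())
        self.mnist_predict = MNIST(self.data_dir, train=False, transform=ToTensor())
        mnist_full = MNIST(self.data_dir, train=True, transform=ToTensor())
        self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])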

    def train_dataloader(self):
        return DataLoader(self.mnist_train, batch_size=self.batch_size)

    def val_dataloader(self):
        return DataLoader(self.mnist_val, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.mnist_test, batch_size=self.batch_size)

def cli_main():
    cli = LightningCLI(LitAutoEncoder, MNISTDataModule)
    # note: don't call fit!!

if __name__ == "__main__":
    cli_main()

Error messages and logs

slurm-9842342.out (file where stdout is printed)

2023-03-06 17:02:07.694344: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
usage: cli_test.py [-h] [-c CONFIG] [--print_config[=flags]]
                   {fit,validate,test,predict,tune} ...
cli_test.py: error: 'Configuration check failed :: No action for destination key "seed_everything_default" to check its value.'
srun: error: eu-g2-16: task 0: Exited with exit code 2

Environment

Current environment

```
* CUDA:
  - GPU:
    - NVIDIA GeForce GTX 1080 Ti
  - available: True
  - version: 11.3
* Lightning:
  - lightning-utilities: 0.7.1
  - pytorch-ignite: 0.4.10
  - pytorch-lightning: 1.9.4
  - pytorch3dunet: 1.3.3
  - torch: 1.11.0+cu113
  - torch-cluster: 1.6.0
  - torch-fidelity: 0.3.0
  - torch-geometric: 2.0.4
  - torch-scatter: 2.0.9
  - torch-sparse: 0.6.13
  - torch-spline-conv: 1.2.1
  - torchaudio: 0.11.0+cu113
  - torchmetrics: 0.11.3
  - torchvision: 0.12.0+cu113
* Packages:
  - absl-py: 1.0.0
  - accesscontrol: 5.3.1
  - acquisition: 4.10
  - affine: 2.3.1
  - aiohttp: 3.8.1
  - aiohttp-cors: 0.7.0
  - aioredis: 2.0.1
  - aiosignal: 1.2.0
  - alabaster: 0.7.12
  - alembic: 1.8.1
  - amply: 0.1.5
  - aniso8601: 9.0.1
  - anndata: 0.8.0
  - antlr4-python3-runtime: 4.9.3
  - anyio: 3.6.1
  - app-model: 0.1.1
  - appdirs: 1.4.4
  - apptools: 5.1.0
  - argcomplete: 2.0.0
  - argh: 0.26.2
  - argon2: 0.1.10
  - argon2-cffi: 21.3.0
  - argon2-cffi-bindings: 21.2.0
  - arviz: 0.12.1
  - ase: 3.22.1
  - asn1crypto: 1.5.1
  - astor: 0.8.1
  - asttokens: 2.0.5
  - astunparse: 1.6.3
  - async-generator: 1.10
  - async-timeout: 4.0.2
  - atomicwrites: 1.4.0
  - attrs: 21.4.0
  - audioread: 2.1.9
  - authencoding: 4.3
  - autopage: 0.5.1
  - autopep8: 1.6.0
  - aws-requests-auth: 0.4.3
  - babel: 2.10.1
  - backcall: 0.2.0
  - beautifulsoup4: 4.11.1
  - bidict: 0.22.0
  - bids-validator: 1.9.3
  - biopython: 1.79
  - bitstring: 3.1.9
  - black: 22.3.0
  - bleach: 5.0.0
  - blessings: 1.7
  - blurhash: 1.1.4
  - bokeh: 2.4.3
  - boost: 0.1
  - boto3: 1.23.10
  - botocore: 1.26.10
  - bottleneck: 1.3.4
  - btrees: 4.10.0
  - build: 0.10.0
  - cachetools: 5.2.0
  - cachey: 0.2.1
  - cellmodeller: b-v4.3-42-g96ab099-
  - certifi: 2022.5.18.1
  - certipy: 0.1.3
  - cffi: 1.15.0
  - cftime: 1.6.0
  - chainer: 7.8.1
  - chameleon: 3.10.1
  - chardet: 4.0.0
  - charset-normalizer: 2.0.12
  - chex: 0.1.3
  - clang: 14.0
  - click: 8.1.3
  - click-plugins: 1.1.1
  - cligj: 0.7.2
  - clikit: 0.6.2
  - cloudpickle: 2.1.0
  - cmaes: 0.9.1
  - cmake: 3.24.1.1
  - cmd2: 2.4.1
  - codecov: 2.1.12
  - colorama: 0.4.4
  - coloredlogs: 15.0.1
  - colorful: 0.5.4
  - colorlog: 6.6.0
  - colorlover: 0.3.0
  - colormath: 3.0.0
  - commonmark: 0.9.1
  - configargparse: 1.5.3
  - configobj: 5.0.6
  - configparser: 5.2.0
  - connection-pool: 0.0.3
  - contextlib2: 21.6.0
  - coverage: 6.4
  - crashtest: 0.3.1
  - cryptography: 38.0.4
  - cucim: 23.2.0
  - cufflinks: 0.17.3
  - cupy-cuda11x: 11.1.0
  - cutadapt: 4.0
  - cutensor: 1.6.0.3
  - cvxopt: 1.3.0
  - cvxpy: 1.2.1
  - cycler: 0.11.0
  - cython: 0.29.30
  - dask: 2022.5.2
  - databricks-cli: 0.17.4
  - datasets: 2.5.1
  - datetime: 4.4
  - datrie: 0.8.2
  - deap: 1.3.1
  - debtcollector: 2.5.0
  - debugpy: 1.6.0
  - decorator: 5.1.1
  - deepdiff: 5.8.1
  - defusedxml: 0.7.1
  - deprecated: 1.2.13
  - deprecation: 2.1.0
  - descartes: 1.1.0
  - dill: 0.3.5.1
  - distributed: 2022.5.2
  - distro: 1.8.0
  - dm-tree: 0.1.7
  - dnaio: 0.9.0
  - dnspython: 2.2.1
  - docker: 6.0.1
  - docker-pycreds: 0.4.0
  - docopt: 0.6.2
  - docstring-parser: 0.15
  - documenttemplate: 4.0
  - docutils: 0.17.1
  - dpath: 2.0.6
  - easydict: 1.9
  - ecos: 2.0.10
  - einops: 0.4.1
  - entrypoints: 0.4
  - envisage: 6.0.1
  - ephem: 4.1.3
  - esda: 2.4.1
  - et-xmlfile: 1.1.0
  - etils: 0.8.0
  - eventlet: 0.33.1
  - evo: 1.18.1
  - executing: 0.8.3
  - extensionclass: 4.6
  - extras: 1.0.0
  - fasteners: 0.17.3
  - fastjsonschema: 2.15.3
  - fastprogress: 1.0.2
  - fastrlock: 0.8
  - filelock: 3.7.0
  - findlibs: 0.0.2
  - fiona: 1.8.22
  - fire: 0.5.0
  - flask: 2.1.2
  - flask-cors: 3.0.10
  - flask-json: 0.3.4
  - flask-restplus: 0.13.0
  - flask-restx: 0.5.1
  - flatbuffers: 1.12
  - flit: 3.7.1
  - flit-core: 3.7.1
  - flowvision: 0.2.0
  - follicle-tracker: 0.1.dev221+gc3cd246
  - fonttools: 4.33.3
  - freetype-py: 2.3.0
  - frozenlist: 1.3.0
  - fsspec: 2022.5.0
  - funcsigs: 1.0.2
  - future: 0.18.2
  - futurist: 2.4.1
  - gast: 0.4.0
  - gdown: 4.4.0
  - geopandas: 0.12.2
  - gevent: 21.12.0
  - giddy: 2.3.3
  - gitdb: 4.0.9
  - gitdb2: 4.0.2
  - gitpython: 3.1.27
  - gmpy2: 2.1.5
  - google-api-core: 2.8.1
  - google-auth: 2.6.6
  - google-auth-oauthlib: 0.4.6
  - google-pasta: 0.2.0
  - googleapis-common-protos: 1.56.2
  - googledrivedownloader: 0.4
  - gpaw: 22.8.0
  - gprmax: 3.1.4
  - gpustat: 0.6.0
  - grabbit: 0.2.6
  - graphtools: 1.5.2
  - greenlet: 1.1.2
  - grpcio: 1.46.3
  - gunicorn: 20.1.0
  - h3: 3.7.4
  - h5py: 3.7.0
  - haversine: 2.5.1
  - hdbscan: 0.8.29
  - heapdict: 1.0.1
  - hiredis: 2.0.0
  - hsluv: 5.0.3
  - html5lib: 1.1
  - httpstan: 4.8.2
  - huggingface-hub: 0.7.0
  - humanfriendly: 10.0
  - hydra-core: 1.2.0
  - hyperopt: 0.2.7
  - idna: 3.3
  - ifcfg: 0.22
  - imagecodecs: 2023.1.23
  - imageio: 2.19.3
  - imageio-ffmpeg: 0.4.7
  - imagesize: 1.3.0
  - importlib-metadata: 4.11.4
  - importlib-resources: 5.7.1
  - in-n-out: 0.1.7
  - inequality: 1.0.0
  - iniconfig: 1.1.1
  - install: 1.3.5
  - iopath: 0.1.6
  - ipdb: 0.13.9
  - ipykernel: 6.13.0
  - ipython: 8.4.0
  - ipython-genutils: 0.2.0
  - ipywidgets: 7.7.0
  - isal: 0.11.1
  - iso3166: 2.0.2
  - iso8601: 1.0.2
  - isodate: 0.6.1
  - iteration-utilities: 0.11.0
  - itk: 5.3.0
  - itk-core: 5.3.0
  - itk-filtering: 5.3.0
  - itk-io: 5.3.0
  - itk-numerics: 5.3.0
  - itk-registration: 5.3.0
  - itk-segmentation: 5.3.0
  - itsdangerous: 2.1.2
  - jax: 0.3.23
  - jaxlib: 0.3.22+cuda11.cudnn82
  - jedi: 0.18.1
  - jeepney: 0.8.0
  - jieba: 0.42.1
  - jinja2: 3.1.2
  - jmespath: 1.0.0
  - joblib: 1.1.0
  - json-tricks: 3.16.1
  - json5: 0.9.8
  - jsonargparse: 4.20.0
  - jsonlines: 1.2.0
  - jsonpickle: 2.2.0
  - jsonpointer: 2.3
  - jsonschema: 4.5.1
  - jupyter: 1.0.0
  - jupyter-client: 7.3.1
  - jupyter-console: 6.4.3
  - jupyter-contrib-core: 0.3.3
  - jupyter-core: 4.10.0
  - jupyter-highlight-selected-word: 0.2.0
  - jupyter-server: 1.17.0
  - jupyter-telemetry: 0.1.0
  - jupyterlab: 3.4.2
  - jupyterlab-pygments: 0.2.2
  - jupyterlab-server: 2.14.0
  - jupyterlab-widgets: 1.1.0
  - keras: 2.9.0
  - keras-preprocessing: 1.1.2
  - keyring: 23.5.1
  - kiwisolver: 1.4.2
  - lazy-object-proxy: 1.7.1
  - libclang: 14.0.1
  - libpysal: 4.6.2
  - lightning-utilities: 0.7.1
  - llvmlite: 0.38.1
  - lmdb: 1.4.0
  - locket: 1.0.0
  - logutils: 0.3.5
  - loompy: 3.0.7
  - lxml: 4.8.0
  - lz4: 4.0.1
  - lzstring: 1.0.4
  - mageck: 0.5.9.4
  - magicgui: 0.7.0
  - mako: 1.2.0
  - mapclassify: 2.4.3
  - markdown: 3.3.7
  - markupsafe: 2.1.1
  - marshmallow: 3.18.0
  - mastodon.py: 1.8.0
  - matplotlib: 3.5.2
  - matplotlib-inline: 0.1.3
  - mccabe: 0.7.0
  - mercantile: 1.2.1
  - mgwr: 2.1.2
  - mistune: 0.8.4
  - mlflow: 2.2.1
  - mock: 4.0.3
  - monai: 1.1.0
  - more-itertools: 8.13.0
  - mpi4py: 3.1.4
  - mpmath: 1.2.1
  - msgpack: 1.0.3
  - multidict: 6.0.2
  - multimapping: 4.1
  - multipart: 0.2.4
  - multiprocess: 0.70.13
  - multiqc: 1.13
  - munch: 2.5.0
  - mypy-extensions: 0.4.3
  - napari: 0.4.17
  - napari-console: 0.0.7
  - napari-plugin-engine: 0.2.0
  - napari-svg: 0.1.6
  - natsort: 8.1.0
  - nbclassic: 0.3.7
  - nbclient: 0.6.3
  - nbconvert: 6.5.0
  - nbformat: 5.4.0
  - nbsphinx: 0.8.8
  - nest-asyncio: 1.5.5
  - netaddr: 0.8.0
  - netcdf4: 1.5.8
  - netifaces: 0.11.0
  - networkx: 2.8.2
  - nibabel: 3.2.2
  - ninja: 1.11.1
  - nipy: 0.5.0
  - nltk: 3.7
  - nni: 2.10
  - nose: 1.3.7
  - nose-timer: 1.0.1
  - notebook: 6.4.11
  - notebook-shim: 0.1.0
  - npe2: 0.6.2
  - nptyping: 2.5.0
  - num2words: 0.5.10
  - numba: 0.55.2
  - numexpr: 2.8.1
  - numpy: 1.22.4
  - numpy-groupies: 0.9.16
  - numpy-quaternion: 2022.4.2
  - numpydoc: 1.5.0
  - nvidia-ml-py3: 7.352.0
  - oauthlib: 3.2.0
  - omegaconf: 2.2.2
  - opencensus: 0.9.0
  - opencensus-context: 0.1.2
  - opencv-contrib-python: 4.5.5.64
  - opencv-python: 4.5.5.64
  - openpyxl: 3.0.10
  - openseespy: 3.3.0.1.1
  - openseespylinux: 3.4.0.1
  - openslide-python: 1.1.2
  - opt-einsum: 3.3.0
  - optax: 0.1.2
  - optuna: 3.1.0
  - ordered-set: 4.1.0
  - os-service-types: 1.7.0
  - oslo.i18n: 5.1.0
  - osmnx: 1.2.2
  - osqp: 0.6.2.post5
  - ovary-analysis: 0.0.3
  - overpy: 0.6
  - packaging: 21.3
  - pamela: 1.0.0
  - pandas: 1.4.2
  - pandas-datareader: 0.10.0
  - pandoc: 2.2
  - pandocfilters: 1.5.0
  - parso: 0.8.3
  - partd: 1.2.0
  - paste: 3.5.0
  - pastedeploy: 2.1.1
  - pastel: 0.2.1
  - pathos: 0.2.9
  - pathspec: 0.9.0
  - pathtools: 0.1.2
  - patsy: 0.5.2
  - pbr: 5.9.0
  - persistence: 3.3
  - persistent: 4.9.0
  - pert: 2019.11
  - pexpect: 4.8.0
  - pickleshare: 0.7.5
  - pillow: 9.1.1
  - pint: 0.19.2
  - pip: 22.2.2
  - pkginfo: 1.8.2
  - plac: 1.3.5
  - platformdirs: 2.5.2
  - plotly: 5.8.0
  - pluggy: 1.0.0
  - plumbum: 1.7.2
  - ply: 3.11
  - pointpats: 2.2.0
  - pooch: 1.6.0
  - portalocker: 2.4.0
  - pox: 0.3.1
  - ppft: 1.7.6.5
  - prettytable: 3.3.0
  - prometheus-client: 0.14.1
  - promise: 2.3
  - prompt-toolkit: 3.0.29
  - protobuf: 3.19.4
  - psutil: 5.9.1
  - psygnal: 0.8.1
  - ptyprocess: 0.7.0
  - pulp: 2.6.0
  - pure-eval: 0.2.2
  - py: 1.11.0
  - py-spy: 0.3.12
  - py4design: 0.28
  - py4j: 0.10.9.5
  - pyarrow: 9.0.0
  - pyasn1: 0.4.8
  - pyasn1-modules: 0.2.8
  - pybind11: 2.9.2
  - pybis: 1.35.2
  - pybufrkit: 0.2.19
  - pycocotools: 2.0.4
  - pycodestyle: 2.8.0
  - pycollada: 0.7.2
  - pycparser: 2.21
  - pydantic: 1.10.5
  - pydicom: 2.3.1
  - pydot: 1.4.2
  - pyepsg: 0.4.0
  - pyface: 7.4.1
  - pyfaidx: 0.6.4
  - pyflakes: 2.5.0
  - pyglet: 1.5.26
  - pygments: 2.12.0
  - pygsp: 0.5.1
  - pygsti: 0.9.10.1
  - pyinotify: 0.9.6
  - pyjwt: 2.6.0
  - pylev: 1.4.0
  - pymeshfix: 0.16.2
  - pymf: 0.1.9
  - pymongo: 4.1.1
  - pynrrd: 1.0.0
  - pyomo: 6.4.1
  - pyopencl: 2022.1.5
  - pyopengl: 3.1.6
  - pyopenssl: 22.1.0
  - pyparsing: 3.0.9
  - pyperclip: 1.8.2
  - pyproj: 3.4.1
  - pyproject-hooks: 1.0.0
  - pypsa: 0.19.3
  - pyqt5: 5.15.6
  - pyqt5-qt5: 5.15.2
  - pyqt5-sip: 12.10.1
  - pyro4: 4.82
  - pyrsistent: 0.18.1
  - pysam: 0.19.1
  - pyshp: 2.3.0
  - pysimdjson: 3.2.0
  - pysocks: 1.7.1
  - pystan: 3.5.0
  - pytest: 7.1.2
  - python-dateutil: 2.8.2
  - python-engineio: 4.3.2
  - python-gettext: 4.0
  - python-json-logger: 2.0.4
  - python-louvain: 0.16
  - python-magic: 0.4.27
  - python-socketio: 5.6.0
  - pythonwebhdfs: 0.2.3
  - pytoml: 0.1.21
  - pytomlpp: 1.0.11
  - pytools: 2022.1.9
  - pytorch-ignite: 0.4.10
  - pytorch-lightning: 1.9.4
  - pytorch3dunet: 1.3.3
  - pytz: 2022.1
  - pyutilib: 6.0.0
  - pyutillib: 0.3.0
  - pyvista: 0.38.3
  - pywavelets: 1.3.0
  - pyxlsb: 1.0.9
  - pyyaml: 6.0
  - pyzmq: 23.0.0
  - qdldl: 0.1.5.post2
  - qtconsole: 5.3.0
  - qtpy: 2.1.0
  - quantecon: 0.5.3
  - querystring-parser: 1.2.4
  - quilt3: 5.0.0
  - rasterio: 1.3.6
  - rasterstats: 0.18.0
  - ratelimiter: 1.2.0.post0
  - rdflib: 6.1.1
  - readme-renderer: 35.0
  - recommonmark: 0.7.1
  - redis: 4.3.1
  - rednose: 1.3.0
  - regex: 2022.4.24
  - reportlab: 3.6.9
  - repoze.lru: 0.7
  - requests: 2.28.2
  - requests-futures: 1.0.0
  - requests-oauthlib: 1.3.1
  - requests-toolbelt: 0.9.1
  - requests-unixsocket: 0.3.0
  - requestsexceptions: 1.4.0
  - resampy: 0.2.2
  - responses: 0.18.0
  - restrictedpython: 5.3a1.dev0
  - retry: 0.9.2
  - retrying: 1.3.3
  - rfc3986: 2.0.0
  - rich: 12.4.4
  - rich-click: 1.5.2
  - roman: 3.3
  - rosbags: 0.9.11
  - routes: 2.5.1
  - rsa: 4.8
  - rtree: 1.0.0
  - ruamel.yaml: 0.17.21
  - ruamel.yaml.clib: 0.2.6
  - rvlib: 0.0.6
  - s3transfer: 0.5.2
  - salib: 1.4.5
  - schema: 0.7.5
  - scikit-build: 0.16.7
  - scikit-fmm: 2022.3.26
  - scikit-image: 0.19.2
  - scikit-learn: 1.1.1
  - scipy: 1.8.1
  - scons: 4.4.0
  - scooby: 0.7.1
  - scs: 3.2.0
  - seaborn: 0.11.2
  - secretstorage: 3.3.2
  - semver: 2.13.0
  - send2trash: 1.8.0
  - sentence-transformers: 2.2.0
  - sentencepiece: 0.1.96
  - sentry-sdk: 1.5.12
  - serpent: 1.40
  - setproctitle: 1.2.3
  - setuptools: 58.1.0
  - setuptools-scm: 6.4.2
  - shap: 0.41.0
  - shapely: 1.8.5.post1
  - shortuuid: 1.0.9
  - simplegeneric: 0.8.1
  - simplejson: 3.17.6
  - six: 1.16.0
  - slicer: 0.0.7
  - smart-open: 6.0.0
  - smmap: 5.0.0
  - smmap2: 3.0.1
  - snakemake: 7.8.0
  - sniffio: 1.2.0
  - snowballstemmer: 2.2.0
  - snuggs: 1.4.7
  - sortedcontainers: 2.4.0
  - soupsieve: 2.3.2.post1
  - spaghetti: 1.6.5
  - spectra: 0.0.11
  - spglm: 1.0.8
  - sphinx: 4.5.0
  - sphinxcontrib-applehelp: 1.0.2
  - sphinxcontrib-devhelp: 1.0.2
  - sphinxcontrib-htmlhelp: 2.0.0
  - sphinxcontrib-jsmath: 1.0.1
  - sphinxcontrib-qthelp: 1.0.3
  - sphinxcontrib-serializinghtml: 1.1.5
  - sphinxcontrib-websupport: 1.2.4
  - spint: 1.0.7
  - spreg: 1.2.4
  - spvcm: 0.3.0
  - sqlalchemy: 1.4.37
  - sqlparse: 0.4.2
  - stack-data: 0.2.0
  - staticmap: 0.5.5
  - statsd: 3.3.0
  - statsmodels: 0.13.2
  - stevedore: 3.5.0
  - stopit: 1.1.2
  - subprocess32: 3.5.4
  - superqt: 0.4.1
  - svg.path: 6.0
  - sympy: 1.10.1
  - tables: 3.7.0
  - tabulate: 0.8.9
  - tasklogger: 1.1.2
  - tblib: 1.7.0
  - tempita: 0.5.2
  - tenacity: 8.0.1
  - tensorboard: 2.9.0
  - tensorboard-data-server: 0.6.1
  - tensorboard-plugin-wit: 1.8.1
  - tensorboardx: 2.5
  - tensorflow-estimator: 2.9.0
  - tensorflow-gpu: 2.9.1
  - tensorflow-io-gcs-filesystem: 0.26.0
  - termcolor: 1.1.0
  - terminado: 0.15.0
  - terminaltables: 3.1.10
  - termstyle: 0.1.11
  - testpath: 0.6.0
  - testresources: 2.0.1
  - texttable: 1.6.4
  - theano: 1.0.5
  - theano-pymc: 1.1.2
  - threadpoolctl: 3.1.0
  - tifffile: 2022.5.4
  - timezonefinder: 6.0.0
  - tinycss2: 1.1.1
  - tokenizers: 0.12.1
  - toml: 0.10.2
  - tomli: 2.0.1
  - tomli-w: 1.0.0
  - tomlkit: 0.11.0
  - toolz: 0.11.2
  - toposort: 1.7
  - torch: 1.11.0+cu113
  - torch-cluster: 1.6.0
  - torch-fidelity: 0.3.0
  - torch-geometric: 2.0.4
  - torch-scatter: 2.0.9
  - torch-sparse: 0.6.13
  - torch-spline-conv: 1.2.1
  - torchaudio: 0.11.0+cu113
  - torchmetrics: 0.11.3
  - torchvision: 0.12.0+cu113
  - tornado: 6.1
  - tqdm: 4.64.0
  - traitlets: 5.2.1.post0
  - traits: 6.3.2
  - traitsui: 7.3.1
  - transaction: 3.0.1
  - transformers: 4.19.2
  - trimesh: 3.12.5
  - twine: 4.0.1
  - typeguard: 2.13.3
  - typer: 0.7.0
  - typeshed-client: 2.2.0
  - typing-extensions: 4.2.0
  - urllib3: 1.26.9
  - utm: 0.7.0
  - velocyto: 0.17.17
  - vine: 5.0.0
  - vispy: 0.11.0
  - vtk: 9.2.6
  - waitress: 2.1.1
  - wandb: 0.12.17
  - wcwidth: 0.2.5
  - webargs: 8.2.0
  - webencodings: 0.5.1
  - webob: 1.8.7
  - websocket: 0.2.1
  - websocket-client: 1.3.2
  - websockets: 10.4
  - webtest: 3.0.0
  - werkzeug: 2.1.2
  - wget: 3.2
  - wheel: 0.37.1
  - widgetsnbextension: 3.6.0
  - wntr: 0.4.1
  - wrapt: 1.14.1
  - wsgiproxy2: 0.5.1
  - wsme: 0.11.0
  - xarray: 2022.3.0
  - xarray-einstats: 0.2.2
  - xlrd: 2.0.1
  - xlsxwriter: 3.0.3
  - xlwt: 1.3.0
  - xmlrunner: 1.7.7
  - xopen: 1.5.0
  - xxhash: 3.0.0
  - xyzservices: 2022.4.0
  - yacs: 0.1.8
  - yappi: 1.3.5
  - yarl: 1.7.2
  - yaspin: 2.1.0
  - yte: 1.4.0
  - z3c.pt: 3.3.1
  - zc.lockfile: 2.0
  - zconfig: 3.6.0
  - zexceptions: 4.2
  - zict: 2.2.0
  - zipp: 3.8.0
  - zodb: 5.7.0
  - zodbpickle: 2.3
  - zope: 5.5.1
  - zope.annotation: 4.7.0
  - zope.browser: 2.4
  - zope.browsermenu: 4.4
  - zope.browserpage: 4.4.0
  - zope.browserresource: 4.4
  - zope.cachedescriptors: 4.3.1
  - zope.component: 5.0.1
  - zope.configuration: 4.4.1
  - zope.container: 4.5.0
  - zope.contentprovider: 4.2.1
  - zope.contenttype: 4.5.0
  - zope.datetime: 4.3.0
  - zope.deferredimport: 4.4
  - zope.deprecation: 4.4.0
  - zope.dottedname: 4.3
  - zope.event: 4.5.0
  - zope.exceptions: 4.5
  - zope.filerepresentation: 5.0.0
  - zope.globalrequest: 1.5
  - zope.hookable: 5.1.0
  - zope.i18n: 4.9.0
  - zope.i18nmessageid: 5.0.1
  - zope.interface: 5.4.0
  - zope.lifecycleevent: 4.4
  - zope.location: 4.2
  - zope.pagetemplate: 4.6.0
  - zope.processlifetime: 2.3.0
  - zope.proxy: 4.5.0
  - zope.ptresource: 4.3.0
  - zope.publisher: 6.1.0
  - zope.schema: 6.2.0
  - zope.security: 5.3
  - zope.sequencesort: 4.2
  - zope.site: 4.5.0
  - zope.size: 4.3
  - zope.structuredtext: 4.4
  - zope.tal: 4.5
  - zope.tales: 5.1
  - zope.testbrowser: 5.6.1
  - zope.testing: 4.10
  - zope.traversing: 4.4.1
  - zope.viewlet: 4.3
  - zstandard: 0.17.0
* System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor:
  - python: 3.10.4
  - version: #1 SMP Tue Nov 8 15:48:59 UTC 2022
```

More info

No response

cc @carmocca @mauvilsa

awaelchli commented 1 year ago

Hey, I think the problem is that these keys in the config.yaml are not allowed:

seed_everything_default: null
log_dir: /cluster/dir/to/log

They don't match any top-level option of the CLI or anything in the Trainer.

Perhaps it should be

seed_everything: false
trainer:
    default_root_dir:  "/cluster/dir/to/log"
    ...
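
For anyone hitting the same "No action for destination key" error, the check that fails can be approximated with a small sketch. This is not Lightning's own code: the helper `unknown_top_level_keys` is hypothetical, and the set of allowed keys is an assumption about what the `fit` subcommand accepts in this setup. Running `python cli_test.py fit --print_config` should show the actual keys.

```python
# ASSUMPTION: approximate top-level keys accepted by the `fit` subcommand here.
# The authoritative list comes from `python cli_test.py fit --print_config`.
ASSUMED_FIT_KEYS = {
    "seed_everything",
    "trainer",
    "model",
    "data",
    "ckpt_path",
    "optimizer",
    "lr_scheduler",
}


def unknown_top_level_keys(config, allowed=ASSUMED_FIT_KEYS):
    """Return the top-level keys of `config` that are not in `allowed`, sorted.

    Hypothetical helper for illustration; the real parser does its own validation.
    """
    return sorted(key for key in config if key not in allowed)


# Top-level layout of the reported config.yaml, written as a dict so the
# sketch needs no YAML parser (trainer section abbreviated).
reported_config = {
    "seed_everything_default": None,
    "trainer": {"accelerator": "gpu", "devices": 1, "max_epochs": 500},
    "ckpt_path": None,
    "log_dir": "/cluster/dir/to/log",
}

print(unknown_top_level_keys(reported_config))
# prints ['log_dir', 'seed_everything_default']
```

The two keys it reports are the same two flagged above.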

awaelchli commented 1 year ago

Hi,

I tried to help here. Did you find out what the problem was? Please let me know.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

leopold-franz commented 1 year ago

Yes, sorry, I forgot to answer. I somehow messed up a lot of the key settings, so you were right. Thank you for your help.

awaelchli commented 1 year ago

Thanks for confirming that it worked. Happy this was helpful.