Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

lightning version is the SLURM job number when run on a node provisioned by SLURM #17620

Open finnoshea opened 1 year ago

finnoshea commented 1 year ago

### Bug description

When I use Lightning on my personal computer, the logs are named with an ascending version number. When I use Lightning on a cluster where access is provisioned via SLURM, the version number is instead the id of the SLURM job:

```
Epoch 9: 100%|█| 166/166 [01:25<00:00,  1.95it/s, loss=0.944, v_num=8483572, train_loss_step=0.925, val_loss=0.896...
```

In the line above, `v_num=8483572` is the SLURM job id. The logs are also saved in a directory named `version_8483572`.

I don't know if this is working as intended.
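
For what it's worth, here is a minimal workaround sketch that I have not verified on the cluster. It assumes the job-id versioning only applies to the Trainer's default logger, so constructing a `TensorBoardLogger` explicitly should fall back to scanning the save directory for the next `version_N` (the directory and experiment names below are just placeholders):

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger

# Hypothetical workaround sketch: build the logger explicitly so its version is
# resolved by scanning save_dir for the next version_N folder, rather than
# whatever version the Trainer's default logger picks up under SLURM.
logger = TensorBoardLogger(save_dir="lightning_logs", name="my_experiment")
trainer = pl.Trainer(accelerator="cpu", devices=1, max_epochs=10, logger=logger)
```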

### What version are you seeing the problem on?

v1.9

### How to reproduce the bug

The following script, adapted from the pytorch-forecasting examples, reproduces the possible bug on the cluster I work on:

```python
import sys

import pandas as pd
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping
from sklearn.preprocessing import scale

from pytorch_forecasting import NBeats, TimeSeriesDataSet
from pytorch_forecasting.data import NaNLabelEncoder
from pytorch_forecasting.data.examples import generate_ar_data

sys.path.append("..")

print("load data")
data = generate_ar_data(seasonality=10.0, timesteps=400, n_series=100)
data["static"] = 2
data["date"] = pd.Timestamp("2020-01-01") + pd.to_timedelta(data.time_idx, "D")
# validation = data.series.sample(20)

max_encoder_length = 150
max_prediction_length = 20

training_cutoff = data["time_idx"].max() - max_prediction_length

context_length = max_encoder_length
prediction_length = max_prediction_length

training = TimeSeriesDataSet(
    data[lambda x: x.time_idx < training_cutoff],
    time_idx="time_idx",
    target="value",
    categorical_encoders={"series": NaNLabelEncoder().fit(data.series)},
    group_ids=["series"],
    min_encoder_length=context_length,
    max_encoder_length=context_length,
    max_prediction_length=prediction_length,
    min_prediction_length=prediction_length,
    time_varying_unknown_reals=["value"],
    randomize_length=None,
    add_relative_time_idx=False,
    add_target_scales=False,
)

validation = TimeSeriesDataSet.from_dataset(training, data,
                                            min_prediction_idx=training_cutoff)
batch_size = 128
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size,
                                          num_workers=5)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size,
                                          num_workers=5)

early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4,
                                    patience=10, verbose=False, mode="min")
trainer = pl.Trainer(
    max_epochs=10,
    accelerator='cpu',
    devices=1,
    gradient_clip_val=0.1,
    callbacks=[early_stop_callback],
    limit_train_batches=1.0,
    log_every_n_steps=1,
    # limit_val_batches=1,
    # fast_dev_run=True,
    # logger=logger,
    # profiler=True,
)

net = NBeats.from_dataset(
    training, learning_rate=3e-2, log_interval=10, log_val_interval=1, log_gradient_flow=False, weight_decay=1e-2
)
print(f"Number of parameters in network: {net.size()/1e3:.1f}k")

# # find optimal learning rate
# # remove logging and artificial epoch size
# net.hparams.log_interval = -1
# net.hparams.log_val_interval = -1
# trainer.limit_train_batches = 1.0
# # run learning rate finder
# res = trainer.tuner.lr_find(
#     net, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader, min_lr=1e-5, max_lr=1e2
# )
# print(f"suggested learning rate: {res.suggestion()}")
# fig = res.plot(show=True, suggest=True)
# fig.show()
# net.hparams.learning_rate = res.suggestion()

trainer.fit(
    net,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
)
```

### Error messages and logs

_No response_


### Environment

<details>
  <summary>Current environment</summary>

- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):

- PyTorch Lightning Version (e.g., 1.5.0):

- Lightning App Version (e.g., 0.5.2):

- PyTorch Version (e.g., 2.0):

- Python version (e.g., 3.9):

- OS (e.g., Linux):

- CUDA/cuDNN version:

- GPU models and configuration:

- How you installed Lightning(conda, pip, source):

- Running environment of LightningApp (e.g. local, cloud):



</details>

### More info

_No response_

cc @awaelchli
awaelchli commented 1 year ago

I remember very well that this has been the behavior since the very beginning of Lightning, and we've carried this special treatment for SLURM forward through many refactors, always keeping it for backward compatibility. So I think it is expected so far. We've never heard complaints from SLURM users about it, so I assume it is not a big issue. I wouldn't be opposed to changing it, but we should only do it if there is real demand, since it would be a breaking change (IMO).

I haven't used SLURM in some time, but I could imagine that the regular versioning would be problematic when users launch a batch of jobs in SLURM: the version folders could collide unless users name them manually. The job id provides a unique name for that use case, so the version folders don't collide. I agree it may seem arbitrary, so I understand the concern.
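
For illustration, here is a rough sketch of how a user could opt into that uniqueness explicitly with any logger, which is roughly what the default behavior gives for free. It assumes `SLURM_JOB_ID` is set in the job's environment:

```python
import os

from pytorch_lightning.loggers import TensorBoardLogger

# Sketch only: read the job id from SLURM's standard environment variable and
# use it as the logger version, so version folders stay unique across a batch
# of jobs submitted at the same time.
job_id = os.environ.get("SLURM_JOB_ID")  # None when not running under SLURM
logger = TensorBoardLogger(save_dir="lightning_logs", version=job_id)
```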

awaelchli commented 1 year ago

I'm pinging a couple of users who have recently posted issues on our GitHub about SLURM: @shethdhvani @Queuecumber @ipoletaev @ChristophReich1996 @Wildcarde, in case you've used the TensorBoardLogger versioning in SLURM and want to share your opinion on this behavior.

Queuecumber commented 1 year ago

I actually haven't noticed this, maybe because I use the wandb logger (I didn't read the issue fully, so I don't know whether the code in question is specific to TensorBoard).

That said, I would love it if other loggers worked that way; I had to put in my own code to do it, because it allows me to resume my previous runs when my job gets preempted and requeued.

My guess is that this is where the behavior came from, by the way.
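
(Roughly the kind of glue I mean, sketched from memory rather than copied from my code; it assumes wandb is installed and that a requeued job keeps the same `SLURM_JOB_ID`:)

```python
import os

import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

# Sketch: reuse the SLURM job id as the wandb run id so a requeued job resumes
# logging into the same run instead of starting a new one. "resume" is
# forwarded to wandb.init via the logger's keyword arguments.
run_id = os.environ.get("SLURM_JOB_ID")
logger = WandbLogger(project="my-project", id=run_id, resume="allow")
trainer = pl.Trainer(logger=logger)
```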

awaelchli commented 1 year ago

Yes, wandb saves differently and always creates unique runs, even if the user-defined run name is not unique, so versioning is handled differently there.

Right, you're actually making a good point. We also have the auto-resubmit feature, and if SLURM resumes a job, the job id stays the same (right?).

> I had to put in my own code to do it because it allows me to resume my previous runs

On that note, I think we should pick up an old feature request for storing the experiment id in the checkpoint, so that when resuming from a checkpoint, the logger could resume into the same experiment (https://github.com/Lightning-AI/lightning/issues/5342).
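
Until then, a minimal user-land sketch of the idea (a hypothetical callback, not an existing Lightning API) that stashes the logger version in every checkpoint:

```python
import pytorch_lightning as pl


class StoreExperimentVersion(pl.Callback):
    """Hypothetical sketch: write the logger's version into each checkpoint."""

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        # Extra keys added to the checkpoint dict are persisted alongside the
        # model state.
        if trainer.logger is not None:
            checkpoint["logger_version"] = trainer.logger.version

    def on_load_checkpoint(self, trainer, pl_module, checkpoint):
        # The logger is already constructed by the time a checkpoint is loaded,
        # so wiring the stored version back into it is left to the user; here
        # we only surface the value.
        version = checkpoint.get("logger_version")
        if version is not None:
            print(f"checkpoint was written under logger version {version}")
```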

finnoshea commented 1 year ago

It seems like I have an unusual use pattern for SLURM jobs. That is cool. I have been meaning to re-learn TensorBoard anyway.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!