Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

PermissionError with ModelCheckpoints #19397

Open aaprasad opened 5 months ago

aaprasad commented 5 months ago

Hi, I'm trying to train a model and am getting this error:

Traceback (most recent call last):
  File "/home/jovyan/talmolab-smb/aadi/biogtr_expts/run/animal/SLAP_M74/single_run.py", line 91, in <module>
    main(cfg.cfg)
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/hydra/main.py", line 83, in decorated_main
    return task_function(cfg_passthrough)
  File "/home/jovyan/talmolab-smb/aadi/biogtr_expts/src/biogtr/biogtr/training/train.py", line 101, in main
    trainer.fit(model, dataset)
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 203, in run
    self.on_advance_end()
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 374, in on_advance_end
    call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=True)
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 208, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 314, in on_train_epoch_end
    self._save_last_checkpoint(trainer, monitor_candidates)
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 679, in _save_last_checkpoint
    self._link_checkpoint(trainer, self._last_checkpoint_saved, filepath)
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 397, in _link_checkpoint
    shutil.copy(filepath, linkpath)
  File "/opt/conda/envs/biogtr/lib/python3.9/shutil.py", line 428, in copy
    copymode(src, dst, follow_symlinks=follow_symlinks)
  File "/opt/conda/envs/biogtr/lib/python3.9/shutil.py", line 317, in copymode
    chmod_func(dst, stat.S_IMODE(st.st_mode))
PermissionError: [Errno 1] Operation not permitted: '/home/jovyan/talmolab-smb/aadi/biogtr_expts/run/animal/SLAP_M74/models/tests/test_chkpt/epoch=1-best-val_num_switches=36.0.ckpt'
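For what it's worth, the last two frames show the failure is in the permission-copying step rather than the data copy: shutil.copy is shutil.copyfile plus shutil.copymode, and copymode calls os.chmod on the destination. A minimal sketch of that behavior, with hypothetical paths on the same SMB mount as the checkpoints:

import shutil

# Hypothetical paths on the SMB mount where the checkpoints live.
src = "/home/jovyan/talmolab-smb/scratch/a.ckpt"
dst = "/home/jovyan/talmolab-smb/scratch/b.ckpt"

# shutil.copy = copyfile (file contents) + copymode (os.chmod on dst).
# Filesystems that reject chmod, as many SMB/CIFS mounts do, raise EPERM
# here even though plain reads and writes succeed.
shutil.copy(src, dst)      # can fail with PermissionError: [Errno 1]

# copyfile copies only the contents and skips the chmod step.
shutil.copyfile(src, dst)  # succeeds on the same mount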

This is how I set up my checkpointing:

# imports used by this snippet
from pathlib import Path

import pytorch_lightning as pl
from omegaconf import OmegaConf

def get_checkpointing(self) -> list[pl.callbacks.ModelCheckpoint]:
    """Getter for lightning checkpointing callbacks.

    Returns:
        A list of lightning checkpointing callbacks, one per monitored metric.
    """
    # convert to dict to enable extracting/removing params
    checkpoint_params = OmegaConf.to_container(self.cfg.checkpointing, resolve=True)
    logging_params = self.cfg.logging
    if checkpoint_params.get("dirpath") is None:
        if "group" in logging_params:
            dirpath = f"./models/{logging_params.group}/{logging_params.name}"
        else:
            dirpath = f"./models/{logging_params.name}"
    else:
        dirpath = checkpoint_params["dirpath"]

    dirpath = Path(dirpath).resolve()
    if not dirpath.exists():
        try:
            dirpath.mkdir(parents=True, exist_ok=True)
        except OSError as e:
            print(
                f"Cannot create a new folder. Check the permissions to the given checkpoint directory.\n{e}"
            )

    # remove params handled explicitly before forwarding the rest
    checkpoint_params.pop("dirpath", None)
    monitor = checkpoint_params.pop("monitor")
    checkpointers = []
    for metric in monitor:
        checkpointer = pl.callbacks.ModelCheckpoint(
            monitor=metric,
            dirpath=dirpath,
            filename=f"{{epoch}}-{{{metric}}}",
            **checkpoint_params,
        )
        checkpointer.CHECKPOINT_NAME_LAST = f"{{epoch}}-best-{{{metric}}}"
        checkpointers.append(checkpointer)
    return checkpointers

It's quite strange because this error never happened before.
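A possible workaround sketch, assuming the shutil.copy fallback inside _link_checkpoint is the only failing path (its 2.1.x call signature is visible in the traceback): subclass ModelCheckpoint so the fallback copies file contents without re-applying permission bits. This overrides a private Lightning internal, so treat it as an untested sketch rather than a supported fix:

import os
import shutil

import pytorch_lightning as pl


class ChmodSafeModelCheckpoint(pl.callbacks.ModelCheckpoint):
    """Sketch: swap the shutil.copy fallback in _link_checkpoint for
    shutil.copyfile so no chmod is attempted on the destination."""

    @staticmethod
    def _link_checkpoint(trainer, filepath: str, linkpath: str) -> None:
        if trainer.is_global_zero:
            if os.path.islink(linkpath) or os.path.isfile(linkpath):
                os.remove(linkpath)
            try:
                os.symlink(filepath, linkpath)
            except OSError:
                # shutil.copy would also run copymode -> os.chmod, which is
                # what fails on this mount; copyfile copies the data only
                shutil.copyfile(filepath, linkpath)
        trainer.strategy.barrier()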

Originally posted by @aaprasad in https://github.com/Lightning-AI/pytorch-lightning/discussions/19396

cc @carmocca @awaelchli

awaelchli commented 5 months ago

@aaprasad Does this error happen with the latest version of Lightning (2.1.4)?

aaprasad commented 5 months ago

Hi, yes it does. I'm currently using:

cuda-cudart               12.1.105                      0    nvidia
cuda-cupti                12.1.105                      0    nvidia
cuda-libraries            12.1.0                        0    nvidia
cuda-nvrtc                12.1.105                      0    nvidia
cuda-nvtx                 12.1.105                      0    nvidia
cuda-opencl               12.3.101                      0    nvidia
cuda-runtime              12.1.0                        0    nvidia
cudatoolkit               11.1.74              h6bb024c_0    nvidia
cudnn                     8.0.4                cuda11.1_0    nvidia
filelock                  3.13.1             pyhd8ed1ab_0    conda-forge
fsspec                    2024.2.0           pyhca7485f_0    conda-forge
hydra-core                1.3.2                    pypi_0    pypi
libcublas                 12.1.0.26                     0    nvidia
libcufft                  11.0.2.4                      0    nvidia
libcufile                 1.8.1.2                       0    nvidia
libcurand                 10.3.4.107                    0    nvidia
libcusolver               11.4.4.55                     0    nvidia
libcusparse               12.0.2.55                     0    nvidia
libnpp                    12.0.2.50                     0    nvidia
libnvjitlink              12.1.105                      0    nvidia
libnvjpeg                 12.1.1.14                     0    nvidia
lightning                 2.1.4              pyhd8ed1ab_0    conda-forge
lightning-utilities       0.10.1             pyhd8ed1ab_0    conda-forge
python                    3.9.18          h0755675_1_cpython    conda-forge
pytorch                   2.2.0           py3.9_cuda12.1_cudnn8.9.2_0    pytorch
pytorch-cuda              12.1                 ha16c6d3_5    pytorch
pytorch-lightning         2.1.3              pyhd8ed1ab_0    conda-forge
pytorch-mutex             1.0                        cuda    pytorch
torchmetrics              1.2.1              pyhd8ed1ab_0    conda-forge
torchtriton               2.2.0                      py39    pytorch
torchvision               0.17.0               py39_cu121    pytorch

This is in addition to a bunch of other misc packages (let me know if you need any other versions).

For some more context, these are the config params I'm using for checkpointing:

checkpointing:
  monitor: ['val_num_switches', 'val_loss']
  verbose: True
  save_last: True
  dirpath: '/home/jovyan/talmolab-smb/aadi/biogtr_expts/run/animal/SLAP_M74/models/burnt_pancake'
  auto_insert_metric_name: True
  every_n_epochs: 10
  save_top_k: -1
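One more data point for triage: save_last: True is what routes through _save_last_checkpoint into the _link_checkpoint copy fallback shown in the traceback, so, assuming that is the only failing path, disabling it should sidestep the error (at the cost of not writing a last checkpoint):

checkpointing:
  # other params unchanged
  save_last: False  # skips _save_last_checkpoint and its copy fallback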

I also checked the checkpoint directory permissions with ls -lah and everything looks fine:

drwxrwxrwx 2 root root    0 Feb  5 18:36  ..
-rwxrwxrwx 1 root root 133M Feb  5 19:16 'epoch=0-best-val_loss=17.41611099243164.ckpt'
-rwxrwxrwx 1 root root 133M Feb  5 18:37 'epoch=0-best-val_loss=50.88465881347656.ckpt'
-rwxrwxrwx 1 root root 133M Feb  5 18:37 'epoch=0-best-val_num_switches=44.0.ckpt'
-rwxrwxrwx 1 root root 133M Feb  5 19:16 'epoch=0-best-val_num_switches=49.0.ckpt'
-rwxrwxrwx 1 root root 133M Feb  5 19:17 'epoch=1-best-val_num_switches=50.0.ckpt'
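One thing worth noting in that listing, though: the mode bits are wide open, but the files are owned by root, and the failing call per the traceback is os.chmod, which requires owning the file (or CAP_FOWNER) regardless of mode bits; many SMB/CIFS mounts also refuse chmod outright. A minimal sketch of the failing call, with a hypothetical path:

import os
import stat

# Hypothetical checkpoint path standing in for one of the files above.
path = "epoch=1-best-val_num_switches=50.0.ckpt"

# This mirrors the final frame of the traceback: shutil.copymode re-applies
# the source's mode bits to the destination via os.chmod. chmod needs file
# ownership (or CAP_FOWNER), so a non-root process gets EPERM on these
# root-owned files even though rwxrwxrwx allows reading and writing them.
st = os.stat(path)
os.chmod(path, stat.S_IMODE(st.st_mode))  # may raise PermissionError: [Errno 1]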
awaelchli commented 4 months ago

@aaprasad Is this still an issue with Lightning 2.2?