Open aaprasad opened 5 months ago
@aaprasad Does this error happen with the latest version of Lightning (2.1.4)?
Hi, yes it does. I'm currently using:
cuda-cudart 12.1.105 0 nvidia
cuda-cupti 12.1.105 0 nvidia
cuda-libraries 12.1.0 0 nvidia
cuda-nvrtc 12.1.105 0 nvidia
cuda-nvtx 12.1.105 0 nvidia
cuda-opencl 12.3.101 0 nvidia
cuda-runtime 12.1.0 0 nvidia
cudatoolkit 11.1.74 h6bb024c_0 nvidia
cudnn 8.0.4 cuda11.1_0 nvidia
filelock 3.13.1 pyhd8ed1ab_0 conda-forge
fsspec 2024.2.0 pyhca7485f_0 conda-forge
hydra-core 1.3.2 pypi_0 pypi
libcublas 12.1.0.26 0 nvidia
libcufft 11.0.2.4 0 nvidia
libcufile 1.8.1.2 0 nvidia
libcurand 10.3.4.107 0 nvidia
libcusolver 11.4.4.55 0 nvidia
libcusparse 12.0.2.55 0 nvidia
libnpp 12.0.2.50 0 nvidia
libnvjitlink 12.1.105 0 nvidia
libnvjpeg 12.1.1.14 0 nvidia
lightning 2.1.4 pyhd8ed1ab_0 conda-forge
lightning-utilities 0.10.1 pyhd8ed1ab_0 conda-forge
python 3.9.18 h0755675_1_cpython conda-forge
pytorch 2.2.0 py3.9_cuda12.1_cudnn8.9.2_0 pytorch
pytorch-cuda 12.1 ha16c6d3_5 pytorch
pytorch-lightning 2.1.3 pyhd8ed1ab_0 conda-forge
pytorch-mutex 1.0 cuda pytorch
torchmetrics 1.2.1 pyhd8ed1ab_0 conda-forge
torchtriton 2.2.0 py39 pytorch
torchvision 0.17.0 py39_cu121 pytorch
In addition to a bunch of other misc packages (lmk if you need any other versions)
For some more context, these are the config params I'm using for checkpointing:
checkpointing:
  monitor: ['val_num_switches', 'val_loss']
  verbose: True
  save_last: True
  dirpath: '/home/jovyan/talmolab-smb/aadi/biogtr_expts/run/animal/SLAP_M74/models/burnt_pancake'
  auto_insert_metric_name: True
  every_n_epochs: 10
  save_top_k: -1
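For anyone trying to reproduce this: since `monitor` is a list and `ModelCheckpoint` takes a single metric name, the config presumably gets expanded into one callback per metric. Below is a minimal sketch of that expansion, not the actual setup code. The helper `checkpoint_kwargs` is hypothetical, and the `{epoch}-best-{metric}` filename template is an assumption inferred from the checkpoint names in the directory listing.

```python
from typing import Any, Dict, List

# Values copied from the config above.
cfg: Dict[str, Any] = {
    "monitor": ["val_num_switches", "val_loss"],
    "verbose": True,
    "save_last": True,
    "dirpath": "/home/jovyan/talmolab-smb/aadi/biogtr_expts/run/animal/SLAP_M74/models/burnt_pancake",
    "auto_insert_metric_name": True,
    "every_n_epochs": 10,
    "save_top_k": -1,
}


def checkpoint_kwargs(cfg: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Hypothetical helper: one ModelCheckpoint kwargs dict per monitored metric."""
    shared = {k: v for k, v in cfg.items() if k != "monitor"}
    monitors = cfg["monitor"]
    if isinstance(monitors, str):
        monitors = [monitors]
    # Assumed filename template, e.g. "{epoch}-best-{val_loss}".
    return [
        {**shared, "monitor": m, "filename": f"{{epoch}}-best-{{{m}}}"}
        for m in monitors
    ]


try:
    from lightning.pytorch.callbacks import ModelCheckpoint

    # Each callback tracks exactly one metric; all of them share the same dirpath.
    callbacks = [ModelCheckpoint(**kw) for kw in checkpoint_kwargs(cfg)]
except ImportError:
    callbacks = []  # Lightning not installed; the kwargs can still be inspected.
```

The resulting list would be passed as `Trainer(callbacks=callbacks)`. Note that in this sketch every callback has `save_last: True` and writes into the same directory.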
I also checked the checkpoint directory permissions with ls -lah, and everything looks fine:
drwxrwxrwx 2 root root 0 Feb 5 18:36 ..
-rwxrwxrwx 1 root root 133M Feb 5 19:16 'epoch=0-best-val_loss=17.41611099243164.ckpt'
-rwxrwxrwx 1 root root 133M Feb 5 18:37 'epoch=0-best-val_loss=50.88465881347656.ckpt'
-rwxrwxrwx 1 root root 133M Feb 5 18:37 'epoch=0-best-val_num_switches=44.0.ckpt'
-rwxrwxrwx 1 root root 133M Feb 5 19:16 'epoch=0-best-val_num_switches=49.0.ckpt'
-rwxrwxrwx 1 root root 133M Feb 5 19:17 'epoch=1-best-val_num_switches=50.0.ckpt'
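The listing only shows mode bits, which does not prove that a write from the training process actually succeeds. A minimal sketch of a write probe that could be run against the checkpoint directory as an extra check is below. `can_write` is a hypothetical helper, not part of Lightning, and whether a failed write is related to the error here is not established.

```python
import os
import tempfile


def can_write(dirpath: str) -> bool:
    """Return True if a small file can be created, flushed, and removed in dirpath."""
    try:
        # The temporary file is deleted automatically when the context exits.
        with tempfile.NamedTemporaryFile(dir=dirpath, suffix=".probe") as fh:
            fh.write(b"ok")
            fh.flush()
            os.fsync(fh.fileno())  # force the write through to the filesystem
        return True
    except OSError:
        return False
```

Calling `can_write(dirpath)` with the `dirpath` from the config, from the same environment and user that runs training, would exercise the same filesystem the checkpoints are written to.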
@aaprasad Is this still an issue with Lightning 2.2?
Hi, I'm trying to train a model and am getting this error:
This is how I set up my checkpoints:
It's quite strange because this error never happened before.
Originally posted by @aaprasad in https://github.com/Lightning-AI/pytorch-lightning/discussions/19396
cc @carmocca @awaelchli