Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Launch ddp on 8 devices, but only run on the first gpu #16236

Closed superhero-7 closed 1 year ago

superhero-7 commented 1 year ago

Bug description

I train the model like this; my code is below:

trainer_kwargs["accelerator"] = 'gpu'
trainer_kwargs["devices"] = 8
trainer_kwargs["strategy"] = "ddp"
trainer = Trainer.from_argparse_args(trainer_config,**trainer_kwargs)
trainer.fit(model, data)

It runs fine and doesn't throw any error. But it doesn't run on 8 GPUs; instead, it only runs on the first GPU, and it only initializes one MEMBER, like this: (screenshot: 1672801542612)

I am so confused, because the progress bar is totally right. The length of my dataset is 1198099, and the progress bar shows 37457 steps per epoch. I set the batch size to 4, so 4 * 8 * 37457 is almost equal to 1198099. (image)

But the problem is that nvidia-smi shows only the first GPU running, as below: (image)

I don't understand why this happens. I hope someone can help me, thanks a lot!
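The arithmetic in the report can be checked with a short sketch. It assumes the progress bar counts one step per batch per process and that the dataset is split evenly across devices; the variable names are illustrative and not taken from the original post.

```python
import math

dataset_len = 1198099   # dataset length reported above
batch_size = 4          # per-device batch size reported above
num_devices = 8         # devices requested from the Trainer
steps_shown = 37457     # steps per epoch shown in the progress bar

# Samples covered in one epoch if all 8 processes take part.
samples_covered = batch_size * num_devices * steps_shown

# Steps per epoch expected with 8-way sharding versus a single process.
steps_if_sharded = math.ceil(dataset_len / (batch_size * num_devices))
steps_if_single = math.ceil(dataset_len / batch_size)

print(samples_covered, steps_if_sharded, steps_if_single)
```

The step count shown is far closer to the sharded figure than to the single-process one, which matches the observation that the progress bar looks right even though only one GPU appears busy.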

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment

```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0): I tried the latest and 1.7.3 and got the same problem
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10): 1.12.1 cuda 11.3
#- Python version (e.g., 3.9): 3.8.5
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration: A100*8
#- How you installed Lightning(`conda`, `pip`, source): pip install pytorch_lightning==1.7.3
#- Running environment of LightningApp (e.g. local, cloud):
```

More info

No response

cc @justusschock @awaelchli

dolortaste commented 1 year ago

got the same problem

superhero-7 commented 1 year ago

> got the same problem

try

unset KUBERNETES_PORT

It works for me... I spent one night and one morning on it... T_T There is a similar issue here: https://github.com/Lightning-AI/lightning/issues/5254
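The same workaround can also be applied from inside the training script instead of the shell. This is a minimal sketch, assuming the variable only needs to be absent before the Trainer is created; `drop_env_var` is a hypothetical helper, and why removing the variable helps is not established in this thread.

```python
import os


def drop_env_var(name, environ=None):
    """Remove `name` from the environment mapping and return its old value (or None)."""
    env = os.environ if environ is None else environ
    return env.pop(name, None)


# Call this before constructing the Trainer, mirroring `unset KUBERNETES_PORT`.
previous = drop_env_var("KUBERNETES_PORT")
print("KUBERNETES_PORT was:", previous)
```

Removing the variable this way affects only the current Python process and any children it launches; the shell that started the script is unchanged.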

dolortaste commented 1 year ago

> unset KUBERNETES_PORT

Solved.. Thx

awaelchli commented 1 year ago

@superhero-7 Unfortunately I don't know how the KUBERNETES_PORT relates to this problem here, or even how it solved it. Does that mean this issue is closed, or are there still some open questions?

superhero-7 commented 1 year ago

> @superhero-7 Unfortunately I don't know how the KUBERNETES_PORT relates to this problem here, or even how it solved it. Does that mean this issue is closed, or are there still some open questions?

Our machines are managed by k8s. I suppose there may be some conflict over the GLOBAL RANK environment variables between the k8s settings and the pytorch_lightning DDP settings?

magehrig commented 1 year ago

I got the same issue but on a SLURM cluster. I have access to two SLURM clusters. Interestingly, on one cluster PL DDP works fine but on the second one, I experience this issue. Since I don't use K8s, unset KUBERNETES_PORT does not solve the issue.

I guess it would be really hard to reproduce this. Any pointers to what I could try?

awaelchli commented 1 year ago

You could try printing os.environ at the beginning of the script and comparing it between the two nodes. See if any env variables are set that shouldn't be, or if any are missing. You could also post the printout here if you like (but redact any sensitive information) so we can take a look.

Since you are using SLURM, make sure to follow exactly the instructions here.
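The comparison suggested above could be sketched as follows. It assumes each machine writes its environment to a JSON file that is then copied to one place; `dump_env` and `diff_env` are hypothetical helper names, and the toy dictionaries stand in for the real dumps.

```python
import json
import os


def dump_env(path, environ=None):
    """Write the environment to a JSON file so it can be copied off the machine."""
    env = dict(os.environ if environ is None else environ)
    with open(path, "w") as f:
        json.dump(env, f, indent=1, sort_keys=True)


def diff_env(env_a, env_b):
    """Return keys only in A, keys only in B, and shared keys whose values differ."""
    only_a = sorted(set(env_a) - set(env_b))
    only_b = sorted(set(env_b) - set(env_a))
    changed = sorted(k for k in set(env_a) & set(env_b) if env_a[k] != env_b[k])
    return only_a, only_b, changed


# Toy example; in practice load the two JSON files with json.load().
env_ok = {"SLURM_NTASKS": "2", "LANG": "C.UTF-8", "PMI_RANK": "1"}
env_bad = {"SLURM_NTASKS": "2", "LANG": "en_US.UTF-8", "CC": "/usr/bin/gcc"}
print(diff_env(env_ok, env_bad))
```

Values such as paths, hostnames, and proxy settings will differ between any two machines, so the keys that are present on only one side are usually the more useful place to start.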

magehrig commented 1 year ago

@awaelchli Great idea! I believe I followed the instructions correctly. Since I use two different (SLURM) clusters, they have slightly different sbatch scripts, but the rest is the same.

For this test, I use two GPUs on a single node.

First sbatch script for the server on which there are no issues:

#!/usr/bin/env bash
#SBATCH --parsable
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=16G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:2
#SBATCH --output=/some/path/%j.out

module load nccl
source /some/path/conda.sh
conda activate myenv
srun python myscript.py ...
conda deactivate

Second sbatch script for the server where I observe the described issue:

#!/usr/bin/env bash
#SBATCH --parsable
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=16G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=3
#SBATCH --gpus=rtx_3090:2
#SBATCH --output=/some/path/%j.out

module load nccl
source /some/path/conda.sh
conda activate myenv
srun python myscript.py ...
conda deactivate
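To see exactly where the two sbatch scripts diverge, the `#SBATCH` directives can be compared mechanically. This is a small sketch with a hypothetical `sbatch_options` helper; it only handles the `--key=value` form used in the scripts above, and the two strings are abbreviated copies of those scripts.

```python
def sbatch_options(script_text):
    """Collect `#SBATCH --key=value` directives into a dict (bare flags map to None)."""
    options = {}
    for line in script_text.splitlines():
        line = line.strip()
        if not line.startswith("#SBATCH"):
            continue
        directive = line[len("#SBATCH"):].strip()
        key, sep, value = directive.partition("=")
        options[key] = value if sep else None
    return options


working = """#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:2
"""
failing = """#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=3
#SBATCH --gpus=rtx_3090:2
"""

a, b = sbatch_options(working), sbatch_options(failing)
differing = [k for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)]
for key in differing:
    print(key, a.get(key), b.get(key))
```

The directives differ only in `--cpus-per-task` and in how the GPUs are requested (`--gres` versus `--gpus`). Whether either difference is related to the issue is not established here.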

Now, the os.environ output on the server where I observe no issues:

{'ACLOCAL_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/share/aclocal',
 'BASH_ENV': '/cluster/lmod-8.6.5/lmod/lmod/init/bash',
 'BASH_FUNC_ml%%': '() {  eval $($LMOD_DIR/ml_cmd "$@")\n}',
 'BASH_FUNC_module%%': '() {  eval $($LMOD_CMD bash "$@") && eval '
                       '$(${LMOD_SETTARG_CMD:-:} -s sh)\n'
                       '}',
 'CMAKE_PREFIX_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy',
 'CONDA_DEFAULT_ENV': 'rnn-st',
 'CONDA_EXE': '/data/user/programs/mambaforge/bin/conda',
 'CONDA_MKL_INTERFACE_LAYER_BACKUP': '',
 'CONDA_PREFIX': '/data/user/programs/mambaforge/envs/rnn-st',
 'CONDA_PROMPT_MODIFIER': '(rnn-st) ',
 'CONDA_PYTHON_EXE': '/data/user/programs/mambaforge/bin/python',
 'CONDA_SHLVL': '1',
 'CPATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/include',
 'CRC32C_SW_MODE': 'auto',
 'CUDA_DEVICE_ORDER': 'PCI_BUS_ID',
 'CUDA_HOME': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh',
 'CUDA_VISIBLE_DEVICES': '0,1',
 'DBUS_SESSION_BUS_ADDRESS': 'unix:path=/run/user/891944109/bus',
 'DISPLAY': 'u20-login-1:16.0',
 'ENVIRONMENT': 'BATCH',
 'GPU_DEVICE_ORDINAL': '0,1',
 'HOME': '/home/user',
 'HOSTNAME': 'u20-computeibmgpu-vesta7',
 'LANG': 'C.UTF-8',
 'LC_ADDRESS': 'de_CH.UTF-8',
 'LC_IDENTIFICATION': 'de_CH.UTF-8',
 'LC_MEASUREMENT': 'de_CH.UTF-8',
 'LC_MONETARY': 'de_CH.UTF-8',
 'LC_NAME': 'de_CH.UTF-8',
 'LC_NUMERIC': 'de_CH.UTF-8',
 'LC_PAPER': 'de_CH.UTF-8',
 'LC_TELEPHONE': 'de_CH.UTF-8',
 'LC_TIME': 'de_CH.UTF-8',
 'LD_LIBRARY_PATH': '/data/user/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/../../lib64:/cluster/munge-0.5.14/lib:/cluster/slurm-20-11-8-1/lib:/cluster/pmix-4.1.2/lib:/cluster/libevent-2.1.12/lib',
 'LESS': '-R',
 'LESSCLOSE': '/usr/bin/lesspipe %s %s',
 'LESSOPEN': '| /usr/bin/lesspipe %s',
 'LIBRARY_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/lib64:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/lib',
 'LMOD_CMD': '/cluster/lmod-8.6.5/lmod/lmod/libexec/lmod',
 'LMOD_COLORIZE': 'yes',
 'LMOD_DIR': '/cluster/lmod-8.6.5/lmod/lmod/libexec',
 'LMOD_FAMILY_GRES': 'v100',
 'LMOD_FAMILY_GRES_VERSION': 'false',
 'LMOD_FAMILY_RESOURCE': 'multigpu',
 'LMOD_FAMILY_RESOURCE_VERSION': 'false',
 'LMOD_FULL_SETTARG_SUPPORT': 'no',
 'LMOD_MODULERCFILE': '/apps/etc/modules/.modulerc.lua',
 'LMOD_PACKAGE_PATH': '/cluster/lmod-8.6.5',
 'LMOD_PKG': '/cluster/lmod-8.6.5/lmod/lmod',
 'LMOD_PREPEND_BLOCK': 'normal',
 'LMOD_ROOT': '/cluster/lmod-8.6.5/lmod',
 'LMOD_SETTARG_CMD': ':',
 'LMOD_SETTARG_FULL_SUPPORT': 'no',
 'LMOD_VERSION': '8.6.5',
 'LMOD_arch': 'x86_64',
 'LMOD_sys': 'Linux',
 'LOADEDMODULES': 'v100:multigpu:libiconv/1.16-pdflaob:xz/5.2.5-mhrz5su:zlib/1.2.12-j4b6zeg:libxml2/2.9.12-koohqap:cuda/11.4.4-ldlywt5:libnl/3.3.0-qtnpjoa:rdma-core/41.0-hquyri7:nccl/2.11.4-1',
 'LOGNAME': 'user',
 'LSCOLORS': 'Gxfxcxdxbxegedabagacad',
 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:',
 'MANPATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/share/man:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/share/man:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/share/man:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/share/man:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/share/man:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/share/man:/cluster/lmod-8.6.5/lmod/lmod/share/man::/var/cfengine/share/man',
 'MKL_INTERFACE_LAYER': 'LP64,GNU',
 'MKL_NUM_THREADS': '1',
 'MODULEPATH': '/apps/etc/modules/multigpu:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core:/apps/etc/modules/system:/apps/etc/modules/containers:/apps/etc/modules/default:/apps/etc/modules/flavors',
 'MODULEPATH_ROOT': '/apps/etc/modules',
 'MODULESHOME': '/cluster/lmod-8.6.5/lmod/lmod',
 'MOTD_SHOWN': 'pam',
 'NUMEXPR_NUM_THREADS': '1',
 'OLDPWD': '/home/user',
 'OMP_NUM_THREADS': '1',
 'OPENBLAS_NUM_THREADS': '1',
 'OPENCV_OPENCL_RUNTIME': 'disabled',
 'PAGER': 'less',
 'PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/bin:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/bin:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/bin:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/bin:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/bin:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/bin:/data/user/programs/mambaforge/envs/rnn-st/bin:/data/user/programs/mambaforge/condabin:/cluster/slurm-20-11-8-1/bin:/cluster/slurm-20-11-8-1/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/var/cfengine/bin:/usr/local/go/bin',
 'PKG_CONFIG_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/lib/pkgconfig:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/lib/pkgconfig:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/lib/pkgconfig:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/lib/pkgconfig:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/lib/pkgconfig:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/lib/pkgconfig',
 'PMI_FD': '13',
 'PMI_JOBID': '72348.0',
 'PMI_RANK': '1',
 'PMI_SIZE': '2',
 'PWD': '/data/user/code/rnn-st/scripts/slurm',
 'PYTORCH_NVML_BASED_CUDA_CHECK': '1',
 'QT_QPA_FONTDIR': '/data/user/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/qt/fonts',
 'QT_QPA_PLATFORM_PLUGIN_PATH': '/data/user/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/qt/plugins',
 'ROCR_VISIBLE_DEVICES': '0,1',
 'SACCT_FORMAT': 'jobid%-6,jobname,maxrss,maxvmsize,alloccpus,elapsed%12,state,exitcode%6',
 'SALLOC_CONSTRAINT': 'MULTIGPU',
 'SBATCH_CONSTRAINT': 'MULTIGPU',
 'SHELL': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zsh-5.9-cghjsxkx626zdkbcilxi3tk3nshivvo6/bin/zsh',
 'SHLVL': '3',
 'SLURMD_NODENAME': 'u20-computeibmgpu-vesta7',
 'SLURM_CLUSTER_NAME': 'cluster',
 'SLURM_CONF': '/cluster/slurm-20-11-8-1/etc/slurm.conf',
 'SLURM_CONSTRAINT': 'MULTIGPU',
 'SLURM_CPUS_ON_NODE': '4',
 'SLURM_CPUS_PER_TASK': '2',
 'SLURM_CPU_BIND': 'quiet,mask_cpu:0x020000020000,0x400000400000',
 'SLURM_CPU_BIND_LIST': '0x020000020000,0x400000400000',
 'SLURM_CPU_BIND_TYPE': 'mask_cpu:',
 'SLURM_CPU_BIND_VERBOSE': 'quiet',
 'SLURM_DISTRIBUTION': 'block',
 'SLURM_GTIDS': '0,1',
 'SLURM_JOBID': '72348',
 'SLURM_JOB_ACCOUNT': 'something',
 'SLURM_JOB_CPUS_PER_NODE': '4',
 'SLURM_JOB_CPUS_PER_NODE_PACK_GROUP_0': '4',
 'SLURM_JOB_GID': '891944109',
 'SLURM_JOB_GPUS': '5,6',
 'SLURM_JOB_ID': '72348',
 'SLURM_JOB_NAME': 'train.job',
 'SLURM_JOB_NODELIST': 'u20-computeibmgpu-vesta7',
 'SLURM_JOB_NUM_NODES': '1',
 'SLURM_JOB_PARTITION': 'standard',
 'SLURM_JOB_QOS': 'normal',
 'SLURM_JOB_UID': '891944109',
 'SLURM_JOB_USER': 'user',
 'SLURM_LAUNCH_NODE_IPADDR': '10.129.48.36',
 'SLURM_LOCALID': '1',
 'SLURM_MEM_PER_CPU': '16384',
 'SLURM_MPI_TYPE': 'pmi2',
 'SLURM_NNODES': '1',
 'SLURM_NODEID': '0',
 'SLURM_NODELIST': 'u20-computeibmgpu-vesta7',
 'SLURM_NODE_ALIASES': '(null)',
 'SLURM_NPROCS': '2',
 'SLURM_NTASKS': '2',
 'SLURM_NTASKS_PER_NODE': '2',
 'SLURM_PRIO_PROCESS': '0',
 'SLURM_PROCID': '1',
 'SLURM_SRUN_COMM_HOST': '10.129.48.36',
 'SLURM_SRUN_COMM_PORT': '39247',
 'SLURM_STEPID': '0',
 'SLURM_STEP_GPUS': '5,6',
 'SLURM_STEP_ID': '0',
 'SLURM_STEP_LAUNCHER_PORT': '39247',
 'SLURM_STEP_NODELIST': 'u20-computeibmgpu-vesta7',
 'SLURM_STEP_NUM_NODES': '1',
 'SLURM_STEP_NUM_TASKS': '2',
 'SLURM_STEP_RESV_PORTS': '12585-12587',
 'SLURM_STEP_TASKS_PER_NODE': '2',
 'SLURM_SUBMIT_DIR': '/data/user/code/rnn-st/scripts/slurm',
 'SLURM_SUBMIT_HOST': 'u20-computeibmgpu-vesta7',
 'SLURM_TASKS_PER_NODE': '2',
 'SLURM_TASK_PID': '577439',
 'SLURM_TOPOLOGY_ADDR': 'u20-computeibmgpu-vesta7',
 'SLURM_TOPOLOGY_ADDR_PATTERN': 'node',
 'SLURM_UMASK': '0002',
 'SLURM_WORKING_CLUSTER': 'cluster:u20-controller.hydra:6817:9216:109',
 'SPACK_ROOT': '/apps',
 'SQUEUE_FORMAT': '%8i %7u %12T %.3C %.6m %.12M %20e %R',
 'SRUN_DEBUG': '3',
 'SSH_AGENT_PID': '20309',
 'SSH_AUTH_SOCK': '/tmp/ssh-TLK7wup2nTiA/agent.20307',
 'SSH_CLIENT': '195.176.113.242 35588 22',
 'SSH_CONNECTION': '195.176.113.235 32866 172.16.0.75 22',
 'SSH_TTY': '/dev/pts/0',
 'TERM': 'tmux-256color',
 'TMPDIR': '/data/user/tmp/72348',
 'TMUX': '/tmp//tmux-891944109/default,132182,0',
 'TMUX_PANE': '%41',
 'TMUX_PLUGIN_MANAGER_PATH': '/home/user/.tmux/plugins/',
 'USER': 'user',
 'VECLIB_MAXIMUM_THREADS': '1',
 'WANDB_REQUIRE_SERVICE': 'True',
 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop',
 'XDG_RUNTIME_DIR': '/run/user/891944109',
 'XDG_SESSION_CLASS': 'user',
 'XDG_SESSION_ID': '511',
 'XDG_SESSION_TYPE': 'tty',
 'ZSH': '/home/user/.myconfig/zsh/oh-my-zsh',
 'ZSH_TMUX_CONFIG': '/home/user/.tmux.conf',
 'ZSH_TMUX_TERM': 'screen-256color',
 '_': '/cluster/slurm-20-11-8-1/bin/srun',
 '_CE_CONDA': '',
 '_CE_M': '',
 '_LMFILES_': '/apps/etc/modules/flavors/v100.lua:/apps/etc/modules/flavors/multigpu.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libiconv/1.16-pdflaob.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/xz/5.2.5-mhrz5su.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/zlib/1.2.12-j4b6zeg.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libxml2/2.9.12-koohqap.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/cuda/11.4.4-ldlywt5.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libnl/3.3.0-qtnpjoa.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/rdma-core/41.0-hquyri7.lua:/apps/etc/modules/multigpu/nccl/2.11.4-1.lua',
 '_ModuleTable001_': 'X01vZHVsZVRhYmxlXyA9IHsKTVR2ZXJzaW9uID0gMywKY19yZWJ1aWxkVGltZSA9IDcyMDAuMCwKY19zaG9ydFRpbWUgPSAwLjM5MTMyNDk5Njk0ODI0LApkZXB0aFQgPSB7fSwKZmFtaWx5ID0gewpncmVzID0gInYxMDAiLApyZXNvdXJjZSA9ICJtdWx0aWdwdSIsCn0sCm1UID0gewpjdWRhID0gewpmbiA9ICIvYXBwcy9zaGFyZS9zcGFjay9sbW9kL2xpbnV4LXVidW50dTIwLjA0LXg4Nl82NC9Db3JlL2N1ZGEvMTEuNC40LWxkbHl3dDUubHVhIiwKZnVsbE5hbWUgPSAiY3VkYS8xMS40LjQtbGRseXd0NSIsCmxvYWRPcmRlciA9IDcsCnByb3BUID0ge30sCnN0YWNrRGVwdGggPSAwLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAiY3VkYS8xMS40LjQtbGRseXd0NSIs',
 '_ModuleTable002_': 'CndWID0gIjAwMDAwMDAxMS4wMDAwMDAwMDQuMDAwMDAwMDA0LipsZGx5d3QuMDAwMDAwMDA1Lip6ZmluYWwiLAp9LApsaWJpY29udiA9IHsKZm4gPSAiL2FwcHMvc2hhcmUvc3BhY2svbG1vZC9saW51eC11YnVudHUyMC4wNC14ODZfNjQvQ29yZS9saWJpY29udi8xLjE2LXBkZmxhb2IubHVhIiwKZnVsbE5hbWUgPSAibGliaWNvbnYvMS4xNi1wZGZsYW9iIiwKbG9hZE9yZGVyID0gMywKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJsaWJpY29udi8xLjE2LXBkZmxhb2IiLAp3ViA9ICIwMDAwMDAwMDEuMDAwMDAwMDE2LipkZmxhb2IuKnpmaW5hbCIsCn0sCmxpYm5sID0gewpmbiA9ICIvYXBwcy9zaGFyZS9zcGFjay9s',
 '_ModuleTable003_': 'bW9kL2xpbnV4LXVidW50dTIwLjA0LXg4Nl82NC9Db3JlL2xpYm5sLzMuMy4wLXF0bnBqb2EubHVhIiwKZnVsbE5hbWUgPSAibGlibmwvMy4zLjAtcXRucGpvYSIsCmxvYWRPcmRlciA9IDgsCnByb3BUID0ge30sCnN0YWNrRGVwdGggPSAwLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAibGlibmwvMy4zLjAtcXRucGpvYSIsCndWID0gIjAwMDAwMDAwMy4wMDAwMDAwMDMuKnF0bnBqb2EuKnpmaW5hbCIsCn0sCmxpYnhtbDIgPSB7CmZuID0gIi9hcHBzL3NoYXJlL3NwYWNrL2xtb2QvbGludXgtdWJ1bnR1MjAuMDQteDg2XzY0L0NvcmUvbGlieG1sMi8yLjkuMTIta29vaHFhcC5sdWEiLApmdWxsTmFtZSA9ICJsaWJ4bWwyLzIuOS4xMi1rb29ocWFwIiwKbG9hZE9yZGVy',
 '_ModuleTable004_': 'ID0gNiwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJsaWJ4bWwyLzIuOS4xMi1rb29ocWFwIiwKd1YgPSAiMDAwMDAwMDAyLjAwMDAwMDAwOS4wMDAwMDAwMTIuKmtvb2hxYXAuKnpmaW5hbCIsCn0sCm11bHRpZ3B1ID0gewpmbiA9ICIvYXBwcy9ldGMvbW9kdWxlcy9mbGF2b3JzL211bHRpZ3B1Lmx1YSIsCmZ1bGxOYW1lID0gIm11bHRpZ3B1IiwKbG9hZE9yZGVyID0gMiwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJtdWx0aWdwdSIsCndWID0gIk0uKnpmaW5hbCIsCn0sCm5jY2wgPSB7CmZuID0gIi9hcHBzL2V0Yy9tb2R1bGVzL211bHRpZ3B1L25jY2wv',
 '_ModuleTable005_': 'Mi4xMS40LTEubHVhIiwKZnVsbE5hbWUgPSAibmNjbC8yLjExLjQtMSIsCmxvYWRPcmRlciA9IDEwLApwcm9wVCA9IHt9LApzdGFja0RlcHRoID0gMCwKc3RhdHVzID0gImFjdGl2ZSIsCnVzZXJOYW1lID0gIm5jY2wiLAp3ViA9ICIwMDAwMDAwMDIuMDAwMDAwMDExLjAwMDAwMDAwNC4qemZpbmFsLS4wMDAwMDAwMDEuKnpmaW5hbCIsCn0sClsicmRtYS1jb3JlIl0gPSB7CmZuID0gIi9hcHBzL3NoYXJlL3NwYWNrL2xtb2QvbGludXgtdWJ1bnR1MjAuMDQteDg2XzY0L0NvcmUvcmRtYS1jb3JlLzQxLjAtaHF1eXJpNy5sdWEiLApmdWxsTmFtZSA9ICJyZG1hLWNvcmUvNDEuMC1ocXV5cmk3IiwKbG9hZE9yZGVyID0gOSwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1',
 '_ModuleTable006_': 'cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJyZG1hLWNvcmUvNDEuMC1ocXV5cmk3IiwKd1YgPSAiMDAwMDAwMDQxLipocXV5cmkuMDAwMDAwMDA3Lip6ZmluYWwiLAp9LAp2MTAwID0gewpmbiA9ICIvYXBwcy9ldGMvbW9kdWxlcy9mbGF2b3JzL3YxMDAubHVhIiwKZnVsbE5hbWUgPSAidjEwMCIsCmxvYWRPcmRlciA9IDEsCnByb3BUID0ge30sCnN0YWNrRGVwdGggPSAwLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAidjEwMCIsCndWID0gIk0uKnpmaW5hbCIsCn0sCnh6ID0gewpmbiA9ICIvYXBwcy9zaGFyZS9zcGFjay9sbW9kL2xpbnV4LXVidW50dTIwLjA0LXg4Nl82NC9Db3JlL3h6LzUuMi41LW1ocno1c3UubHVhIiwKZnVsbE5hbWUgPSAieHovNS4yLjUtbWhy',
 '_ModuleTable007_': 'ejVzdSIsCmxvYWRPcmRlciA9IDQsCnByb3BUID0ge30sCnN0YWNrRGVwdGggPSAwLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAieHovNS4yLjUtbWhyejVzdSIsCndWID0gIjAwMDAwMDAwNS4wMDAwMDAwMDIuMDAwMDAwMDA1LiptaHJ6LjAwMDAwMDAwNS4qc3UuKnpmaW5hbCIsCn0sCnpsaWIgPSB7CmZuID0gIi9hcHBzL3NoYXJlL3NwYWNrL2xtb2QvbGludXgtdWJ1bnR1MjAuMDQteDg2XzY0L0NvcmUvemxpYi8xLjIuMTItajRiNnplZy5sdWEiLApmdWxsTmFtZSA9ICJ6bGliLzEuMi4xMi1qNGI2emVnIiwKbG9hZE9yZGVyID0gNSwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJ6bGliLzEuMi4xMi1q',
 '_ModuleTable008_': 'NGI2emVnIiwKd1YgPSAiMDAwMDAwMDAxLjAwMDAwMDAwMi4wMDAwMDAwMTIuKmouMDAwMDAwMDA0LipiLjAwMDAwMDAwNi4qemVnLip6ZmluYWwiLAp9LAp9LAptcGF0aEEgPSB7CiIvYXBwcy9ldGMvbW9kdWxlcy9tdWx0aWdwdSIKLCAiL2FwcHMvc2hhcmUvc3BhY2svbG1vZC9saW51eC11YnVudHUyMC4wNC14ODZfNjQvQ29yZSIKLCAiL2FwcHMvZXRjL21vZHVsZXMvc3lzdGVtIiwgIi9hcHBzL2V0Yy9tb2R1bGVzL2NvbnRhaW5lcnMiCiwgIi9hcHBzL2V0Yy9tb2R1bGVzL2RlZmF1bHQiLCAiL2FwcHMvZXRjL21vZHVsZXMvZmxhdm9ycyIsCn0sCnN5c3RlbUJhc2VNUEFUSCA9ICIvYXBwcy9zaGFyZS9zcGFjay9sbW9kL2xpbnV4LXVidW50dTIwLjA0LXg4Nl82NC9Db3Jl',
 '_ModuleTable009_': 'Oi9hcHBzL2V0Yy9tb2R1bGVzL3N5c3RlbTovYXBwcy9ldGMvbW9kdWxlcy9jb250YWluZXJzOi9hcHBzL2V0Yy9tb2R1bGVzL2RlZmF1bHQ6L2FwcHMvZXRjL21vZHVsZXMvZmxhdm9ycyIsCn0K',
 '_ModuleTable_Sz_': '9',
 '_ZSH_TMUX_FIXED_CONFIG': '/home/user/.myconfig/zsh/oh-my-zsh/plugins/tmux/tmux.extra.conf',
 '__LMOD_REF_COUNT_ACLOCAL_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/share/aclocal:2',
 '__LMOD_REF_COUNT_CMAKE_PREFIX_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy:2',
 '__LMOD_REF_COUNT_CPATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/include:1',
 '__LMOD_REF_COUNT_LIBRARY_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/lib64:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/lib:1',
 '__LMOD_REF_COUNT_LOADEDMODULES': 'v100:1;multigpu:1;libiconv/1.16-pdflaob:1;xz/5.2.5-mhrz5su:1;zlib/1.2.12-j4b6zeg:1;libxml2/2.9.12-koohqap:1;cuda/11.4.4-ldlywt5:1;libnl/3.3.0-qtnpjoa:1;rdma-core/41.0-hquyri7:1;nccl/2.11.4-1:1',
 '__LMOD_REF_COUNT_MANPATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/share/man:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/share/man:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/share/man:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/share/man:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/share/man:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/share/man:2;/cluster/lmod-8.6.5/lmod/lmod/share/man:1;/var/cfengine/share/man:1',
 '__LMOD_REF_COUNT_MODULEPATH': '/apps/etc/modules/multigpu:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core:1;/apps/etc/modules/system:1;/apps/etc/modules/containers:1;/apps/etc/modules/default:1;/apps/etc/modules/flavors:1',
 '__LMOD_REF_COUNT_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/bin:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/bin:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/bin:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/bin:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/bin:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/bin:2;/data/user/programs/mambaforge/envs/rnn-st/bin:1;/data/user/programs/mambaforge/condabin:1;/cluster/slurm-20-11-8-1/bin:1;/cluster/slurm-20-11-8-1/sbin:1;/usr/local/sbin:1;/usr/local/bin:1;/usr/sbin:1;/usr/bin:1;/sbin:1;/bin:1;/usr/games:1;/usr/local/games:1;/snap/bin:1;/var/cfengine/bin:1;/usr/local/go/bin:3',
 '__LMOD_REF_COUNT_PKG_CONFIG_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/lib/pkgconfig:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/lib/pkgconfig:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/lib/pkgconfig:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/lib/pkgconfig:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/lib/pkgconfig:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/lib/pkgconfig:2',
 '__LMOD_REF_COUNT__LMFILES_': '/apps/etc/modules/flavors/v100.lua:1;/apps/etc/modules/flavors/multigpu.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libiconv/1.16-pdflaob.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/xz/5.2.5-mhrz5su.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/zlib/1.2.12-j4b6zeg.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libxml2/2.9.12-koohqap.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/cuda/11.4.4-ldlywt5.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libnl/3.3.0-qtnpjoa.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/rdma-core/41.0-hquyri7.lua:1;/apps/etc/modules/multigpu/nccl/2.11.4-1.lua:1',
 '__LMOD_SET_FPATH': '1',
 'ftp_proxy': 'http://wtp.hydra:8080',
 'http_proxy': 'http://wtp.hydra:8080',
 'https_proxy': 'http://wtp.hydra:8080',
 'no_proxy': 'localhost,127.0.0.1,10.129.60.84,.hydra,.int,',
 'tmux_version': '3.0'}
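Most of a dump like the one above is unrelated to process launching. A short sketch for pulling out only the task-layout variables is shown below; the choice of keys is an assumption based on the names visible in the dump, and which of them Lightning actually reads is not stated in this thread.

```python
import os

# Variables copied from the dump above that describe how srun laid out the tasks.
LAYOUT_KEYS = (
    "SLURM_NTASKS",
    "SLURM_NTASKS_PER_NODE",
    "SLURM_PROCID",
    "SLURM_LOCALID",
    "SLURM_NODEID",
    "SLURM_NNODES",
    "CUDA_VISIBLE_DEVICES",
)


def task_layout(environ=None):
    """Return the layout-related variables, with None for any that are unset."""
    env = os.environ if environ is None else environ
    return {key: env.get(key) for key in LAYOUT_KEYS}


# Print once per process at the start of the script launched by srun.
print(task_layout())
```

Printing this from every task makes it easy to check that each process sees a distinct `SLURM_PROCID` and that the task count matches the number of GPUs requested.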

The os.environ output on the server where I observe the described issue:

{'BASH_ENV': '/cluster/apps/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod/init/bash',
 'BASH_FUNC_ml%%': '() {  eval $($LMOD_DIR/ml_cmd "$@")\n}',
 'BASH_FUNC_ml()': '() {  eval $($LMOD_DIR/ml_cmd "$@")\n}',
 'BASH_FUNC_module%%': '() {  eval $($LMOD_CMD bash "$@") && eval '
                       '$(${LMOD_SETTARG_CMD:-:} -s sh)\n'
                       '}',
 'BASH_FUNC_module()': '() {  eval $($LMOD_CMD bash "$@") && eval '
                       '$(${LMOD_SETTARG_CMD:-:} -s sh)\n'
                       '}',
 'CC': '/usr/bin/gcc',
 'CMAKE_PREFIX_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3:/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf:/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa',
 'CONDA_DEFAULT_ENV': 'rnn-st',
 'CONDA_EXE': '/cluster/project/lab/me/programs/mambaforge/bin/conda',
 'CONDA_MKL_INTERFACE_LAYER_BACKUP': '',
 'CONDA_PREFIX': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st',
 'CONDA_PROMPT_MODIFIER': '(rnn-st) ',
 'CONDA_PYTHON_EXE': '/cluster/project/lab/me/programs/mambaforge/bin/python',
 'CONDA_SHLVL': '1',
 'CONSUL_HTTP_ADDR': 'unix:///var/run/consul/http.sock',
 'CPATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/include',
 'CPP': '/usr/bin/cpp',
 'CRC32C_SW_MODE': 'auto',
 'CUDA_DEVICE_ORDER': 'PCI_BUS_ID',
 'CUDA_VISIBLE_DEVICES': '0,1',
 'CXX': '/usr/bin/g++',
 'DISPLAY': 'localhost:11.0',
 'ENVIRONMENT': 'BATCH',
 'F77': '/usr/bin/gfortran',
 'F90': '/usr/bin/gfortran',
 'FC': '/usr/bin/gfortran',
 'HISTCONTROL': 'ignoredups',
 'HISTSIZE': '50000',
 'HOME': '/cluster/home/user',
 'HOSTNAME': 'eu-g4-015',
 'I_MPI_PMI_LIBRARY': '/cluster/apps/slurm/lib/libpmi2.so',
 'LANG': 'en_US.UTF-8',
 'LD_LIBRARY_PATH': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/../../lib64:/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib:/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/lib:/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/lib::',
 'LESS': '-R',
 'LESSOPEN': '||/usr/bin/lesspipe.sh %s',
 'LIBGL_ALWAYS_INDIRECT': '1',
 'LIBRARY_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib:/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/lib',
 'LMOD_CMD': '/cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod/libexec/lmod',
 'LMOD_DIR': '/cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod/libexec',
 'LMOD_FAMILY_COMPILER': 'gcc',
 'LMOD_FAMILY_COMPILER_VERSION': '4.8.5',
 'LMOD_PKG': '/cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod',
 'LMOD_SETTARG_FULL_SUPPORT': 'no',
 'LMOD_SYSTEM_DEFAULT_MODULES': 'StdEnv:gcc/4.8.5',
 'LMOD_VERSION': '7.7.13',
 'LMOD_sys': 'Linux',
 'LOADEDMODULES': 'StdEnv:gcc/4.8.5:zsh/5.8:tmux/3.2a:proxy:nccl/2.11.4-1',
 'LOGNAME': 'user',
 'LSCOLORS': 'Gxfxcxdxbxegedabagacad',
 'LSF_BINDIR': '/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/bin',
 'LSF_ENVDIR': '/cluster/apps/lsf/conf',
 'LSF_LIBDIR': '/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/lib',
 'LSF_SERVERDIR': '/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/etc',
 'LS_COLORS': 'rs=0:di=38;5;27:ln=38;5;51:mh=44;38;5;15:pi=40;38;5;11:so=38;5;13:do=38;5;5:bd=48;5;232;38;5;11:cd=48;5;232;38;5;3:or=48;5;232;38;5;9:mi=05;48;5;232;38;5;15:su=48;5;196;38;5;15:sg=48;5;11;38;5;16:ca=48;5;196;38;5;226:tw=48;5;10;38;5;16:ow=48;5;10;38;5;21:st=48;5;21;38;5;15:ex=38;5;34:*.tar=38;5;9:*.tgz=38;5;9:*.arc=38;5;9:*.arj=38;5;9:*.taz=38;5;9:*.lha=38;5;9:*.lz4=38;5;9:*.lzh=38;5;9:*.lzma=38;5;9:*.tlz=38;5;9:*.txz=38;5;9:*.tzo=38;5;9:*.t7z=38;5;9:*.zip=38;5;9:*.z=38;5;9:*.Z=38;5;9:*.dz=38;5;9:*.gz=38;5;9:*.lrz=38;5;9:*.lz=38;5;9:*.lzo=38;5;9:*.xz=38;5;9:*.bz2=38;5;9:*.bz=38;5;9:*.tbz=38;5;9:*.tbz2=38;5;9:*.tz=38;5;9:*.deb=38;5;9:*.rpm=38;5;9:*.jar=38;5;9:*.war=38;5;9:*.ear=38;5;9:*.sar=38;5;9:*.rar=38;5;9:*.alz=38;5;9:*.ace=38;5;9:*.zoo=38;5;9:*.cpio=38;5;9:*.7z=38;5;9:*.rz=38;5;9:*.cab=38;5;9:*.jpg=38;5;13:*.jpeg=38;5;13:*.gif=38;5;13:*.bmp=38;5;13:*.pbm=38;5;13:*.pgm=38;5;13:*.ppm=38;5;13:*.tga=38;5;13:*.xbm=38;5;13:*.xpm=38;5;13:*.tif=38;5;13:*.tiff=38;5;13:*.png=38;5;13:*.svg=38;5;13:*.svgz=38;5;13:*.mng=38;5;13:*.pcx=38;5;13:*.mov=38;5;13:*.mpg=38;5;13:*.mpeg=38;5;13:*.m2v=38;5;13:*.mkv=38;5;13:*.webm=38;5;13:*.ogm=38;5;13:*.mp4=38;5;13:*.m4v=38;5;13:*.mp4v=38;5;13:*.vob=38;5;13:*.qt=38;5;13:*.nuv=38;5;13:*.wmv=38;5;13:*.asf=38;5;13:*.rm=38;5;13:*.rmvb=38;5;13:*.flc=38;5;13:*.avi=38;5;13:*.fli=38;5;13:*.flv=38;5;13:*.gl=38;5;13:*.dl=38;5;13:*.xcf=38;5;13:*.xwd=38;5;13:*.yuv=38;5;13:*.cgm=38;5;13:*.emf=38;5;13:*.axv=38;5;13:*.anx=38;5;13:*.ogv=38;5;13:*.ogx=38;5;13:*.aac=38;5;45:*.au=38;5;45:*.flac=38;5;45:*.mid=38;5;45:*.midi=38;5;45:*.mka=38;5;45:*.mp3=38;5;45:*.mpc=38;5;45:*.ogg=38;5;45:*.ra=38;5;45:*.wav=38;5;45:*.axa=38;5;45:*.oga=38;5;45:*.spx=38;5;45:*.xspf=38;5;45:',
 'MAIL': '/var/spool/mail/user',
 'MANPATH': '/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf/share/man:/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/share/man:/cluster/apps/sfos/share/man/man1:/cluster/apps/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod/share/man:/cluster/apps/lsf/10.1/man::',
 'MKL_INTERFACE_LAYER': 'LP64,GNU',
 'MKL_NUM_THREADS': '1',
 'MODULEPATH': '/cluster/apps/lmodules/Compiler/gcc/4.8.5:/cluster/apps/lmodules/Linux:/cluster/apps/lmodules/Core',
 'MODULEPATH_ROOT': '/cluster/apps/lmodules',
 'MODULESHOME': '/cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod',
 'NCCL_ROOT': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3',
 'NUMEXPR_NUM_THREADS': '1',
 'OLDPWD': '/cluster/project/lab/me/code/rnn-st',
 'OMP_NUM_THREADS': '1',
 'OPENBLAS_NUM_THREADS': '1',
 'OPENCV_OPENCL_RUNTIME': 'disabled',
 'PAGER': 'less',
 'PATH': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st/bin:/cluster/project/lab/me/programs/mambaforge/condabin:/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf/bin:/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/bin:/cluster/apps/local:/cluster/apps/sfos/bin:/cluster/apps/slurm/bin:/usr/lib64/qt-3.3/bin:/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/cluster/home/user/.local/bin:/cluster/home/user/bin:/usr/local/go/bin:/usr/local/go/bin',
 'PKG_CONFIG_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib/pkgconfig',
 'PMI_FD': '10',
 'PMI_JOBID': '7099018.0',
 'PMI_RANK': '0',
 'PMI_SIZE': '1',
 'PWD': '/cluster/project/lab/me/code/rnn-st/scripts/slurm',
 'PYTORCH_NVML_BASED_CUDA_CHECK': '1',
 'QTDIR': '/usr/lib64/qt-3.3',
 'QTINC': '/usr/lib64/qt-3.3/include',
 'QTLIB': '/usr/lib64/qt-3.3/lib',
 'QT_GRAPHICSSYSTEM_CHECKED': '1',
 'QT_QPA_FONTDIR': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/qt/fonts',
 'QT_QPA_PLATFORM_PLUGIN_PATH': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/qt/plugins',
 'SCRATCH': '/cluster/scratch/user',
 'SHELL': '/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/bin/zsh',
 'SHLVL': '3',
 'SHOST': 'eu-login-41',
 'SLURMD_NODENAME': 'eu-g4-015',
 'SLURM_CLUSTER_NAME': 'cluster',
 'SLURM_CONF': '/cluster/slurm/adm/etc/slurm.conf',
 'SLURM_CPUS_ON_NODE': '3',
 'SLURM_CPUS_PER_TASK': '3',
 'SLURM_CPU_BIND_LIST': '0x0000000000000000000000000000001C',
 'SLURM_CPU_BIND_TYPE': 'mask_cpu:',
 'SLURM_CPU_BIND_VERBOSE': 'quiet',
 'SLURM_CPU_BIND': 'quiet,mask_cpu:0x0000000000000000000000000000001C',
 'SLURM_DISTRIBUTION': 'cyclic',
 'SLURM_GPUS': 'nvidia_geforce_rtx_3090:2',
 'SLURM_GPUS_ON_NODE': '2',
 'SLURM_GTIDS': '0',
 'SLURM_JOBID': '7099018',
 'SLURM_JOB_ACCOUNT': 'gpuhe/es_scara',
 'SLURM_JOB_CPUS_PER_NODE': '3',
 'SLURM_JOB_CPUS_PER_NODE_PACK_GROUP_0': '3',
 'SLURM_JOB_GID': '476131',
 'SLURM_JOB_GPUS': '2,3',
 'SLURM_JOB_ID': '7099018',
 'SLURM_JOB_NAME': 'train.job',
 'SLURM_JOB_NODELIST': 'eu-g4-015',
 'SLURM_JOB_NUM_NODES': '1',
 'SLURM_JOB_PARTITION': 'gpuhe.120h',
 'SLURM_JOB_QOS': 'es_scara/gpuhe',
 'SLURM_JOB_UID': '575154',
 'SLURM_JOB_USER': 'user',
 'SLURM_LAUNCH_NODE_IPADDR': '10.205.100.15',
 'SLURM_LOCALID': '0',
 'SLURM_MEM_PER_CPU': '32768',
 'SLURM_MPI_TYPE': 'pmi2',
 'SLURM_NNODES': '1',
 'SLURM_NODEID': '0',
 'SLURM_NODELIST': 'eu-g4-015',
 'SLURM_NODE_ALIASES': '(null)',
 'SLURM_NPROCS': '1',
 'SLURM_NTASKS': '1',
 'SLURM_NTASKS_PER_NODE': '2',
 'SLURM_PRIO_PROCESS': '0',
 'SLURM_PROCID': '0',
 'SLURM_SCRIPT_CONTEXT': 'prolog_task',
 'SLURM_SRUN_COMM_HOST': '10.205.100.15',
 'SLURM_SRUN_COMM_PORT': '40015',
 'SLURM_STEPID': '0',
 'SLURM_STEP_GPUS': '2,3',
 'SLURM_STEP_ID': '0',
 'SLURM_STEP_LAUNCHER_PORT': '40015',
 'SLURM_STEP_NODELIST': 'eu-g4-015',
 'SLURM_STEP_NUM_NODES': '1',
 'SLURM_STEP_NUM_TASKS': '1',
 'SLURM_STEP_TASKS_PER_NODE': '1',
 'SLURM_SUBMIT_DIR': '/cluster/project/lab/me/code/rnn-st/scripts/slurm',
 'SLURM_SUBMIT_HOST': 'eu-login-41',
 'SLURM_TASKS_PER_NODE': '1',
 'SLURM_TASK_PID': '122949',
 'SLURM_TOPOLOGY_ADDR': '.cluster_gpuhe.eu-g4-015',
 'SLURM_TOPOLOGY_ADDR_PATTERN': 'switch.switch.node',
 'SLURM_UMASK': '0027',
 'SLURM_WORKING_CLUSTER': 'cluster:10.205.212.30:6817:9728:109',
 'SRUN_DEBUG': '3',
 'SSH_AGENT_PID': '12110',
 'SSH_AUTH_SOCK': '/tmp/ssh-kaM5RUgK0O4U/agent.12108',
 'SSH_CLIENT': '10.6.209.217 33612 22',
 'SSH_CONNECTION': '10.6.208.201 54404 129.132.93.116 22',
 'SSH_TTY': '/dev/pts/8',
 'TERM': 'tmux-256color',
 'TERM_PROGRAM': 'tmux',
 'TERM_PROGRAM_VERSION': '3.2a',
 'TMOUT': '86400',
 'TMPDIR': '/scratch/tmp.7099018.user',
 'TMUX': '/tmp/tmux-575154/default,12176,0',
 'TMUX_PANE': '%15',
 'TMUX_PLUGIN_MANAGER_PATH': '/cluster/home/user/.tmux/plugins/',
 'TMUX_ROOT': '/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf',
 'USER': 'user',
 'VECLIB_MAXIMUM_THREADS': '1',
 'WANDB_REQUIRE_SERVICE': 'True',
 'XDG_RUNTIME_DIR': '/run/user/575154',
 'XDG_SESSION_ID': '4695',
 'XML_CATALOG_FILES': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st/etc/xml/catalog',
 'ZSH': '/cluster/home/user/.myconfig/zsh/oh-my-zsh',
 'ZSH_ROOT': '/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa',
 'ZSH_TMUX_CONFIG': '/cluster/home/user/.tmux.conf',
 'ZSH_TMUX_TERM': 'screen-256color',
 '_': '/cluster/apps/slurm/bin/srun',
 '_CE_CONDA': '',
 '_CE_M': '',
 '_LMFILES_': '/cluster/apps/lmodules/Core/StdEnv.lua:/cluster/apps/lmodules/Core/gcc/4.8.5.lua:/cluster/apps/lmodules/Compiler/gcc/4.8.5/zsh/5.8.lua:/cluster/apps/lmodules/Compiler/gcc/4.8.5/tmux/3.2a.lua:/cluster/apps/lmodules/Core/proxy.lua:/cluster/apps/lmodules/Compiler/gcc/4.8.5/nccl/2.11.4-1.lua',
 '_ModuleTable001_': 'X01vZHVsZVRhYmxlXz17WyJNVHZlcnNpb24iXT0zLFsiY19yZWJ1aWxkVGltZSJdPWZhbHNlLFsiY19zaG9ydFRpbWUiXT1mYWxzZSxkZXB0aFQ9e30sZmFtaWx5PXtbImNvbXBpbGVyIl09ImdjYyIsfSxtVD17U3RkRW52PXtbImZuIl09Ii9jbHVzdGVyL2FwcHMvbG1vZHVsZXMvQ29yZS9TdGRFbnYubHVhIixbImZ1bGxOYW1lIl09IlN0ZEVudiIsWyJsb2FkT3JkZXIiXT0xLHByb3BUPXt9LFsic3RhY2tEZXB0aCJdPTAsWyJzdGF0dXMiXT0iYWN0aXZlIixbInVzZXJOYW1lIl09IlN0ZEVudiIsfSxldGhfcHJveHk9e1siZm4iXT0iL2NsdXN0ZXIvYXBwcy9sbW9kdWxlcy9Db3JlL2V0aF9wcm94eS5sdWEiLFsiZnVsbE5hbWUiXT0iZXRoX3Byb3h5IixbImxvYWRPcmRlciJd',
 '_ModuleTable002_': 'PTUscHJvcFQ9e30sWyJzdGFja0RlcHRoIl09MCxbInN0YXR1cyJdPSJhY3RpdmUiLFsidXNlck5hbWUiXT0iZXRoX3Byb3h5Iix9LGdjYz17WyJmbiJdPSIvY2x1c3Rlci9hcHBzL2xtb2R1bGVzL0NvcmUvZ2NjLzQuOC41Lmx1YSIsWyJmdWxsTmFtZSJdPSJnY2MvNC44LjUiLFsibG9hZE9yZGVyIl09Mixwcm9wVD17fSxbInN0YWNrRGVwdGgiXT0wLFsic3RhdHVzIl09ImFjdGl2ZSIsWyJ1c2VyTmFtZSJdPSJnY2MvNC44LjUiLH0sbmNjbD17WyJmbiJdPSIvY2x1c3Rlci9hcHBzL2xtb2R1bGVzL0NvbXBpbGVyL2djYy80LjguNS9uY2NsLzIuMTEuNC0xLmx1YSIsWyJmdWxsTmFtZSJdPSJuY2NsLzIuMTEuNC0xIixbImxvYWRPcmRlciJdPTYscHJvcFQ9e30sWyJzdGFja0Rl',
 '_ModuleTable003_': 'cHRoIl09MCxbInN0YXR1cyJdPSJhY3RpdmUiLFsidXNlck5hbWUiXT0ibmNjbCIsfSx0bXV4PXtbImZuIl09Ii9jbHVzdGVyL2FwcHMvbG1vZHVsZXMvQ29tcGlsZXIvZ2NjLzQuOC41L3RtdXgvMy4yYS5sdWEiLFsiZnVsbE5hbWUiXT0idG11eC8zLjJhIixbImxvYWRPcmRlciJdPTQscHJvcFQ9e30sWyJzdGFja0RlcHRoIl09MCxbInN0YXR1cyJdPSJhY3RpdmUiLFsidXNlck5hbWUiXT0idG11eCIsfSx6c2g9e1siZm4iXT0iL2NsdXN0ZXIvYXBwcy9sbW9kdWxlcy9Db21waWxlci9nY2MvNC44LjUvenNoLzUuOC5sdWEiLFsiZnVsbE5hbWUiXT0ienNoLzUuOCIsWyJsb2FkT3JkZXIiXT0zLHByb3BUPXt9LFsic3RhY2tEZXB0aCJdPTAsWyJzdGF0dXMiXT0iYWN0aXZlIixb',
 '_ModuleTable004_': 'InVzZXJOYW1lIl09InpzaCIsfSx9LG1wYXRoQT17Ii9jbHVzdGVyL2FwcHMvbG1vZHVsZXMvQ29tcGlsZXIvZ2NjLzQuOC41IiwiL2NsdXN0ZXIvYXBwcy9sbW9kdWxlcy9MaW51eCIsIi9jbHVzdGVyL2FwcHMvbG1vZHVsZXMvQ29yZSIsfSxbInN5c3RlbUJhc2VNUEFUSCJdPSIvY2x1c3Rlci9hcHBzL2xtb2R1bGVzL0xpbnV4Oi9jbHVzdGVyL2FwcHMvbG1vZHVsZXMvQ29yZSIsfQ==',
 '_ModuleTable_Sz_': '4',
 '_ZSH_TMUX_FIXED_CONFIG': '/cluster/home/user/.myconfig/zsh/oh-my-zsh/plugins/tmux/tmux.extra.conf',
 '__Init_Default_Modules': '1',
 '__LMOD_REF_COUNT_CMAKE_PREFIX_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3:1;/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf:1;/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa:1',
 '__LMOD_REF_COUNT_CPATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/include:1',
 '__LMOD_REF_COUNT_LD_LIBRARY_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib:1;/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/lib:1;/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/lib:1',
 '__LMOD_REF_COUNT_LIBRARY_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib:1;/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/lib:1',
 '__LMOD_REF_COUNT_LOADEDMODULES': 'StdEnv:1;gcc/4.8.5:1;zsh/5.8:1;tmux/3.2a:1;proxy:1;nccl/2.11.4-1:1',
 '__LMOD_REF_COUNT_MANPATH': '/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf/share/man:1;/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/share/man:1;/cluster/apps/sfos/share/man/man1:1;/cluster/apps/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod/share/man:1;/cluster/apps/lsf/10.1/man:1',
 '__LMOD_REF_COUNT_PATH': '/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf/bin:1;/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/bin:1;/cluster/apps/local:2;/cluster/apps/sfos/bin:1;/cluster/apps/slurm/bin:1;/usr/lib64/qt-3.3/bin:1;/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/bin:1;/usr/local/bin:1;/usr/bin:1;/usr/local/sbin:1;/usr/sbin:1;/cluster/home/user/.local/bin:1;/cluster/home/user/bin:1',
 '__LMOD_REF_COUNT_PKG_CONFIG_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib/pkgconfig:1',
 '__LMOD_REF_COUNT__LMFILES_': '/cluster/apps/lmodules/Core/StdEnv.lua:1;/cluster/apps/lmodules/Core/gcc/4.8.5.lua:1;/cluster/apps/lmodules/Compiler/gcc/4.8.5/zsh/5.8.lua:1;/cluster/apps/lmodules/Compiler/gcc/4.8.5/tmux/3.2a.lua:1;/cluster/apps/lmodules/Core/proxy.lua:1;/cluster/apps/lmodules/Compiler/gcc/4.8.5/nccl/2.11.4-1.lua:1',
 'ftp_proxy': 'http://blabla:3128',
 'http_proxy': 'http://blabla:3128',
 'https_proxy': 'http://blabla:3128',
 'no_proxy': 'api.wandb.ai,app.neptune.ai',
 'tmux_version': '3.2',
 'xml_catalog_files_libxslt': ''}

From a quick scan I see that SLURM_NTASKS is 2 (as expected) on the working server and 1 on the problematic server. I don't know yet why this is the case, because the only place I set the number of tasks (to 2) is the sbatch script. Just my first observation so far.
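A comparison like this can be automated rather than done by eye. The following is a minimal sketch, assuming both environment dumps are available as Python dicts; `diff_env` is a hypothetical helper, and the two small dicts are illustrative excerpts, not the full dumps.

```python
def diff_env(working, problematic, prefix="SLURM_"):
    """Return {key: (working_value, problematic_value)} for differing keys."""
    keys = {k for k in (*working, *problematic) if k.startswith(prefix)}
    return {
        k: (working.get(k), problematic.get(k))
        for k in sorted(keys)
        if working.get(k) != problematic.get(k)
    }


# Illustrative excerpts only; the real dumps contain many more entries.
working = {"SLURM_NTASKS": "2", "SLURM_NNODES": "1"}
problematic = {"SLURM_NTASKS": "1", "SLURM_NNODES": "1"}

for key, (good, bad) in diff_env(working, problematic).items():
    print(f"{key}: working={good!r} problematic={bad!r}")
```

Restricting the comparison to the `SLURM_` prefix keeps host-specific noise such as paths and session IDs out of the result.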

magehrig commented 1 year ago

I found a workaround. Strangely, when I additionally set ntasks=NUM_GPUS, DDP works as expected. In this case, on the problematic cluster I get SLURM_NTASKS=NUM_GPUS and the script runs correctly. So the augmented sbatch script is:

#!/usr/bin/env bash
#SBATCH --parsable
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=16G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=3
#SBATCH --gpus=rtx_3090:2
#SBATCH --output=/some/path/%j.out

module load nccl
source /some/path/conda.sh
conda activate myenv
srun python myscript.py ...
conda deactivate

No idea why ntasks-per-node is not sufficient.

magehrig commented 1 year ago

Got a response from the cluster support. Apparently they still need to configure: #tasks = #node * #ntasks-per-node.
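The relation above can be checked from inside a job. This is a minimal sketch, assuming the job was submitted with `--ntasks-per-node` so that `SLURM_NTASKS_PER_NODE` is set; the helper names are hypothetical, and the sample values are the ones from the environment dump of the problematic cluster.

```python
import os


def expected_ntasks(env):
    """Task count implied by nodes * ntasks-per-node."""
    return int(env["SLURM_NNODES"]) * int(env["SLURM_NTASKS_PER_NODE"])


def ntasks_mismatch(env):
    """Return a warning string if SLURM_NTASKS disagrees with the product, else None."""
    required = ("SLURM_NNODES", "SLURM_NTASKS_PER_NODE", "SLURM_NTASKS")
    if any(key not in env for key in required):
        return None  # not enough information to compare
    expected = expected_ntasks(env)
    actual = int(env["SLURM_NTASKS"])
    if actual != expected:
        return f"SLURM_NTASKS={actual}, but nodes * ntasks-per-node = {expected}"
    return None


# Values copied from the environment dump of the problematic cluster.
problematic = {"SLURM_NNODES": "1", "SLURM_NTASKS_PER_NODE": "2", "SLURM_NTASKS": "1"}
print(ntasks_mismatch(problematic))

# Inside a real job, pass the live environment instead.
print(ntasks_mismatch(os.environ))
```

Running the check at the top of the training script makes a misconfigured cluster visible before DDP silently falls back to a single process.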

TL;DR: it is a Slurm config issue, not related to PL.

zhjohnchan commented 1 year ago

For SLURM users (using the interactive mode), this could be an issue.

Try downgrading pytorch-lightning: pip install pytorch_lightning==1.7.7.

felix-ky commented 1 year ago

got the same problem

try

unset KUBERNETES_PORT

It works for me... I spent one night and one morning on it... TT The same problem is reported in #5254

many thanks, it really works!!!
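For anyone who cannot change the launch shell, the same effect can probably be had from inside the script. This is a minimal sketch, assuming the variable only needs to be gone before the Trainer is created; `drop_kubernetes_port` is a hypothetical helper, and as with the shell command this is only known from this thread to help for single-node runs.

```python
import os


def drop_kubernetes_port(env):
    """Remove KUBERNETES_PORT from the mapping; return True if it was present."""
    return env.pop("KUBERNETES_PORT", None) is not None


# Assumption: call this before constructing the Trainer, so that the
# environment Lightning inspects no longer contains the variable.
removed = drop_kubernetes_port(os.environ)
print("KUBERNETES_PORT removed:", removed)
```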

awaelchli commented 1 year ago

@superhero-7 Were you able to resolve the issue on your end? I couldn't figure out whether this is an issue with Lightning or not.

jasonkena commented 1 year ago

For SLURM users (using the interactive mode), this could be an issue.

Try downgrading pytorch-lightning: pip install pytorch_lightning==1.7.7.

I ran into the same issue. Seeing https://github.com/Lightning-AI/lightning/issues/5225#issuecomment-750032030 and the docs, I solved it by adding os.environ["SLURM_JOB_NAME"]="bash" to my script.
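A minimal sketch of that workaround, assuming the line runs before the Trainer is constructed; `mark_slurm_interactive` is a hypothetical wrapper around the single assignment from the comment above.

```python
import os


def mark_slurm_interactive(env):
    """Set the job name that, per the comment above, makes Lightning treat the job as interactive."""
    env["SLURM_JOB_NAME"] = "bash"


# Assumption: this must happen before the Trainer is created.
mark_slurm_interactive(os.environ)

# ... then build the Trainer and call trainer.fit(...) as usual.
```

This only changes the variable for the current process and its children; the job name shown by the scheduler is unaffected.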

awaelchli commented 1 year ago

@jasonkena That'll work yes. Here is the proper docs link for this. The other users who commented here had an issue with the kubernetes environment variable and I fixed this in the linked PR: https://github.com/Lightning-AI/lightning/pull/18137

Master-cai commented 1 year ago

@awaelchli Thanks for your work! I'm running in a Kubernetes environment, and unset KUBERNETES_PORT works for me when I only use one node. However, I need to use multiple nodes, so I can't unset KUBERNETES_PORT. Which version includes this patch? I'm using pytorch-lightning 1.9.0; is there a quick solution that doesn't require upgrading pytorch-lightning?

Master-cai commented 1 year ago

Replying to myself: following this PR https://github.com/Lightning-AI/lightning/pull/18137, I manually modified the two lines in the source code, and this works for me.

phrasenmaeher commented 8 months ago

For those using SLURM, don't forget to use srun python ... instead of plain python ... to start your job (taking into account the previous settings, of course).