got the same problem
Try
unset KUBERNETES_PORT
It works for me. I spent one night and one morning on it... TT. The same problem is reported here: https://github.com/Lightning-AI/lightning/issues/5254
unset KUBERNETES_PORT
Solved.. Thx
@superhero-7 Unfortunately I don't know how the KUBERNETES_PORT relates to this problem here, or even how it solved it. Does that mean this issue is closed, or are there still some open questions?
Our machines are managed by k8s. I suppose there may be a conflict in the GLOBAL RANK environment settings between the k8s setup and the pytorch_lightning DDP setup?
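For anyone who cannot change the shell environment of the job, the same workaround can be applied from inside the script. This is only a minimal sketch: `drop_env_var` is a hypothetical helper name, and it assumes the variable merely needs to be absent from the process before the Trainer is created. The thread does not confirm why removing it helps.

```python
import os


def drop_env_var(name, environ=os.environ):
    """Remove `name` from the environment mapping; return the old value or None."""
    return environ.pop(name, None)


if __name__ == "__main__":
    # Same effect as `unset KUBERNETES_PORT`, but only for this process
    # and the child processes it starts afterwards.
    previous = drop_env_var("KUBERNETES_PORT")
    print("KUBERNETES_PORT was:", previous)
    # Assumption: the Trainer should be constructed only after this point.
```

Printing the previous value makes it easy to see in the job log whether the variable was set on that machine at all.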
I got the same issue but on a SLURM cluster. I have access to two SLURM clusters. Interestingly, on one cluster PL DDP works fine but on the second one, I experience this issue. Since I don't use K8s, unset KUBERNETES_PORT does not solve the issue.
I guess it would be really hard to reproduce this. Any pointers to what I could try?
You could try printing os.environ at the beginning of the script and comparing it between the two nodes. See if any env variables are set that shouldn't be, or if any are missing. You could also post the printout here if you like (but redact any sensitive information) so we can take a look.
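A rough sketch of how such a comparison could be done without reading two full printouts by eye. The helper names `dump_environ` and `compare_environ` and the default prefixes are my own assumptions, not anything Lightning provides; adjust the prefixes to whatever launcher is in use. Some values (job IDs, ports, hostnames) will legitimately differ between runs.

```python
import json
import os


def dump_environ(path, environ=os.environ):
    """Write the environment to `path` as sorted JSON so two dumps can be diffed."""
    with open(path, "w") as f:
        json.dump(dict(environ), f, indent=2, sort_keys=True)


def compare_environ(env_a, env_b, prefixes=("SLURM_", "PMI_", "CUDA_")):
    """Return (only_in_a, only_in_b, differing) for keys starting with `prefixes`."""
    keys_a = {k for k in env_a if k.startswith(prefixes)}
    keys_b = {k for k in env_b if k.startswith(prefixes)}
    only_a = sorted(keys_a - keys_b)
    only_b = sorted(keys_b - keys_a)
    # Keys present on both sides whose values are not equal.
    differing = {
        k: (env_a[k], env_b[k])
        for k in sorted(keys_a & keys_b)
        if env_a[k] != env_b[k]
    }
    return only_a, only_b, differing
```

Call `dump_environ("env_clusterA.json")` as the first line of the script on each cluster, load both files with `json.load`, and pass the two dicts to `compare_environ`.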
Since you are using SLURM, make sure to follow exactly the instructions here.
@awaelchli Great idea! I believe I followed the instructions correctly. Since I use two different (SLURM) clusters, the sbatch scripts differ slightly, but the rest is the same.
For this test, I use two GPUs on a single node.
First sbatch script for the server on which there are no issues:
#!/usr/bin/env bash
#SBATCH --parsable
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=16G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:2
#SBATCH --output=/some/path/%j.out
module load nccl
source /some/path/conda.sh
conda activate myenv
srun python myscript.py ...
conda deactivate
Second sbatch script for the server where I observe the described issue:
#!/usr/bin/env bash
#SBATCH --parsable
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=16G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=3
#SBATCH --gpus=rtx_3090:2
#SBATCH --output=/some/path/%j.out
module load nccl
source /some/path/conda.sh
conda activate myenv
srun python myscript.py ...
conda deactivate
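To make the "slightly different" part explicit, the #SBATCH lines of the two scripts can be compared mechanically. A small sketch, with `sbatch_directives` and `directive_diff` as hypothetical helper names. For the two scripts above it reports --cpus-per-task (2 vs 3) and the GPU request (--gres=gpu:2 vs --gpus=rtx_3090:2) as the only differing directives; whether either of these is related to the issue is not established here.

```python
def sbatch_directives(script_text):
    """Collect the options of all `#SBATCH` lines in a batch script as a set."""
    return {
        line.strip()[len("#SBATCH"):].strip()
        for line in script_text.splitlines()
        if line.strip().startswith("#SBATCH")
    }


def directive_diff(script_a, script_b):
    """Return (only_in_a, only_in_b) as sorted lists of #SBATCH options."""
    a = sbatch_directives(script_a)
    b = sbatch_directives(script_b)
    return sorted(a - b), sorted(b - a)
```

Read each script with `open(path).read()` and pass the two strings to `directive_diff`.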
Now, the os.environ output on the server where I observe no issues:
{'ACLOCAL_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/share/aclocal',
'BASH_ENV': '/cluster/lmod-8.6.5/lmod/lmod/init/bash',
'BASH_FUNC_ml%%': '() { eval $($LMOD_DIR/ml_cmd "$@")\n}',
'BASH_FUNC_module%%': '() { eval $($LMOD_CMD bash "$@") && eval '
'$(${LMOD_SETTARG_CMD:-:} -s sh)\n'
'}',
'CMAKE_PREFIX_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy',
'CONDA_DEFAULT_ENV': 'rnn-st',
'CONDA_EXE': '/data/user/programs/mambaforge/bin/conda',
'CONDA_MKL_INTERFACE_LAYER_BACKUP': '',
'CONDA_PREFIX': '/data/user/programs/mambaforge/envs/rnn-st',
'CONDA_PROMPT_MODIFIER': '(rnn-st) ',
'CONDA_PYTHON_EXE': '/data/user/programs/mambaforge/bin/python',
'CONDA_SHLVL': '1',
'CPATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/include:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/include',
'CRC32C_SW_MODE': 'auto',
'CUDA_DEVICE_ORDER': 'PCI_BUS_ID',
'CUDA_HOME': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh',
'CUDA_VISIBLE_DEVICES': '0,1',
'DBUS_SESSION_BUS_ADDRESS': 'unix:path=/run/user/891944109/bus',
'DISPLAY': 'u20-login-1:16.0',
'ENVIRONMENT': 'BATCH',
'GPU_DEVICE_ORDINAL': '0,1',
'HOME': '/home/user',
'HOSTNAME': 'u20-computeibmgpu-vesta7',
'LANG': 'C.UTF-8',
'LC_ADDRESS': 'de_CH.UTF-8',
'LC_IDENTIFICATION': 'de_CH.UTF-8',
'LC_MEASUREMENT': 'de_CH.UTF-8',
'LC_MONETARY': 'de_CH.UTF-8',
'LC_NAME': 'de_CH.UTF-8',
'LC_NUMERIC': 'de_CH.UTF-8',
'LC_PAPER': 'de_CH.UTF-8',
'LC_TELEPHONE': 'de_CH.UTF-8',
'LC_TIME': 'de_CH.UTF-8',
'LD_LIBRARY_PATH': '/data/user/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/../../lib64:/cluster/munge-0.5.14/lib:/cluster/slurm-20-11-8-1/lib:/cluster/pmix-4.1.2/lib:/cluster/libevent-2.1.12/lib',
'LESS': '-R',
'LESSCLOSE': '/usr/bin/lesspipe %s %s',
'LESSOPEN': '| /usr/bin/lesspipe %s',
'LIBRARY_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/lib64:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/lib:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/lib',
'LMOD_CMD': '/cluster/lmod-8.6.5/lmod/lmod/libexec/lmod',
'LMOD_COLORIZE': 'yes',
'LMOD_DIR': '/cluster/lmod-8.6.5/lmod/lmod/libexec',
'LMOD_FAMILY_GRES': 'v100',
'LMOD_FAMILY_GRES_VERSION': 'false',
'LMOD_FAMILY_RESOURCE': 'multigpu',
'LMOD_FAMILY_RESOURCE_VERSION': 'false',
'LMOD_FULL_SETTARG_SUPPORT': 'no',
'LMOD_MODULERCFILE': '/apps/etc/modules/.modulerc.lua',
'LMOD_PACKAGE_PATH': '/cluster/lmod-8.6.5',
'LMOD_PKG': '/cluster/lmod-8.6.5/lmod/lmod',
'LMOD_PREPEND_BLOCK': 'normal',
'LMOD_ROOT': '/cluster/lmod-8.6.5/lmod',
'LMOD_SETTARG_CMD': ':',
'LMOD_SETTARG_FULL_SUPPORT': 'no',
'LMOD_VERSION': '8.6.5',
'LMOD_arch': 'x86_64',
'LMOD_sys': 'Linux',
'LOADEDMODULES': 'v100:multigpu:libiconv/1.16-pdflaob:xz/5.2.5-mhrz5su:zlib/1.2.12-j4b6zeg:libxml2/2.9.12-koohqap:cuda/11.4.4-ldlywt5:libnl/3.3.0-qtnpjoa:rdma-core/41.0-hquyri7:nccl/2.11.4-1',
'LOGNAME': 'user',
'LSCOLORS': 'Gxfxcxdxbxegedabagacad',
'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:',
'MANPATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/share/man:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/share/man:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/share/man:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/share/man:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/share/man:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/share/man:/cluster/lmod-8.6.5/lmod/lmod/share/man::/var/cfengine/share/man',
'MKL_INTERFACE_LAYER': 'LP64,GNU',
'MKL_NUM_THREADS': '1',
'MODULEPATH': '/apps/etc/modules/multigpu:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core:/apps/etc/modules/system:/apps/etc/modules/containers:/apps/etc/modules/default:/apps/etc/modules/flavors',
'MODULEPATH_ROOT': '/apps/etc/modules',
'MODULESHOME': '/cluster/lmod-8.6.5/lmod/lmod',
'MOTD_SHOWN': 'pam',
'NUMEXPR_NUM_THREADS': '1',
'OLDPWD': '/home/user',
'OMP_NUM_THREADS': '1',
'OPENBLAS_NUM_THREADS': '1',
'OPENCV_OPENCL_RUNTIME': 'disabled',
'PAGER': 'less',
'PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/bin:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/bin:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/bin:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/bin:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/bin:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/bin:/data/user/programs/mambaforge/envs/rnn-st/bin:/data/user/programs/mambaforge/condabin:/cluster/slurm-20-11-8-1/bin:/cluster/slurm-20-11-8-1/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/var/cfengine/bin:/usr/local/go/bin',
'PKG_CONFIG_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/lib/pkgconfig:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/lib/pkgconfig:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/lib/pkgconfig:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/lib/pkgconfig:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/lib/pkgconfig:/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/lib/pkgconfig',
'PMI_FD': '13',
'PMI_JOBID': '72348.0',
'PMI_RANK': '1',
'PMI_SIZE': '2',
'PWD': '/data/user/code/rnn-st/scripts/slurm',
'PYTORCH_NVML_BASED_CUDA_CHECK': '1',
'QT_QPA_FONTDIR': '/data/user/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/qt/fonts',
'QT_QPA_PLATFORM_PLUGIN_PATH': '/data/user/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/qt/plugins',
'ROCR_VISIBLE_DEVICES': '0,1',
'SACCT_FORMAT': 'jobid%-6,jobname,maxrss,maxvmsize,alloccpus,elapsed%12,state,exitcode%6',
'SALLOC_CONSTRAINT': 'MULTIGPU',
'SBATCH_CONSTRAINT': 'MULTIGPU',
'SHELL': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zsh-5.9-cghjsxkx626zdkbcilxi3tk3nshivvo6/bin/zsh',
'SHLVL': '3',
'SLURMD_NODENAME': 'u20-computeibmgpu-vesta7',
'SLURM_CLUSTER_NAME': 'cluster',
'SLURM_CONF': '/cluster/slurm-20-11-8-1/etc/slurm.conf',
'SLURM_CONSTRAINT': 'MULTIGPU',
'SLURM_CPUS_ON_NODE': '4',
'SLURM_CPUS_PER_TASK': '2',
'SLURM_CPU_BIND': 'quiet,mask_cpu:0x020000020000,0x400000400000',
'SLURM_CPU_BIND_LIST': '0x020000020000,0x400000400000',
'SLURM_CPU_BIND_TYPE': 'mask_cpu:',
'SLURM_CPU_BIND_VERBOSE': 'quiet',
'SLURM_DISTRIBUTION': 'block',
'SLURM_GTIDS': '0,1',
'SLURM_JOBID': '72348',
'SLURM_JOB_ACCOUNT': 'something',
'SLURM_JOB_CPUS_PER_NODE': '4',
'SLURM_JOB_CPUS_PER_NODE_PACK_GROUP_0': '4',
'SLURM_JOB_GID': '891944109',
'SLURM_JOB_GPUS': '5,6',
'SLURM_JOB_ID': '72348',
'SLURM_JOB_NAME': 'train.job',
'SLURM_JOB_NODELIST': 'u20-computeibmgpu-vesta7',
'SLURM_JOB_NUM_NODES': '1',
'SLURM_JOB_PARTITION': 'standard',
'SLURM_JOB_QOS': 'normal',
'SLURM_JOB_UID': '891944109',
'SLURM_JOB_USER': 'user',
'SLURM_LAUNCH_NODE_IPADDR': '10.129.48.36',
'SLURM_LOCALID': '1',
'SLURM_MEM_PER_CPU': '16384',
'SLURM_MPI_TYPE': 'pmi2',
'SLURM_NNODES': '1',
'SLURM_NODEID': '0',
'SLURM_NODELIST': 'u20-computeibmgpu-vesta7',
'SLURM_NODE_ALIASES': '(null)',
'SLURM_NPROCS': '2',
'SLURM_NTASKS': '2',
'SLURM_NTASKS_PER_NODE': '2',
'SLURM_PRIO_PROCESS': '0',
'SLURM_PROCID': '1',
'SLURM_SRUN_COMM_HOST': '10.129.48.36',
'SLURM_SRUN_COMM_PORT': '39247',
'SLURM_STEPID': '0',
'SLURM_STEP_GPUS': '5,6',
'SLURM_STEP_ID': '0',
'SLURM_STEP_LAUNCHER_PORT': '39247',
'SLURM_STEP_NODELIST': 'u20-computeibmgpu-vesta7',
'SLURM_STEP_NUM_NODES': '1',
'SLURM_STEP_NUM_TASKS': '2',
'SLURM_STEP_RESV_PORTS': '12585-12587',
'SLURM_STEP_TASKS_PER_NODE': '2',
'SLURM_SUBMIT_DIR': '/data/user/code/rnn-st/scripts/slurm',
'SLURM_SUBMIT_HOST': 'u20-computeibmgpu-vesta7',
'SLURM_TASKS_PER_NODE': '2',
'SLURM_TASK_PID': '577439',
'SLURM_TOPOLOGY_ADDR': 'u20-computeibmgpu-vesta7',
'SLURM_TOPOLOGY_ADDR_PATTERN': 'node',
'SLURM_UMASK': '0002',
'SLURM_WORKING_CLUSTER': 'cluster:u20-controller.hydra:6817:9216:109',
'SPACK_ROOT': '/apps',
'SQUEUE_FORMAT': '%8i %7u %12T %.3C %.6m %.12M %20e %R',
'SRUN_DEBUG': '3',
'SSH_AGENT_PID': '20309',
'SSH_AUTH_SOCK': '/tmp/ssh-TLK7wup2nTiA/agent.20307',
'SSH_CLIENT': '195.176.113.242 35588 22',
'SSH_CONNECTION': '195.176.113.235 32866 172.16.0.75 22',
'SSH_TTY': '/dev/pts/0',
'TERM': 'tmux-256color',
'TMPDIR': '/data/user/tmp/72348',
'TMUX': '/tmp//tmux-891944109/default,132182,0',
'TMUX_PANE': '%41',
'TMUX_PLUGIN_MANAGER_PATH': '/home/user/.tmux/plugins/',
'USER': 'user',
'VECLIB_MAXIMUM_THREADS': '1',
'WANDB_REQUIRE_SERVICE': 'True',
'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop',
'XDG_RUNTIME_DIR': '/run/user/891944109',
'XDG_SESSION_CLASS': 'user',
'XDG_SESSION_ID': '511',
'XDG_SESSION_TYPE': 'tty',
'ZSH': '/home/user/.myconfig/zsh/oh-my-zsh',
'ZSH_TMUX_CONFIG': '/home/user/.tmux.conf',
'ZSH_TMUX_TERM': 'screen-256color',
'_': '/cluster/slurm-20-11-8-1/bin/srun',
'_CE_CONDA': '',
'_CE_M': '',
'_LMFILES_': '/apps/etc/modules/flavors/v100.lua:/apps/etc/modules/flavors/multigpu.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libiconv/1.16-pdflaob.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/xz/5.2.5-mhrz5su.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/zlib/1.2.12-j4b6zeg.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libxml2/2.9.12-koohqap.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/cuda/11.4.4-ldlywt5.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libnl/3.3.0-qtnpjoa.lua:/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/rdma-core/41.0-hquyri7.lua:/apps/etc/modules/multigpu/nccl/2.11.4-1.lua',
'_ModuleTable001_': 'X01vZHVsZVRhYmxlXyA9IHsKTVR2ZXJzaW9uID0gMywKY19yZWJ1aWxkVGltZSA9IDcyMDAuMCwKY19zaG9ydFRpbWUgPSAwLjM5MTMyNDk5Njk0ODI0LApkZXB0aFQgPSB7fSwKZmFtaWx5ID0gewpncmVzID0gInYxMDAiLApyZXNvdXJjZSA9ICJtdWx0aWdwdSIsCn0sCm1UID0gewpjdWRhID0gewpmbiA9ICIvYXBwcy9zaGFyZS9zcGFjay9sbW9kL2xpbnV4LXVidW50dTIwLjA0LXg4Nl82NC9Db3JlL2N1ZGEvMTEuNC40LWxkbHl3dDUubHVhIiwKZnVsbE5hbWUgPSAiY3VkYS8xMS40LjQtbGRseXd0NSIsCmxvYWRPcmRlciA9IDcsCnByb3BUID0ge30sCnN0YWNrRGVwdGggPSAwLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAiY3VkYS8xMS40LjQtbGRseXd0NSIs',
'_ModuleTable002_': 'CndWID0gIjAwMDAwMDAxMS4wMDAwMDAwMDQuMDAwMDAwMDA0LipsZGx5d3QuMDAwMDAwMDA1Lip6ZmluYWwiLAp9LApsaWJpY29udiA9IHsKZm4gPSAiL2FwcHMvc2hhcmUvc3BhY2svbG1vZC9saW51eC11YnVudHUyMC4wNC14ODZfNjQvQ29yZS9saWJpY29udi8xLjE2LXBkZmxhb2IubHVhIiwKZnVsbE5hbWUgPSAibGliaWNvbnYvMS4xNi1wZGZsYW9iIiwKbG9hZE9yZGVyID0gMywKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJsaWJpY29udi8xLjE2LXBkZmxhb2IiLAp3ViA9ICIwMDAwMDAwMDEuMDAwMDAwMDE2LipkZmxhb2IuKnpmaW5hbCIsCn0sCmxpYm5sID0gewpmbiA9ICIvYXBwcy9zaGFyZS9zcGFjay9s',
'_ModuleTable003_': 'bW9kL2xpbnV4LXVidW50dTIwLjA0LXg4Nl82NC9Db3JlL2xpYm5sLzMuMy4wLXF0bnBqb2EubHVhIiwKZnVsbE5hbWUgPSAibGlibmwvMy4zLjAtcXRucGpvYSIsCmxvYWRPcmRlciA9IDgsCnByb3BUID0ge30sCnN0YWNrRGVwdGggPSAwLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAibGlibmwvMy4zLjAtcXRucGpvYSIsCndWID0gIjAwMDAwMDAwMy4wMDAwMDAwMDMuKnF0bnBqb2EuKnpmaW5hbCIsCn0sCmxpYnhtbDIgPSB7CmZuID0gIi9hcHBzL3NoYXJlL3NwYWNrL2xtb2QvbGludXgtdWJ1bnR1MjAuMDQteDg2XzY0L0NvcmUvbGlieG1sMi8yLjkuMTIta29vaHFhcC5sdWEiLApmdWxsTmFtZSA9ICJsaWJ4bWwyLzIuOS4xMi1rb29ocWFwIiwKbG9hZE9yZGVy',
'_ModuleTable004_': 'ID0gNiwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJsaWJ4bWwyLzIuOS4xMi1rb29ocWFwIiwKd1YgPSAiMDAwMDAwMDAyLjAwMDAwMDAwOS4wMDAwMDAwMTIuKmtvb2hxYXAuKnpmaW5hbCIsCn0sCm11bHRpZ3B1ID0gewpmbiA9ICIvYXBwcy9ldGMvbW9kdWxlcy9mbGF2b3JzL211bHRpZ3B1Lmx1YSIsCmZ1bGxOYW1lID0gIm11bHRpZ3B1IiwKbG9hZE9yZGVyID0gMiwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJtdWx0aWdwdSIsCndWID0gIk0uKnpmaW5hbCIsCn0sCm5jY2wgPSB7CmZuID0gIi9hcHBzL2V0Yy9tb2R1bGVzL211bHRpZ3B1L25jY2wv',
'_ModuleTable005_': 'Mi4xMS40LTEubHVhIiwKZnVsbE5hbWUgPSAibmNjbC8yLjExLjQtMSIsCmxvYWRPcmRlciA9IDEwLApwcm9wVCA9IHt9LApzdGFja0RlcHRoID0gMCwKc3RhdHVzID0gImFjdGl2ZSIsCnVzZXJOYW1lID0gIm5jY2wiLAp3ViA9ICIwMDAwMDAwMDIuMDAwMDAwMDExLjAwMDAwMDAwNC4qemZpbmFsLS4wMDAwMDAwMDEuKnpmaW5hbCIsCn0sClsicmRtYS1jb3JlIl0gPSB7CmZuID0gIi9hcHBzL3NoYXJlL3NwYWNrL2xtb2QvbGludXgtdWJ1bnR1MjAuMDQteDg2XzY0L0NvcmUvcmRtYS1jb3JlLzQxLjAtaHF1eXJpNy5sdWEiLApmdWxsTmFtZSA9ICJyZG1hLWNvcmUvNDEuMC1ocXV5cmk3IiwKbG9hZE9yZGVyID0gOSwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1',
'_ModuleTable006_': 'cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJyZG1hLWNvcmUvNDEuMC1ocXV5cmk3IiwKd1YgPSAiMDAwMDAwMDQxLipocXV5cmkuMDAwMDAwMDA3Lip6ZmluYWwiLAp9LAp2MTAwID0gewpmbiA9ICIvYXBwcy9ldGMvbW9kdWxlcy9mbGF2b3JzL3YxMDAubHVhIiwKZnVsbE5hbWUgPSAidjEwMCIsCmxvYWRPcmRlciA9IDEsCnByb3BUID0ge30sCnN0YWNrRGVwdGggPSAwLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAidjEwMCIsCndWID0gIk0uKnpmaW5hbCIsCn0sCnh6ID0gewpmbiA9ICIvYXBwcy9zaGFyZS9zcGFjay9sbW9kL2xpbnV4LXVidW50dTIwLjA0LXg4Nl82NC9Db3JlL3h6LzUuMi41LW1ocno1c3UubHVhIiwKZnVsbE5hbWUgPSAieHovNS4yLjUtbWhy',
'_ModuleTable007_': 'ejVzdSIsCmxvYWRPcmRlciA9IDQsCnByb3BUID0ge30sCnN0YWNrRGVwdGggPSAwLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAieHovNS4yLjUtbWhyejVzdSIsCndWID0gIjAwMDAwMDAwNS4wMDAwMDAwMDIuMDAwMDAwMDA1LiptaHJ6LjAwMDAwMDAwNS4qc3UuKnpmaW5hbCIsCn0sCnpsaWIgPSB7CmZuID0gIi9hcHBzL3NoYXJlL3NwYWNrL2xtb2QvbGludXgtdWJ1bnR1MjAuMDQteDg2XzY0L0NvcmUvemxpYi8xLjIuMTItajRiNnplZy5sdWEiLApmdWxsTmFtZSA9ICJ6bGliLzEuMi4xMi1qNGI2emVnIiwKbG9hZE9yZGVyID0gNSwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJ6bGliLzEuMi4xMi1q',
'_ModuleTable008_': 'NGI2emVnIiwKd1YgPSAiMDAwMDAwMDAxLjAwMDAwMDAwMi4wMDAwMDAwMTIuKmouMDAwMDAwMDA0LipiLjAwMDAwMDAwNi4qemVnLip6ZmluYWwiLAp9LAp9LAptcGF0aEEgPSB7CiIvYXBwcy9ldGMvbW9kdWxlcy9tdWx0aWdwdSIKLCAiL2FwcHMvc2hhcmUvc3BhY2svbG1vZC9saW51eC11YnVudHUyMC4wNC14ODZfNjQvQ29yZSIKLCAiL2FwcHMvZXRjL21vZHVsZXMvc3lzdGVtIiwgIi9hcHBzL2V0Yy9tb2R1bGVzL2NvbnRhaW5lcnMiCiwgIi9hcHBzL2V0Yy9tb2R1bGVzL2RlZmF1bHQiLCAiL2FwcHMvZXRjL21vZHVsZXMvZmxhdm9ycyIsCn0sCnN5c3RlbUJhc2VNUEFUSCA9ICIvYXBwcy9zaGFyZS9zcGFjay9sbW9kL2xpbnV4LXVidW50dTIwLjA0LXg4Nl82NC9Db3Jl',
'_ModuleTable009_': 'Oi9hcHBzL2V0Yy9tb2R1bGVzL3N5c3RlbTovYXBwcy9ldGMvbW9kdWxlcy9jb250YWluZXJzOi9hcHBzL2V0Yy9tb2R1bGVzL2RlZmF1bHQ6L2FwcHMvZXRjL21vZHVsZXMvZmxhdm9ycyIsCn0K',
'_ModuleTable_Sz_': '9',
'_ZSH_TMUX_FIXED_CONFIG': '/home/user/.myconfig/zsh/oh-my-zsh/plugins/tmux/tmux.extra.conf',
'__LMOD_REF_COUNT_ACLOCAL_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/share/aclocal:2',
'__LMOD_REF_COUNT_CMAKE_PREFIX_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy:2',
'__LMOD_REF_COUNT_CPATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/include:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/include:1',
'__LMOD_REF_COUNT_LIBRARY_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/lib64:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/lib:1;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/lib:1',
'__LMOD_REF_COUNT_LOADEDMODULES': 'v100:1;multigpu:1;libiconv/1.16-pdflaob:1;xz/5.2.5-mhrz5su:1;zlib/1.2.12-j4b6zeg:1;libxml2/2.9.12-koohqap:1;cuda/11.4.4-ldlywt5:1;libnl/3.3.0-qtnpjoa:1;rdma-core/41.0-hquyri7:1;nccl/2.11.4-1:1',
'__LMOD_REF_COUNT_MANPATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/share/man:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/share/man:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/share/man:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/share/man:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/share/man:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/share/man:2;/cluster/lmod-8.6.5/lmod/lmod/share/man:1;/var/cfengine/share/man:1',
'__LMOD_REF_COUNT_MODULEPATH': '/apps/etc/modules/multigpu:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core:1;/apps/etc/modules/system:1;/apps/etc/modules/containers:1;/apps/etc/modules/default:1;/apps/etc/modules/flavors:1',
'__LMOD_REF_COUNT_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/bin:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/bin:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/cuda-11.4.4-ldlywt52dplvbjmb6juifqzm6sofqmbh/bin:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/bin:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/bin:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libiconv-1.16-pdflaobqhkm2yizzmiscm3g2hpqnlowy/bin:2;/data/user/programs/mambaforge/envs/rnn-st/bin:1;/data/user/programs/mambaforge/condabin:1;/cluster/slurm-20-11-8-1/bin:1;/cluster/slurm-20-11-8-1/sbin:1;/usr/local/sbin:1;/usr/local/bin:1;/usr/sbin:1;/usr/bin:1;/sbin:1;/bin:1;/usr/games:1;/usr/local/games:1;/snap/bin:1;/var/cfengine/bin:1;/usr/local/go/bin:3',
'__LMOD_REF_COUNT_PKG_CONFIG_PATH': '/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/nccl-2.11.4-1-54q2cryxtsonmwydk55tehpxqyhcbd5s/lib/pkgconfig:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/rdma-core-41.0-hquyri7sga3ruuajade2uxfyzsh37xpa/lib/pkgconfig:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libnl-3.3.0-qtnpjoa2i5wpvcgaix4v7xwbq2hovqlb/lib/pkgconfig:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/libxml2-2.9.12-koohqapjdhsq3lk6aarjnkpp5m5va2nk/lib/pkgconfig:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/zlib-1.2.12-j4b6zegwseutr44qyr66ym767esbxvjm/lib/pkgconfig:2;/apps/opt/spack/linux-ubuntu20.04-x86_64/gcc-9.3.0/xz-5.2.5-mhrz5subhwqlf35mzkrrigebxkkb7bje/lib/pkgconfig:2',
'__LMOD_REF_COUNT__LMFILES_': '/apps/etc/modules/flavors/v100.lua:1;/apps/etc/modules/flavors/multigpu.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libiconv/1.16-pdflaob.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/xz/5.2.5-mhrz5su.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/zlib/1.2.12-j4b6zeg.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libxml2/2.9.12-koohqap.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/cuda/11.4.4-ldlywt5.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/libnl/3.3.0-qtnpjoa.lua:1;/apps/share/spack/lmod/linux-ubuntu20.04-x86_64/Core/rdma-core/41.0-hquyri7.lua:1;/apps/etc/modules/multigpu/nccl/2.11.4-1.lua:1',
'__LMOD_SET_FPATH': '1',
'ftp_proxy': 'http://wtp.hydra:8080',
'http_proxy': 'http://wtp.hydra:8080',
'https_proxy': 'http://wtp.hydra:8080',
'no_proxy': 'localhost,127.0.0.1,10.129.60.84,.hydra,.int,',
'tmux_version': '3.0'}
The os.environ output on the server where I observe the described issue:
{'BASH_ENV': '/cluster/apps/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod/init/bash',
'BASH_FUNC_ml%%': '() { eval $($LMOD_DIR/ml_cmd "$@")\n}',
'BASH_FUNC_ml()': '() { eval $($LMOD_DIR/ml_cmd "$@")\n}',
'BASH_FUNC_module%%': '() { eval $($LMOD_CMD bash "$@") && eval '
'$(${LMOD_SETTARG_CMD:-:} -s sh)\n'
'}',
'BASH_FUNC_module()': '() { eval $($LMOD_CMD bash "$@") && eval '
'$(${LMOD_SETTARG_CMD:-:} -s sh)\n'
'}',
'CC': '/usr/bin/gcc',
'CMAKE_PREFIX_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3:/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf:/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa',
'CONDA_DEFAULT_ENV': 'rnn-st',
'CONDA_EXE': '/cluster/project/lab/me/programs/mambaforge/bin/conda',
'CONDA_MKL_INTERFACE_LAYER_BACKUP': '',
'CONDA_PREFIX': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st',
'CONDA_PROMPT_MODIFIER': '(rnn-st) ',
'CONDA_PYTHON_EXE': '/cluster/project/lab/me/programs/mambaforge/bin/python',
'CONDA_SHLVL': '1',
'CONSUL_HTTP_ADDR': 'unix:///var/run/consul/http.sock',
'CPATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/include',
'CPP': '/usr/bin/cpp',
'CRC32C_SW_MODE': 'auto',
'CUDA_DEVICE_ORDER': 'PCI_BUS_ID',
'CUDA_VISIBLE_DEVICES': '0,1',
'CXX': '/usr/bin/g++',
'DISPLAY': 'localhost:11.0',
'ENVIRONMENT': 'BATCH',
'F77': '/usr/bin/gfortran',
'F90': '/usr/bin/gfortran',
'FC': '/usr/bin/gfortran',
'HISTCONTROL': 'ignoredups',
'HISTSIZE': '50000',
'HOME': '/cluster/home/user',
'HOSTNAME': 'eu-g4-015',
'I_MPI_PMI_LIBRARY': '/cluster/apps/slurm/lib/libpmi2.so',
'LANG': 'en_US.UTF-8',
'LD_LIBRARY_PATH': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/../../lib64:/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib:/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/lib:/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/lib::',
'LESS': '-R',
'LESSOPEN': '||/usr/bin/lesspipe.sh %s',
'LIBGL_ALWAYS_INDIRECT': '1',
'LIBRARY_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib:/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/lib',
'LMOD_CMD': '/cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod/libexec/lmod',
'LMOD_DIR': '/cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod/libexec',
'LMOD_FAMILY_COMPILER': 'gcc',
'LMOD_FAMILY_COMPILER_VERSION': '4.8.5',
'LMOD_PKG': '/cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod',
'LMOD_SETTARG_FULL_SUPPORT': 'no',
'LMOD_SYSTEM_DEFAULT_MODULES': 'StdEnv:gcc/4.8.5',
'LMOD_VERSION': '7.7.13',
'LMOD_sys': 'Linux',
'LOADEDMODULES': 'StdEnv:gcc/4.8.5:zsh/5.8:tmux/3.2a:proxy:nccl/2.11.4-1',
'LOGNAME': 'user',
'LSCOLORS': 'Gxfxcxdxbxegedabagacad',
'LSF_BINDIR': '/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/bin',
'LSF_ENVDIR': '/cluster/apps/lsf/conf',
'LSF_LIBDIR': '/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/lib',
'LSF_SERVERDIR': '/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/etc',
'LS_COLORS': 'rs=0:di=38;5;27:ln=38;5;51:mh=44;38;5;15:pi=40;38;5;11:so=38;5;13:do=38;5;5:bd=48;5;232;38;5;11:cd=48;5;232;38;5;3:or=48;5;232;38;5;9:mi=05;48;5;232;38;5;15:su=48;5;196;38;5;15:sg=48;5;11;38;5;16:ca=48;5;196;38;5;226:tw=48;5;10;38;5;16:ow=48;5;10;38;5;21:st=48;5;21;38;5;15:ex=38;5;34:*.tar=38;5;9:*.tgz=38;5;9:*.arc=38;5;9:*.arj=38;5;9:*.taz=38;5;9:*.lha=38;5;9:*.lz4=38;5;9:*.lzh=38;5;9:*.lzma=38;5;9:*.tlz=38;5;9:*.txz=38;5;9:*.tzo=38;5;9:*.t7z=38;5;9:*.zip=38;5;9:*.z=38;5;9:*.Z=38;5;9:*.dz=38;5;9:*.gz=38;5;9:*.lrz=38;5;9:*.lz=38;5;9:*.lzo=38;5;9:*.xz=38;5;9:*.bz2=38;5;9:*.bz=38;5;9:*.tbz=38;5;9:*.tbz2=38;5;9:*.tz=38;5;9:*.deb=38;5;9:*.rpm=38;5;9:*.jar=38;5;9:*.war=38;5;9:*.ear=38;5;9:*.sar=38;5;9:*.rar=38;5;9:*.alz=38;5;9:*.ace=38;5;9:*.zoo=38;5;9:*.cpio=38;5;9:*.7z=38;5;9:*.rz=38;5;9:*.cab=38;5;9:*.jpg=38;5;13:*.jpeg=38;5;13:*.gif=38;5;13:*.bmp=38;5;13:*.pbm=38;5;13:*.pgm=38;5;13:*.ppm=38;5;13:*.tga=38;5;13:*.xbm=38;5;13:*.xpm=38;5;13:*.tif=38;5;13:*.tiff=38;5;13:*.png=38;5;13:*.svg=38;5;13:*.svgz=38;5;13:*.mng=38;5;13:*.pcx=38;5;13:*.mov=38;5;13:*.mpg=38;5;13:*.mpeg=38;5;13:*.m2v=38;5;13:*.mkv=38;5;13:*.webm=38;5;13:*.ogm=38;5;13:*.mp4=38;5;13:*.m4v=38;5;13:*.mp4v=38;5;13:*.vob=38;5;13:*.qt=38;5;13:*.nuv=38;5;13:*.wmv=38;5;13:*.asf=38;5;13:*.rm=38;5;13:*.rmvb=38;5;13:*.flc=38;5;13:*.avi=38;5;13:*.fli=38;5;13:*.flv=38;5;13:*.gl=38;5;13:*.dl=38;5;13:*.xcf=38;5;13:*.xwd=38;5;13:*.yuv=38;5;13:*.cgm=38;5;13:*.emf=38;5;13:*.axv=38;5;13:*.anx=38;5;13:*.ogv=38;5;13:*.ogx=38;5;13:*.aac=38;5;45:*.au=38;5;45:*.flac=38;5;45:*.mid=38;5;45:*.midi=38;5;45:*.mka=38;5;45:*.mp3=38;5;45:*.mpc=38;5;45:*.ogg=38;5;45:*.ra=38;5;45:*.wav=38;5;45:*.axa=38;5;45:*.oga=38;5;45:*.spx=38;5;45:*.xspf=38;5;45:',
'MAIL': '/var/spool/mail/user',
'MANPATH': '/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf/share/man:/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/share/man:/cluster/apps/sfos/share/man/man1:/cluster/apps/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod/share/man:/cluster/apps/lsf/10.1/man::',
'MKL_INTERFACE_LAYER': 'LP64,GNU',
'MKL_NUM_THREADS': '1',
'MODULEPATH': '/cluster/apps/lmodules/Compiler/gcc/4.8.5:/cluster/apps/lmodules/Linux:/cluster/apps/lmodules/Core',
'MODULEPATH_ROOT': '/cluster/apps/lmodules',
'MODULESHOME': '/cluster/spack/apps/linux-centos7-x86_64/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod',
'NCCL_ROOT': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3',
'NUMEXPR_NUM_THREADS': '1',
'OLDPWD': '/cluster/project/lab/me/code/rnn-st',
'OMP_NUM_THREADS': '1',
'OPENBLAS_NUM_THREADS': '1',
'OPENCV_OPENCL_RUNTIME': 'disabled',
'PAGER': 'less',
'PATH': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st/bin:/cluster/project/lab/me/programs/mambaforge/condabin:/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf/bin:/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/bin:/cluster/apps/local:/cluster/apps/sfos/bin:/cluster/apps/slurm/bin:/usr/lib64/qt-3.3/bin:/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/cluster/home/user/.local/bin:/cluster/home/user/bin:/usr/local/go/bin:/usr/local/go/bin',
'PKG_CONFIG_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib/pkgconfig',
'PMI_FD': '10',
'PMI_JOBID': '7099018.0',
'PMI_RANK': '0',
'PMI_SIZE': '1',
'PWD': '/cluster/project/lab/me/code/rnn-st/scripts/slurm',
'PYTORCH_NVML_BASED_CUDA_CHECK': '1',
'QTDIR': '/usr/lib64/qt-3.3',
'QTINC': '/usr/lib64/qt-3.3/include',
'QTLIB': '/usr/lib64/qt-3.3/lib',
'QT_GRAPHICSSYSTEM_CHECKED': '1',
'QT_QPA_FONTDIR': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/qt/fonts',
'QT_QPA_PLATFORM_PLUGIN_PATH': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st/lib/python3.9/site-packages/cv2/qt/plugins',
'SCRATCH': '/cluster/scratch/user',
'SHELL': '/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/bin/zsh',
'SHLVL': '3',
'SHOST': 'eu-login-41',
'SLURMD_NODENAME': 'eu-g4-015',
'SLURM_CLUSTER_NAME': 'cluster',
'SLURM_CONF': '/cluster/slurm/adm/etc/slurm.conf',
'SLURM_CPUS_ON_NODE': '3',
'SLURM_CPUS_PER_TASK': '3',
'SLURM_CPU_BIND_LIST': '0x0000000000000000000000000000001C',
'SLURM_CPU_BIND_TYPE': 'mask_cpu:',
'SLURM_CPU_BIND_VERBOSE': 'quiet',
'SLURM_CPU_BIND': 'quiet,mask_cpu:0x0000000000000000000000000000001C',
'SLURM_DISTRIBUTION': 'cyclic',
'SLURM_GPUS': 'nvidia_geforce_rtx_3090:2',
'SLURM_GPUS_ON_NODE': '2',
'SLURM_GTIDS': '0',
'SLURM_JOBID': '7099018',
'SLURM_JOB_ACCOUNT': 'gpuhe/es_scara',
'SLURM_JOB_CPUS_PER_NODE': '3',
'SLURM_JOB_CPUS_PER_NODE_PACK_GROUP_0': '3',
'SLURM_JOB_GID': '476131',
'SLURM_JOB_GPUS': '2,3',
'SLURM_JOB_ID': '7099018',
'SLURM_JOB_NAME': 'train.job',
'SLURM_JOB_NODELIST': 'eu-g4-015',
'SLURM_JOB_NUM_NODES': '1',
'SLURM_JOB_PARTITION': 'gpuhe.120h',
'SLURM_JOB_QOS': 'es_scara/gpuhe',
'SLURM_JOB_UID': '575154',
'SLURM_JOB_USER': 'user',
'SLURM_LAUNCH_NODE_IPADDR': '10.205.100.15',
'SLURM_LOCALID': '0',
'SLURM_MEM_PER_CPU': '32768',
'SLURM_MPI_TYPE': 'pmi2',
'SLURM_NNODES': '1',
'SLURM_NODEID': '0',
'SLURM_NODELIST': 'eu-g4-015',
'SLURM_NODE_ALIASES': '(null)',
'SLURM_NPROCS': '1',
'SLURM_NTASKS': '1',
'SLURM_NTASKS_PER_NODE': '2',
'SLURM_PRIO_PROCESS': '0',
'SLURM_PROCID': '0',
'SLURM_SCRIPT_CONTEXT': 'prolog_task',
'SLURM_SRUN_COMM_HOST': '10.205.100.15',
'SLURM_SRUN_COMM_PORT': '40015',
'SLURM_STEPID': '0',
'SLURM_STEP_GPUS': '2,3',
'SLURM_STEP_ID': '0',
'SLURM_STEP_LAUNCHER_PORT': '40015',
'SLURM_STEP_NODELIST': 'eu-g4-015',
'SLURM_STEP_NUM_NODES': '1',
'SLURM_STEP_NUM_TASKS': '1',
'SLURM_STEP_TASKS_PER_NODE': '1',
'SLURM_SUBMIT_DIR': '/cluster/project/lab/me/code/rnn-st/scripts/slurm',
'SLURM_SUBMIT_HOST': 'eu-login-41',
'SLURM_TASKS_PER_NODE': '1',
'SLURM_TASK_PID': '122949',
'SLURM_TOPOLOGY_ADDR': '.cluster_gpuhe.eu-g4-015',
'SLURM_TOPOLOGY_ADDR_PATTERN': 'switch.switch.node',
'SLURM_UMASK': '0027',
'SLURM_WORKING_CLUSTER': 'cluster:10.205.212.30:6817:9728:109',
'SRUN_DEBUG': '3',
'SSH_AGENT_PID': '12110',
'SSH_AUTH_SOCK': '/tmp/ssh-kaM5RUgK0O4U/agent.12108',
'SSH_CLIENT': '10.6.209.217 33612 22',
'SSH_CONNECTION': '10.6.208.201 54404 129.132.93.116 22',
'SSH_TTY': '/dev/pts/8',
'TERM': 'tmux-256color',
'TERM_PROGRAM': 'tmux',
'TERM_PROGRAM_VERSION': '3.2a',
'TMOUT': '86400',
'TMPDIR': '/scratch/tmp.7099018.user',
'TMUX': '/tmp/tmux-575154/default,12176,0',
'TMUX_PANE': '%15',
'TMUX_PLUGIN_MANAGER_PATH': '/cluster/home/user/.tmux/plugins/',
'TMUX_ROOT': '/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf',
'USER': 'user',
'VECLIB_MAXIMUM_THREADS': '1',
'WANDB_REQUIRE_SERVICE': 'True',
'XDG_RUNTIME_DIR': '/run/user/575154',
'XDG_SESSION_ID': '4695',
'XML_CATALOG_FILES': '/cluster/project/lab/me/programs/mambaforge/envs/rnn-st/etc/xml/catalog',
'ZSH': '/cluster/home/user/.myconfig/zsh/oh-my-zsh',
'ZSH_ROOT': '/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa',
'ZSH_TMUX_CONFIG': '/cluster/home/user/.tmux.conf',
'ZSH_TMUX_TERM': 'screen-256color',
'_': '/cluster/apps/slurm/bin/srun',
'_CE_CONDA': '',
'_CE_M': '',
'_LMFILES_': '/cluster/apps/lmodules/Core/StdEnv.lua:/cluster/apps/lmodules/Core/gcc/4.8.5.lua:/cluster/apps/lmodules/Compiler/gcc/4.8.5/zsh/5.8.lua:/cluster/apps/lmodules/Compiler/gcc/4.8.5/tmux/3.2a.lua:/cluster/apps/lmodules/Core/proxy.lua:/cluster/apps/lmodules/Compiler/gcc/4.8.5/nccl/2.11.4-1.lua',
'_ModuleTable001_': 'X01vZHVsZVRhYmxlXz17WyJNVHZlcnNpb24iXT0zLFsiY19yZWJ1aWxkVGltZSJdPWZhbHNlLFsiY19zaG9ydFRpbWUiXT1mYWxzZSxkZXB0aFQ9e30sZmFtaWx5PXtbImNvbXBpbGVyIl09ImdjYyIsfSxtVD17U3RkRW52PXtbImZuIl09Ii9jbHVzdGVyL2FwcHMvbG1vZHVsZXMvQ29yZS9TdGRFbnYubHVhIixbImZ1bGxOYW1lIl09IlN0ZEVudiIsWyJsb2FkT3JkZXIiXT0xLHByb3BUPXt9LFsic3RhY2tEZXB0aCJdPTAsWyJzdGF0dXMiXT0iYWN0aXZlIixbInVzZXJOYW1lIl09IlN0ZEVudiIsfSxldGhfcHJveHk9e1siZm4iXT0iL2NsdXN0ZXIvYXBwcy9sbW9kdWxlcy9Db3JlL2V0aF9wcm94eS5sdWEiLFsiZnVsbE5hbWUiXT0iZXRoX3Byb3h5IixbImxvYWRPcmRlciJd',
'_ModuleTable002_': 'PTUscHJvcFQ9e30sWyJzdGFja0RlcHRoIl09MCxbInN0YXR1cyJdPSJhY3RpdmUiLFsidXNlck5hbWUiXT0iZXRoX3Byb3h5Iix9LGdjYz17WyJmbiJdPSIvY2x1c3Rlci9hcHBzL2xtb2R1bGVzL0NvcmUvZ2NjLzQuOC41Lmx1YSIsWyJmdWxsTmFtZSJdPSJnY2MvNC44LjUiLFsibG9hZE9yZGVyIl09Mixwcm9wVD17fSxbInN0YWNrRGVwdGgiXT0wLFsic3RhdHVzIl09ImFjdGl2ZSIsWyJ1c2VyTmFtZSJdPSJnY2MvNC44LjUiLH0sbmNjbD17WyJmbiJdPSIvY2x1c3Rlci9hcHBzL2xtb2R1bGVzL0NvbXBpbGVyL2djYy80LjguNS9uY2NsLzIuMTEuNC0xLmx1YSIsWyJmdWxsTmFtZSJdPSJuY2NsLzIuMTEuNC0xIixbImxvYWRPcmRlciJdPTYscHJvcFQ9e30sWyJzdGFja0Rl',
'_ModuleTable003_': 'cHRoIl09MCxbInN0YXR1cyJdPSJhY3RpdmUiLFsidXNlck5hbWUiXT0ibmNjbCIsfSx0bXV4PXtbImZuIl09Ii9jbHVzdGVyL2FwcHMvbG1vZHVsZXMvQ29tcGlsZXIvZ2NjLzQuOC41L3RtdXgvMy4yYS5sdWEiLFsiZnVsbE5hbWUiXT0idG11eC8zLjJhIixbImxvYWRPcmRlciJdPTQscHJvcFQ9e30sWyJzdGFja0RlcHRoIl09MCxbInN0YXR1cyJdPSJhY3RpdmUiLFsidXNlck5hbWUiXT0idG11eCIsfSx6c2g9e1siZm4iXT0iL2NsdXN0ZXIvYXBwcy9sbW9kdWxlcy9Db21waWxlci9nY2MvNC44LjUvenNoLzUuOC5sdWEiLFsiZnVsbE5hbWUiXT0ienNoLzUuOCIsWyJsb2FkT3JkZXIiXT0zLHByb3BUPXt9LFsic3RhY2tEZXB0aCJdPTAsWyJzdGF0dXMiXT0iYWN0aXZlIixb',
'_ModuleTable004_': 'InVzZXJOYW1lIl09InpzaCIsfSx9LG1wYXRoQT17Ii9jbHVzdGVyL2FwcHMvbG1vZHVsZXMvQ29tcGlsZXIvZ2NjLzQuOC41IiwiL2NsdXN0ZXIvYXBwcy9sbW9kdWxlcy9MaW51eCIsIi9jbHVzdGVyL2FwcHMvbG1vZHVsZXMvQ29yZSIsfSxbInN5c3RlbUJhc2VNUEFUSCJdPSIvY2x1c3Rlci9hcHBzL2xtb2R1bGVzL0xpbnV4Oi9jbHVzdGVyL2FwcHMvbG1vZHVsZXMvQ29yZSIsfQ==',
'_ModuleTable_Sz_': '4',
'_ZSH_TMUX_FIXED_CONFIG': '/cluster/home/user/.myconfig/zsh/oh-my-zsh/plugins/tmux/tmux.extra.conf',
'__Init_Default_Modules': '1',
'__LMOD_REF_COUNT_CMAKE_PREFIX_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3:1;/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf:1;/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa:1',
'__LMOD_REF_COUNT_CPATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/include:1',
'__LMOD_REF_COUNT_LD_LIBRARY_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib:1;/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/lib:1;/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/lib:1',
'__LMOD_REF_COUNT_LIBRARY_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib:1;/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/lib:1',
'__LMOD_REF_COUNT_LOADEDMODULES': 'StdEnv:1;gcc/4.8.5:1;zsh/5.8:1;tmux/3.2a:1;proxy:1;nccl/2.11.4-1:1',
'__LMOD_REF_COUNT_MANPATH': '/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf/share/man:1;/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/share/man:1;/cluster/apps/sfos/share/man/man1:1;/cluster/apps/gcc-4.8.5/lmod-7.7.13-epk3osxslctnrx6gabjmwtudqm2vfbxf/lmod/lmod/share/man:1;/cluster/apps/lsf/10.1/man:1',
'__LMOD_REF_COUNT_PATH': '/cluster/apps/gcc-4.8.5/tmux-3.2a-z4vqjspgq6xq6k52gn4iuqzcyg6xtmvf/bin:1;/cluster/apps/gcc-4.8.5/zsh-5.8-tugg7jmlx5arwoxf65qprdyky2m6hcpa/bin:1;/cluster/apps/local:2;/cluster/apps/sfos/bin:1;/cluster/apps/slurm/bin:1;/usr/lib64/qt-3.3/bin:1;/cluster/apps/lsf/10.1/linux2.6-glibc2.3-x86_64/bin:1;/usr/local/bin:1;/usr/bin:1;/usr/local/sbin:1;/usr/sbin:1;/cluster/home/user/.local/bin:1;/cluster/home/user/bin:1',
'__LMOD_REF_COUNT_PKG_CONFIG_PATH': '/cluster/apps/gcc-4.8.5/nccl-2.11.4-1-35j5ir7eacqglua4gg7cxyzzlbi4n2w3/lib/pkgconfig:1',
'__LMOD_REF_COUNT__LMFILES_': '/cluster/apps/lmodules/Core/StdEnv.lua:1;/cluster/apps/lmodules/Core/gcc/4.8.5.lua:1;/cluster/apps/lmodules/Compiler/gcc/4.8.5/zsh/5.8.lua:1;/cluster/apps/lmodules/Compiler/gcc/4.8.5/tmux/3.2a.lua:1;/cluster/apps/lmodules/Core/proxy.lua:1;/cluster/apps/lmodules/Compiler/gcc/4.8.5/nccl/2.11.4-1.lua:1',
'ftp_proxy': 'http://blabla:3128',
'http_proxy': 'http://blabla:3128',
'https_proxy': 'http://blabla:3128',
'no_proxy': 'api.wandb.ai,app.neptune.ai',
'tmux_version': '3.2',
'xml_catalog_files_libxslt': ''}
From a quick scan I see that SLURM_NTASKS is 2 (as expected) on the working server and 1 on the problematic server. I don't know yet why this is the case, because the only place I set the task count to 2 is the sbatch script. Just my first observation so far.
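For anyone repeating this comparison, here is a minimal sketch of how the two printouts could be diffed instead of scanned by eye. The helper name `diff_env` and the trimmed dictionaries are illustrative only; in practice you would pass in the full `os.environ` dumps from each cluster. The working cluster's values other than SLURM_NTASKS are assumed from the sbatch script rather than taken from a posted dump.

```python
def diff_env(env_a, env_b, prefix="SLURM_"):
    """Return {key: (value_a, value_b)} for prefixed keys whose values differ."""
    keys = {k for k in set(env_a) | set(env_b) if k.startswith(prefix)}
    return {
        k: (env_a.get(k), env_b.get(k))
        for k in sorted(keys)
        if env_a.get(k) != env_b.get(k)
    }


# Trimmed, illustrative stand-ins for the two os.environ printouts.
working = {"SLURM_NTASKS": "2", "SLURM_NTASKS_PER_NODE": "2", "SLURM_NNODES": "1"}
problematic = {"SLURM_NTASKS": "1", "SLURM_NTASKS_PER_NODE": "2", "SLURM_NNODES": "1"}

for key, (a, b) in diff_env(working, problematic).items():
    print(f"{key}: working={a!r} problematic={b!r}")
```

A key missing on one side shows up with `None` for that side, so variables that exist on only one cluster are reported too.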
I found a workaround. Strangely, when I additionally set --ntasks=NUM_GPUS, DDP works as expected. In this case, on the problematic cluster I get SLURM_NTASKS=NUM_GPUS and the script runs correctly. The augmented sbatch script is:
#!/usr/bin/env bash
#SBATCH --parsable
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=16G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=3
#SBATCH --gpus=rtx_3090:2
#SBATCH --output=/some/path/%j.out
module load nccl
source /some/path/conda.sh
conda activate myenv
srun python myscript.py ...
conda deactivate
No idea why ntasks-per-node is not sufficient.
Got a response from the cluster support. Apparently they still need to configure SLURM so that #tasks = #nodes * #ntasks-per-node.
TL;DR: it is a SLURM configuration issue, not related to PL.
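Until the cluster is reconfigured, a small guard at the top of the training script can catch this mismatch early. This is only a sketch: the function name `check_ntasks` is made up for illustration, and it assumes that SLURM_NNODES, SLURM_NTASKS_PER_NODE and SLURM_NTASKS are all plain integers, as they are in the dump above.

```python
import os


def check_ntasks(env=None):
    """Return (expected, actual) task counts, or None if they cannot be derived."""
    env = os.environ if env is None else env
    try:
        nodes = int(env["SLURM_NNODES"])
        per_node = int(env["SLURM_NTASKS_PER_NODE"])
        actual = int(env["SLURM_NTASKS"])
    except (KeyError, ValueError):
        # Not inside a SLURM job, or the variables are not plain integers.
        return None
    return nodes * per_node, actual


# Values copied from the problematic cluster's dump above.
problematic = {"SLURM_NNODES": "1", "SLURM_NTASKS_PER_NODE": "2", "SLURM_NTASKS": "1"}

result = check_ntasks(problematic)
if result is not None and result[0] != result[1]:
    print(f"expected {result[0]} tasks, but SLURM_NTASKS={result[1]}")
```

Calling `check_ntasks()` without arguments reads the live environment, so the same check can run inside the job itself.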
For SLURM users (using the interactive mode), this could be an issue.
Try downgrading pytorch-lightning: pip install pytorch_lightning==1.7.7
got the same problem
try
unset KUBERNETES_PORT
it works for me... I spend one night and one morning on it...TT There is a same problem link: #5254
many thanks, it really works!!!
@superhero-7 Were you able to resolve the issue on your end? I couldn't figure out whether this is an issue with Lightning or not.
I ran into the same issue. Following https://github.com/Lightning-AI/lightning/issues/5225#issuecomment-750032030 and the docs, I solved it by adding os.environ["SLURM_JOB_NAME"] = "bash" to my script.
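A minimal sketch of where that line could go. As far as I understand, Lightning's SLURM detection treats a job named "bash" as an interactive session and then launches the processes itself, which is why the override helps. The helper name `mark_slurm_job_interactive` and the choice to apply the override only when SLURM_JOB_ID is present are my own assumptions, not part of the original comment.

```python
import os


def mark_slurm_job_interactive(env=None):
    """Set SLURM_JOB_NAME to "bash" when running inside a SLURM allocation."""
    env = os.environ if env is None else env
    if "SLURM_JOB_ID" in env:
        # Assumption: only override inside a SLURM job; leave other setups alone.
        env["SLURM_JOB_NAME"] = "bash"
    return env


# Demonstration on a sample dict rather than the live environment.
sample = {"SLURM_JOB_ID": "7099018", "SLURM_JOB_NAME": "train.job"}
mark_slurm_job_interactive(sample)
print(sample["SLURM_JOB_NAME"])
```

In a real script you would call `mark_slurm_job_interactive()` with no argument at the very top, before the Trainer is created.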
@jasonkena That'll work yes. Here is the proper docs link for this. The other users who commented here had an issue with the kubernetes environment variable and I fixed this in the linked PR: https://github.com/Lightning-AI/lightning/pull/18137
@awaelchli Thanks for your work! I'm using a Kubernetes environment, and unset KUBERNETES_PORT works for me when I use only one node. However, I need to use multiple nodes, so I can't unset KUBERNETES_PORT.
Which version includes this patch? I'm using pytorch-lightning 1.9.0; is there a quick solution that doesn't require upgrading pytorch-lightning?
Reply to myself: following this PR https://github.com/Lightning-AI/lightning/pull/18137, I manually modified the two lines in the source code, and this works for me.
For those using SLURM, don't forget to use srun python ... instead of plain python ... to start your job (taking into account the previous settings, of course).
Bug description
I train the model like this; my code is below:
It works fine and doesn't throw any error. But it doesn't run on 8 GPUs; instead, it only runs on the first GPU, and only one MEMBER is initialized, like this:
I am confused, because the progress bar looks completely right. The length of my dataset is 1198099, and the progress bar shows 37457 steps per epoch. I set the batch size to 4, so 4 * 8 * 37457 = 1198624, which is almost equal to 1198099.
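As a quick sanity check of that arithmetic, the sketch below compares how many samples per epoch the 37457 steps imply for 8 processes versus a single process. The function name is illustrative, and it assumes the progress bar counts only training batches.

```python
batch_size = 4         # per-GPU batch size from the report
steps_shown = 37457    # steps per epoch shown in the progress bar
dataset_len = 1198099  # dataset length from the report


def samples_per_epoch(steps, per_device_batch, world_size):
    """Samples covered in one epoch for a given number of processes."""
    return steps * per_device_batch * world_size


with_8 = samples_per_epoch(steps_shown, batch_size, 8)
with_1 = samples_per_epoch(steps_shown, batch_size, 1)
print(with_8, with_1, dataset_len)
```

Only the 8-process figure lands near the dataset length, so the step count is consistent with the data being split 8 ways even though a single GPU shows activity.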
But the problem is that nvidia-smi shows only the first GPU running, as below:
I don't understand why this happens. I hope someone can help me, thanks a lot!
How to reproduce the bug
No response
Error messages and logs
Environment
Current environment
```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0): I tried the latest and 1.7.3 and got the same problem
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10): 1.12.1 cuda 11.3
#- Python version (e.g., 3.9): 3.8.5
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration: A100*8
#- How you installed Lightning(`conda`, `pip`, source): pip install pytorch_lightning==1.7.3
#- Running environment of LightningApp (e.g. local, cloud):
```
More info
No response
cc @justusschock @awaelchli