I'm using the following Slurm file to run the training bash script run_128_B.sh on a remote GPU cluster. The same issue also occurs with the 256 x 256 configuration.
#!/bin/bash
#SBATCH -J 128_magvit_train_0 # Job name
#SBATCH -o /128_tokenizer/logs/128_magvit_train_0%j.out # output file (%j expands to jobID)
#SBATCH -e /128_tokenizer/logs/128_magvit_train_0%j.err # error log file (%j expands to jobID)
#SBATCH -N 1 # Total number of nodes requested
#SBATCH --ntasks-per-node=2 # Number of tasks (processes) per node
#SBATCH --get-user-env # retrieve the users login environment
#SBATCH --mem=200G # server memory requested (per node), increased for 13b model
#SBATCH -t 100:00:00 # Time limit (hh:mm:ss)
#SBATCH --partition=gpu # Request partition
#SBATCH --gres=gpu:a5000:2 # Type/number of GPUs needed
bash run_128_B.sh
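For reference, a quick sanity check of what the allocation actually exposes (a sketch; these lines are not in my original batch script) would be something like:

echo "SLURM_NTASKS=${SLURM_NTASKS}"                       # expect 2 with -N 1 and --ntasks-per-node=2
echo "SLURM_JOB_NODELIST=${SLURM_JOB_NODELIST}"           # node(s) assigned to the job
nvidia-smi --query-gpu=index,name --format=csv,noheader   # expect both A5000s to be listed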
However, I get the following error after 30 minutes of training:
File "/home/user/mambaforge/envs/Open_magvit/lib/python3.12/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
return TCPStore(
^^^^^^^^^
torch.distributed.DistStoreError: Timed out after 1801 seconds waiting for clients. 1/2 clients joined.
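For context, the rendezvous that times out here is the c10d TCPStore: with the default env:// rendezvous, rank 0 hosts the store at MASTER_ADDR:MASTER_PORT and waits until WORLD_SIZE clients have connected, which matches the "1/2 clients joined" message. A sketch of how those variables are typically exported in the batch script before the launch line (an assumption; run_128_B.sh may already set them, and the port value is just the common PyTorch default):

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)   # first node in the allocation
export MASTER_PORT=29500   # any free port; 29500 is the usual PyTorch default
export WORLD_SIZE=2        # one process per requested GPU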
Has anybody else encountered this error when running the bash scripts? Thanks!