I'm using the following Slurm file to run the training bash script run_128_B.sh on a remote GPU cluster. The same issue also occurs with the 256 x 256 configuration.
#!/bin/bash
#SBATCH -J 128_magvit_train_0 # Job name
#SBATCH -o /128_tokenizer/logs/128_magvit_train_0%j.out # output file (%j expands to jobID)
#SBATCH -e /128_tokenizer/logs/128_magvit_train_0%j.err # error log file (%j expands to jobID)
#SBATCH -N 1 # Total number of nodes requested
#SBATCH --ntasks-per-node=2 # Number of tasks (processes) per node
#SBATCH --get-user-env # retrieve the users login environment
#SBATCH --mem=200G # server memory requested (per node), increased for 13b model
#SBATCH -t 100:00:00 # Time limit (hh:mm:ss)
#SBATCH --partition=gpu # Request partition
#SBATCH --gres=gpu:a5000:2 # Type/number of GPUs needed
bash run_128_B.sh
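For reference, a quick sanity check of what the allocation actually exposes (a sketch; these lines are not in my original batch script) would be something like:

echo "SLURM_NTASKS=${SLURM_NTASKS}"                       # expect 2 with -N 1 and --ntasks-per-node=2
echo "SLURM_JOB_NODELIST=${SLURM_JOB_NODELIST}"           # node(s) assigned to the job
nvidia-smi --query-gpu=index,name --format=csv,noheader   # expect both A5000s to be listed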
However, I get the following error after 30 minutes of training:
File "/home/user/mambaforge/envs/Open_magvit/lib/python3.12/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
return TCPStore(
^^^^^^^^^
torch.distributed.DistStoreError: Timed out after 1801 seconds waiting for clients. 1/2 clients joined.
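For context, the rendezvous that times out here is the c10d TCPStore: with the default env:// rendezvous, rank 0 hosts the store at MASTER_ADDR:MASTER_PORT and waits until WORLD_SIZE clients have connected, which matches the "1/2 clients joined" message. A sketch of how those variables are typically exported in the batch script before the launch line (an assumption; run_128_B.sh may already set them, and the port value is just the common PyTorch default):

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)   # first node in the allocation
export MASTER_PORT=29500   # any free port; 29500 is the usual PyTorch default
export WORLD_SIZE=2        # one process per requested GPU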
Has anybody else encountered this error when running the bash scripts? Thanks!