autonomousvision / transfuser

[PAMI'23] TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving; [CVPR'21] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
MIT License

KeyError: 'RANK', while training with --parallel_training 1 #210

Closed SRajasekar333 closed 5 months ago

SRajasekar333 commented 5 months ago

Hi, I was able to successfully run train.py with --parallel_training 0 and obtained the output files. But when I try to run with --parallel_training 1, after allocating GPU resources with srun --nodes=1 --gres=gpu:1 --time=01:00:00 --pty bash, I get the error below. Please help me resolve this.

[screenshot: traceback ending in KeyError: 'RANK']

Thanks in advance.

Kait0 commented 5 months ago

When using the parallel_training option, you need to launch the script with torchrun; see here.
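For context, the KeyError comes from the distributed environment variables that torchrun sets for each worker process. Below is a minimal sketch of the kind of lookup that fails when the script is started with plain python instead of torchrun (the exact variable handling in train.py may differ):

```python
import os

# torchrun exports these variables for every worker process; launching the
# script with plain `python train.py --parallel_training 1` leaves them unset,
# so the first lookup below raises KeyError: 'RANK'.
rank = int(os.environ['RANK'])              # global rank of this worker
local_rank = int(os.environ['LOCAL_RANK'])  # GPU index on this node
world_size = int(os.environ['WORLD_SIZE'])  # total number of workers
```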

SRajasekar333 commented 5 months ago

Thanks, it works now. Could you please tell me what the optimal hyperparameters would be to train the model for good results?

As a trial, I trained a model (trial_model) on the complete 210 GB dataset by running:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 OMP_NUM_THREADS=24 OPENBLAS_NUM_THREADS=1 torchrun --nnodes=1 --nproc_per_node=7 --max_restarts=0 --rdzv_id=1234576890 --rdzv_backend=c10d train.py --logdir log --root_dir /transfuser/data --parallel_training 1 --epochs 10 --batch_size 32 --id transfuser

Training this trial model took approximately 4 hours to complete. Given the resources mentioned above, what would be the best parameters to train the model? (Would --epochs 30 --batch_size 12 be fine?)

Thanks in advance.

Kait0 commented 5 months ago

With 7 GPUs you can probably just use the default parameters. If you want to optimize for speed, you can train for only 31 epochs and increase the batch size to maximize GPU utilization. (If you increase the batch size a lot, you may also want to increase the learning rate, since a larger batch size means fewer gradient steps.)
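As a rough illustration of that last point, a common rule of thumb (not a setting taken from this repo) is to scale the learning rate linearly with the batch size; the numbers below are placeholders:

```python
# Hypothetical linear learning-rate scaling: keep the update size per sample
# roughly constant when the global batch size grows.
base_lr = 1e-4        # placeholder for the default learning rate
base_batch_size = 12  # placeholder for the default batch size
new_batch_size = 32   # the larger batch size you plan to use

scaled_lr = base_lr * (new_batch_size / base_batch_size)
print(f"suggested lr for batch size {new_batch_size}: {scaled_lr:.2e}")
```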