I believe the script already supports DistributedDataParallel, and we used 8 A100 GPUs for training, as mentioned in the paper. Have you set CUDA_VISIBLE_DEVICES and --nproc_per_node to the correct GPU environment in your training script?
See the following script for an example:
CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --master_port 10000 --nproc_per_node 4 train_tiktok.py \
This will use GPUs 4, 5, 6, and 7 and run 4 processes (--nproc_per_node 4) for training.
In a similar way, if you'd like to use 8 GPUs:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --master_port 10000 --nproc_per_node 8 train_tiktok.py \
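For clarity on how the two settings interact: torchrun sets a LOCAL_RANK environment variable (0 to nproc_per_node-1) for each process it spawns, and PyTorch only sees the GPUs listed in CUDA_VISIBLE_DEVICES, renumbered from 0. The sketch below (the helper `gpu_for_process` is hypothetical, just for illustration, and not part of this repo) shows which physical GPU each process ends up on:

```python
import os

def gpu_for_process(cuda_visible_devices: str, local_rank: int) -> int:
    """Map a torchrun process to the physical GPU it will use.

    torchrun sets LOCAL_RANK for each spawned process; PyTorch sees
    only the GPUs in CUDA_VISIBLE_DEVICES, renumbered from 0. So the
    physical GPU is the local_rank-th entry of that list.
    """
    visible = [int(g) for g in cuda_visible_devices.split(",")]
    return visible[local_rank]

# With CUDA_VISIBLE_DEVICES=4,5,6,7 and --nproc_per_node 4,
# the 4 processes land on physical GPUs 4, 5, 6, 7:
print([gpu_for_process("4,5,6,7", r) for r in range(4)])  # [4, 5, 6, 7]
```

So the number of entries in CUDA_VISIBLE_DEVICES should match --nproc_per_node, one process per GPU.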