**Open** · geonoon opened this issue 5 months ago
@geonoon In fact, we have considered two cases for distributed pretraining: SLURM and a regular multi-GPU server. However, I'm not sure whether the main_pretrain.py of MTP can run directly on a server, so you may want to refer to this and revise the code related to distributed pretraining.
Here is a command example:
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 \
    --nnodes=1 --master_port=10001 --master_addr=[server ip] main_pretrain.py
```
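For reference, here is a minimal sketch of how the distributed setup could handle both cases (server launch via torch.distributed.launch/torchrun, and SLURM) and fall back to a single process for one GPU or CPU. The function name `init_distributed_mode` and the `args` fields are illustrative and may not match the exact names used in MTP's main_pretrain.py:

```python
import os
import torch
import torch.distributed as dist

def init_distributed_mode(args):
    # Case 1: launched with torch.distributed.launch / torchrun on a server,
    # which sets RANK, WORLD_SIZE and LOCAL_RANK in the environment.
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        args.rank = int(os.environ["RANK"])
        args.world_size = int(os.environ["WORLD_SIZE"])
        args.local_rank = int(os.environ["LOCAL_RANK"])
    # Case 2: launched under SLURM, which exposes SLURM_PROCID / SLURM_NTASKS.
    elif "SLURM_PROCID" in os.environ:
        args.rank = int(os.environ["SLURM_PROCID"])
        args.world_size = int(os.environ["SLURM_NTASKS"])
        args.local_rank = args.rank % torch.cuda.device_count()
    else:
        # Single-process fallback (one GPU or CPU): skip distributed init.
        print("Not using distributed mode")
        args.distributed = False
        return

    args.distributed = True
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(
        backend="nccl",
        init_method="env://",  # reads MASTER_ADDR / MASTER_PORT set by the launcher
        world_size=args.world_size,
        rank=args.rank,
    )
    dist.barrier()
```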
Thank you for this amazing project.
I tried to run pretraining on a single machine, either with one NVIDIA A100 GPU or just on the CPU, but it did not work.
It seems the script main_pretrain.py needs to be modified in some way.
Could you provide detailed guidance on this?
Thanks in advance.