cannot use with distributed pytorch

ucalyptus2 commented 1 year ago

The id I set in the config ("id name i gave") is reported as already existing once one process has created it, so all the remaining workers stop.

ucalyptus2 commented 1 year ago

@XJay18

XJay18 commented 1 year ago

Hi, if you are using multiple GPUs, you should set the --nproc_per_node parameter in the training scripts to the number of GPUs. For example, to train with 2 GPUs:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --master_port 12345 train.py --config path/to/config.yml

Also, please make sure the id entry defined in the config YAML file is unique for each experiment. We create a folder named ${model_name}/${id} for each experiment, so if the id is duplicated (and ${model_name} is unchanged), the program cannot create the logging folder. In that case, either delete the previous logging folder with the same id or use a different id so that a new logging folder can be created.
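
To illustrate the folder rule above, here is a minimal sketch of how a ${model_name}/${id} logging folder could be created so that only one worker touches the filesystem. This is not RECCE's actual code: the function names, the `root` argument, and the idea of passing the rank explicitly are all assumptions made for illustration.

```python
import os
import tempfile


def log_dir_for(root, model_name, exp_id):
    """Build the per-experiment folder path <root>/<model_name>/<exp_id>."""
    return os.path.join(root, model_name, exp_id)


def create_log_dir(root, model_name, exp_id, rank=0):
    """Create the logging folder on rank 0 only (hypothetical helper).

    Rank 0 raises FileExistsError if the id was already used by an
    earlier run; every other rank just returns the path untouched.
    """
    path = log_dir_for(root, model_name, exp_id)
    if rank == 0:
        # exist_ok=False makes a duplicated id fail loudly on rank 0.
        os.makedirs(path, exist_ok=False)
    return path


if __name__ == "__main__":
    # Small demo in a throwaway directory; the names are placeholders.
    demo_root = tempfile.mkdtemp()
    for rank in range(2):
        print(rank, create_log_dir(demo_root, "my_model", "exp_a", rank=rank))
```

In a real multi-GPU job the other ranks would also need to wait until rank 0 has created the folder, for example with torch.distributed.barrier(), before writing any logs into it.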

VISION-SJTU / RECCE

cannot use with distributed pytorch #8