ucalyptus2 closed this issue 1 year ago.
@XJay18
Hi,
If you are using multiple GPUs, you should modify the `--nproc_per_node` parameter in the training scripts. For example, to train with 2 GPUs:

`CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --master_port 12345 train.py --config path/to/config.yml`
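For context, a script launched this way is started once per GPU. Below is a minimal sketch of the distributed setup such a script usually performs; the repo's actual train.py may differ, and the argument names here are only illustrative:

```python
# Sketch of the setup a script launched via torch.distributed.launch
# typically performs; illustrative only, not this repo's actual train.py.
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by the launcher
parser.add_argument("--config", type=str, required=True)  # path to the yaml config
args = parser.parse_args()

# The launcher exports MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE, so the
# default "env://" init method works out of the box.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")
```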
Meanwhile, please make sure the entry `id` defined in the config yaml file is unique for each experiment. We create a unique folder named `${model_name}/${id}`, so if the `id` is duplicated (given that `${model_name}` is not changed), the program cannot create the logging folder. If that is the case, you should either delete the previous logging folder with the same `id`, or use another `id` to create a new logging folder.
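As a rough illustration of why a duplicated `id` aborts the run (the actual logging code in this repo may differ), the failure is essentially what happens when a directory is created at a path that already exists:

```python
import os

# Hypothetical values; in practice they come from the yaml config.
model_name = "my_model"
exp_id = "run_001"

log_dir = os.path.join(model_name, exp_id)  # ${model_name}/${id}
# Without exist_ok=True, this raises FileExistsError when the same
# ${model_name}/${id} folder was already created by an earlier
# experiment, which is why a duplicated id prevents the logging
# folder from being created.
os.makedirs(log_dir)
```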
The `id` "id name i gave" already exists because one process created it, so all the remaining workers stop.
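For what it's worth, this kind of race (every worker tries to create the same `${model_name}/${id}` folder and all but the first fail) is commonly avoided by letting only rank 0 create the folder. A sketch of that workaround, not this repo's actual logic, is below:

```python
import os

import torch.distributed as dist


def make_log_dir(log_dir: str) -> None:
    # Sketch of a possible workaround, not this repo's actual code:
    # only rank 0 creates ${model_name}/${id}; the other workers wait
    # at a barrier instead of racing to create the same folder.
    if not dist.is_initialized() or dist.get_rank() == 0:
        os.makedirs(log_dir, exist_ok=True)
    if dist.is_initialized():
        dist.barrier()
```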