Gengzigang / PCT

This is an official implementation of our CVPR 2023 paper "Human Pose as Compositional Tokens" (https://arxiv.org/pdf/2303.11638.pdf)
MIT License

Distributed training #3

Closed cndivecat closed 11 months ago

cndivecat commented 1 year ago

Can this model only be trained in distributed mode? How can I use the model without distributed training?

gmk11 commented 1 year ago

Well, you have to set the parameter `distributed = False`. In your train and test files you have these lines:

 """""""init distributed env first, since logger depends on the dist info.
 if args.launcher == 'none':
     distributed = False
else:
     distributed = True
     init_dist(args.launcher, **cfg.dist_params)
     re-set gpu_ids with distributed training mode
     _, world_size = get_dist_info()
     cfg.gpu_ids = range(world_size)""""""""

Replace them with just `distributed = False`.

Also set `num_workers_gpu = 0` in your config file, and don't forget to use `num_gpu = 1` by replacing the training command `./tools/dist_train.sh configs/pct[base/large/huge]_tokenizer.py 8` with `./tools/dist_train.sh configs/pct[base/large/huge]_tokenizer.py 1`.

It worked for me.
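
For reference, here is a minimal sketch of what that block can look like after the edit. It assumes the mmpose-style `tools/train.py` quoted above; it is a sketch of the change, not a verbatim patch of this repository.

```python
# init distributed env first, since logger depends on the dist info.
distributed = False   # force non-distributed training; init_dist() is never called
# cfg.gpu_ids is left to the script's existing single-GPU default
# (usually range(1)), so only cuda:0 is used.
```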

qiushanjun commented 1 year ago

> Well, you have to set the parameter `distributed = False` ... It worked for me.

Hello, may I contact you about the PCT project?

gmk11 commented 1 year ago

@qiushanjun Yes, if you want.

pydd123 commented 1 year ago

@gmk11 I followed your steps, but I ran into another error: `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument max in method wrapper_clamp_Tensor)`

gmk11 commented 1 year ago

@pydd123 Can you show me the exact line where the error occurs? From what I can see, your model and your images are not on the same device: one is on the GPU and the other on the CPU. Make sure both are either on the GPU (recommended) or on the CPU. To move a tensor to the GPU, for example, call `tensor.to('cuda:0')`.
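
To illustrate, here is a minimal, self-contained example of keeping everything on one device; the toy model and tensors below are placeholders, not PCT code:

```python
import torch
import torch.nn as nn

# pick one device and keep everything on it
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(10, 2).to(device)          # model parameters now live on `device`
images = torch.randn(4, 10, device=device)   # inputs created on the same device

output = model(images)   # no "tensors on different devices" error
print(output.shape)      # torch.Size([4, 2])
```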

pydd123 commented 1 year ago

> @pydd123 Can you show me the exact line where the error occurs? ... call `tensor.to('cuda:0')`.

Thank you, I have solved this problem. I think it comes from the Swin code, pct_swin_v2.py line 322: `logit_scale = torch.clamp(self.logit_scale, max=torch.log(torch.tensor(1. / 0.01).cuda())).exp()`
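
For anyone hitting the same RuntimeError: the clamp bound `torch.tensor(1. / 0.01)` is created on the CPU by default, so it ends up on a different device from `self.logit_scale` when the model runs on the GPU; adding `.cuda()` as above fixes that. A rough, device-agnostic sketch of the same computation (the parameter below is only a stand-in for `self.logit_scale`, and this is not necessarily how the repository itself fixes it) builds the bound on the parameter's own device instead:

```python
import math
import torch

# stand-in for the module's learnable scale (self.logit_scale in pct_swin_v2.py)
logit_scale = torch.nn.Parameter(
    torch.full((1,), math.log(10.),
               device='cuda:0' if torch.cuda.is_available() else 'cpu'))

# build the clamp bound on the same device as the parameter,
# instead of hard-coding .cuda()
max_bound = torch.log(torch.tensor(1. / 0.01, device=logit_scale.device))
clamped_scale = torch.clamp(logit_scale, max=max_bound).exp()
print(clamped_scale)
```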

gmk11 commented 1 year ago

> Thank you, I have solved this problem. I think it comes from the Swin code (pct_swin_v2.py line 322) ...

Cool, I hope everything works fine now.

wongzingji commented 1 year ago

> Well, you have to set the parameter `distributed = False` ... It worked for me.

For me, only replacing the above code with `distributed = False` gives the error `RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.` After initializing the distributed setting by adding `init_dist(args.launcher, **cfg.dist_params)` it worked, but then I feel like it isn't really non-distributed training 😭 I would appreciate it if you could give me some ideas. Thanks!

gmk11 commented 1 year ago

> For me, only replacing the above code with `distributed = False` gives the error `RuntimeError: Default process group has not been initialized` ... I feel like it isn't really non-distributed training.

Did you modify your dist_train.sh file? Maybe the problem is there. Here is mine, try with it:

```bash
CONFIG=$1
GPUS=$2
PORT=${PORT:-29500}

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
    $(dirname "$0")/train.py $CONFIG ${@:3}
```
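
With this script unchanged, the single-GPU run suggested earlier in the thread is simply `./tools/dist_train.sh configs/pct[base/large/huge]_tokenizer.py 1`, i.e. the same command with the GPU count set to 1; `$GPUS` is passed to `--nproc_per_node`, so only one process is launched.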