Well, you have to set the parameter `distributed = False` in your train and test files. You have these lines:
"""""""init distributed env first, since logger depends on the dist info.
if args.launcher == 'none':
distributed = False
else:
distributed = True
init_dist(args.launcher, **cfg.dist_params)
re-set gpu_ids with distributed training mode
_, world_size = get_dist_info()
cfg.gpu_ids = range(world_size)""""""""
Replace them with `distributed = False`. Also set `workers_per_gpu = 0` in your config file, and don't forget to use a single GPU by replacing the training command `./tools/dist_train.sh configs/pct_[base/large/huge]_tokenizer.py 8` with `./tools/dist_train.sh configs/pct_[base/large/huge]_tokenizer.py 1`.
It worked for me.
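For reference, here is a minimal sketch of the patched block. This is my reading of the suggestion above, not code from the repo; `args` and `cfg` are defined earlier in tools/train.py, and keeping `cfg.gpu_ids` pinned to one device is my own assumption, since the original block set it.

```python
# Patched version of the block above: always run non-distributed on one GPU.
# `args` and `cfg` come from the surrounding tools/train.py (argparse + mmcv Config).
distributed = False
cfg.gpu_ids = range(1)  # keep gpu_ids defined for later use, pinned to a single GPU
```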
Hello, could I get in touch with you about the PCT project?
@qiushanjun yes if you want
@gmk11 I followed your steps, but I ran into another error: `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument max in method wrapper_clamp_Tensor)`
@pydd123 Can you show me the exact line where the error occurs? As far as I can see, your model and your images are not on the same device: one is on the GPU and the other on the CPU. Make sure both are either on the GPU (recommended) or on the CPU. To move a tensor to the GPU, for example, you can call `tensor.to('cuda:0')`.
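For reference, here is a minimal, self-contained sketch of keeping the model and its inputs on one device; the `nn.Linear` and the random batch are just stand-ins, not the PCT model:

```python
import torch
import torch.nn as nn

# Pick one device and keep both the model and its inputs on it.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(10, 2).to(device)      # stand-in for the pose model
images = torch.randn(4, 10).to(device)   # stand-in for the input batch

with torch.no_grad():
    outputs = model(images)  # no "tensors on different devices" error
print(outputs.device)
```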
Thank you, I have solved this problem. I think it is Swin's problem. pct_swin_v2.py line 322: `logit_scale = torch.clamp(self.logit_scale, max=torch.log(torch.tensor(1. / 0.01).cuda())).exp()`
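Hardcoding `.cuda()` works as long as you always train on GPU; a device-agnostic alternative (my own suggestion, not what the repo ships) would be to build the clamp bound on the same device as `self.logit_scale`. A self-contained sketch with a stand-in parameter:

```python
import torch

# Stand-in for the learned parameter used in pct_swin_v2.py (lives on CPU here).
logit_scale_param = torch.nn.Parameter(torch.zeros(1))

# Build the upper bound on the parameter's own device instead of hardcoding .cuda().
logit_scale = torch.clamp(
    logit_scale_param,
    max=torch.log(torch.tensor(1. / 0.01, device=logit_scale_param.device))
).exp()
print(logit_scale)
```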
Cool, I hope everything works fine now.
For me, it seems that only replacing the above code with `distributed = False` gives me the error `RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.` After initializing the distributed setting by adding `init_dist(args.launcher, **cfg.dist_params)` back, it worked. But I feel like that is not non-distributed training 😭 I would appreciate it if you could give me some ideas. Thanks!
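One likely reason for that error is that some layer or hook still calls into `torch.distributed` (e.g. SyncBN or a gather), which needs a default process group even when only one process is running. Below is a minimal sketch of initializing a one-process group so those calls have something to talk to; the gloo backend and the local address/port are my own placeholder choices, only similar in spirit to what `init_dist` sets up for a single worker:

```python
import os
import torch.distributed as dist

# Give the (single) process a default process group so that any
# torch.distributed calls inside the model become trivial one-rank ops.
# The address/port below are arbitrary local placeholders.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')

if not dist.is_initialized():
    dist.init_process_group(backend='gloo', rank=0, world_size=1)

print(dist.get_world_size())  # -> 1
```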
Did you modify your dist_train.sh file? Maybe the problem is there. Here is mine, try with it:

```shell
CONFIG=$1
GPUS=$2
PORT=${PORT:-29500}

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
    $(dirname "$0")/train.py $CONFIG ${@:3}
```
Can this model only be trained with distributed training? How can I use the model without distributed training?