facebookresearch / VMZ

VMZ: Model Zoo for Video Modeling
Apache License 2.0

DistributedDataParallel error when training the model on one server with multiple GPUs? #124

Open chunniunai220ml opened 3 years ago

chunniunai220ml commented 3 years ago

Hi, when I run the training command:

python tools/train_net.py --name=/mnt/codes/ckpts/trains --model=ir_csn_152 --resume_from_model=/mnt/codes/weights/pre_trained_weights/irCSN_152_ig65m_from_scratch_f125286141.pth --dataset=ucf101 --traindir=/mnt/codes/dataset/UCF-101/ --nodes=1 --batch-size=1 --workers=8 --epochs=45 --finetune=True --num_finetune_classes=43

I get the error: AssertionError: DistributedDataParallel with multi-device module only works with CUDA devices, but module parameters locate in {device(type='cuda', index=0), device(type='cpu')}.

Should I change the opts or some code in pt/vmz/func/train.py? How can I solve this problem?

The log.out:

submitit INFO (2020-10-16 10:04:53,290) - Starting with JobEnvironment(job_id=2027, hostname=b33760dbd191, local_rank=0(1), node=0(1), global_rank=0(1))
submitit INFO (2020-10-16 10:04:53,291) - Loading pickle: /mnt/codes/ckpts/trains/2027/2027_submitted.pkl
Process group: 1 tasks, rank: 0
| distributed init (rank 0): file:///mnt/codes/ckpts/trains/aa2610db20af40c0bdadaad8882d434f_init
Namespace(annotation_path='', apex=False, apex_opt_level='O1', batch_size=1, crop_size=112, dataset='ucf101', device='cuda', dist_backend='nccl', dist_url='file:///mnt/codes/ckpts/trains/aa2610db20af40c0bdadaad8882d434f_init', distributed=True, epochs=45, eval_only=False, fc_lr=0.1, finetune='True', fold=1, gpu=0, l1_lr=0.001, l2_lr=0.001, l3_lr=0.001, l4_lr=0.001, lr=0.01, lr_gamma=0.1, lr_milestones=[20, 30, 40], lr_warmup_epochs=10, model='ir_csn_152', momentum=0.9, name='/mnt/codes/ckpts/trains', nodes=1, num_classes=400, num_finetune_classes=43, num_frames=16, output_dir='/mnt/codes/ckpts/trains/2027', partition='dev', pretrained='', print_freq=10, rank=0, resume='', resume_from_model='/mnt/codes/weights/irCSN_152_ig65m_from_scratch_f125286141.pth', scale_h=128, scale_w=174, start_epoch=0, sync_bn=False, train_bs_multiplier=5, train_file='', traindir='/mnt/codes/dataset/UCF-101/', val_clips_per_video=1, val_file='', valdir='/mnt/codes/dataset/val_tmp/', weight_decay=0.0001, workers=8, world_size=1)
torch version: 1.6.0
torchvision version: 0.7.0
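
For reference, the assertion seems to fire whenever some parameters are still on the CPU at the moment DistributedDataParallel wraps the module. A minimal sketch in plain PyTorch (not the VMZ code itself; the r3d_18 backbone and class count are just placeholders) that reproduces the mixed-device state:

```python
import torch
import torch.nn as nn
import torchvision

# Hypothetical illustration: backbone already on the GPU, new head created on the CPU.
model = torchvision.models.video.r3d_18(pretrained=False)
model = model.to("cuda")  # all parameters now on cuda:0

# Fine-tuning: replacing the classifier creates a fresh nn.Linear on the CPU.
model.fc = nn.Linear(model.fc.in_features, 43)

devices = {p.device for p in model.parameters()}
print(devices)  # {device(type='cuda', index=0), device(type='cpu')}

# Wrapping the model now would raise:
# AssertionError: DistributedDataParallel with multi-device module only works
# with CUDA devices, but module parameters locate in {...}
# model = nn.parallel.DistributedDataParallel(model, device_ids=[0])
```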

chunniunai220ml commented 3 years ago

I have solved the problem by adding model.to('cuda') in the if args.distributed: block, because in the if args.finetune: block the modification of model.fc leaves the fc layer's parameters with parameter.device == 'cpu'.
But I do not know how to set up multiple GPUs. The code is torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu]), and args.gpu = job_env.local_rank is set in __setup_gpu_args(). Should I set --nodes=num_gpus to use multi-GPU training?
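
For reference, a sketch of how the fix and a single-node multi-GPU launch usually look in plain PyTorch 1.6 (this is not VMZ's submitit wrapper; the wrap_for_ddp helper and the launch line are illustrative assumptions, not the project's actual options):

```python
import argparse
import torch
import torch.distributed as dist
import torch.nn as nn

def wrap_for_ddp(model, num_finetune_classes, local_rank):
    # Replace the classification head first, then move the whole model to
    # this process's GPU, so no parameter is left on the CPU when DDP wraps it.
    model.fc = nn.Linear(model.fc.in_features, num_finetune_classes)
    model = model.to(f"cuda:{local_rank}")
    return nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

if __name__ == "__main__":
    # One process per GPU on a single node, e.g. (torch 1.6):
    #   python -m torch.distributed.launch --nproc_per_node=NUM_GPUS this_script.py
    # torch.distributed.launch passes --local_rank to each spawned process.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    # ... build the model, then:
    # model = wrap_for_ddp(model, num_finetune_classes=43, local_rank=args.local_rank)
```

The key point is that the model must be entirely on one CUDA device before DistributedDataParallel wraps it, and that single-node multi-GPU training in DDP is normally one process per GPU rather than one process driving all GPUs.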