ZENGXH / DMM_Net

Differentiable Mask-Matching Network for Video Object Segmentation (ICCV 2019)
147 stars · 19 forks

About multi-GPU training #8

Closed siyuch-fdu closed 4 years ago

siyuch-fdu commented 4 years ago

Hi, I disabled distributed training and ran multi-GPU training instead, but a bug appears that points to a mismatch between `len(proposal_cur)` and `x.shape[0]` in trainer.py. I have printed both values, as shown in the log below. This bug does not appear when I train on a single GPU. Can you help me figure it out? Thanks a lot!

```
(dmm) $:~/Experiments/DMM_Net$ sh scripts/train/train_101.sh
2020-05-12 08:23:18,754-{train.py:394}-INFO-[model_name] ytb_train_x101
2020-05-12 08:23:18,755-{train.py:395}-INFO-get number of gpu: 3
2020-05-12 08:23:20,180-{utils.py:213}-INFO-[load_DMM_config] dmm/configs/train.yaml
2020-05-12 08:23:20,185-{utils.py:232}-INFO-ud relax_max_iter 400 -> 10|ud relax_proj_iter 50 -> 5
2020-05-12 08:23:25,200-{utils.py:213}-INFO-[load_DMM_config] dmm/configs/train.yaml
2020-05-12 08:23:25,208-{utils.py:232}-INFO-ud relax_max_iter 400 -> 10|ud relax_proj_iter 50 -> 5
2020-05-12 08:23:25,213-{train.py:162}-INFO-{'sort_max_num': 50, 'matching_score_thre': 0.0, 'score_weight': 0.3, 'relax': 1, 'relax_max_iter': 10, 'relax_proj_iter': 5, 'relax_topk': 0, 'relax_learning_rate': 0.1, 'matching': {'match_max_score': 1, 'algo': 'relax', 'cost': 'cosine'}, 'encoder': {'nms_thresh': 0.4}}
2020-05-12 08:23:25,342-{trainer.py:63}-INFO-load from json_data; num vid 3000
2020-05-12 08:23:25,342-{train.py:164}-INFO-init model 6.586
2020-05-12 08:23:25,344-{train.py:171}-INFO-optimizer 0.002
2020-05-12 08:23:25,344-{train.py:173}-INFO-[enc_opt] len: 2; len for each param group: [48, 314]
2020-05-12 08:23:25,344-{train.py:175}-INFO-[dec_opt] len: 1; len for each param group: [10]
2020-05-12 08:23:25,352-{train.py:222}-INFO-save args in experiments/models/ytb_train_x101/05-12-08-23args.pkl
2020-05-12 08:23:25,352-{train.py:223}-INFO-Namespace(augment=True, base_model='resnet101', batch_size=4, best_val_loss=0, cache_data=1, config_train='dmm/configs/train.yaml', dataset='youtube', davis_eval_folder='', device=device(type='cuda', index=0), distributed=0, distributed_manully=0, distributed_manully_Nrep=0, distributed_manully_rank=0, dropout=0.0, epoch_resume=0, eval_flag='pred', eval_split='trainval', finetune_after=3, gpu_id=0, gt_maxseqlen=5, hidden_size=128, imsize=480, iou_weight=1.0, kernel_size=3, length_clip=3, load_proposals=1, load_proposals_dataset=1, local_rank=0, log_file='train.log', log_term=False, loss_weight_iouraw=1.0, loss_weight_match=1.0, lr=0.0001, lr_cnn=1e-05, lr_decoder=0.001, mask_th=0.5, max_dets=100, max_epoch=2, max_eval_iter=800, maxseqlen=5, min_delta=0.0, min_size=0.001, model_dir='experiments/models/ytb_train_x101', model_name='ytb_train_x101', models_root='experiments/models/', momentum=0.9, my_augment=False, ngpus=3, num_classes=21, num_workers=4, only_spatial=False, only_temporal=False, optim='adam', optim_cnn='adam', overwrite_loadargs=1, pad_video=0, patience=15, patience_stop=60, pred_offline_meta='../data/ytb_vos/splits_813_3k_trainvaltest/meta_vid_frame_2_predid.json', pred_offline_path=['./experiments/proposals/coco81/inference/youtubevos_train3k_meta/asdict_50/videos/'], pred_offline_path_eval=['experiments/proposals/coco81/inference/youtubevos_val200_meta/asdict_50/pred_DICT.pth'], prev_mask_d=1, print_every=2, random_select_frames=1, resize=True, resume=False, resume_path='epoxx_iterxxxx', rotation=10, sample_inference_mask=0, save_every=3000, seed=123, shear=0.1, single_object=False, skip_empty_starting_frame=1, skip_mode='concat', test=0, test_image_h=256, test_image_w=448, test_model_path='', threshold_mask=0.4, train_h=255, train_split='train', train_w=448, translation=0.1, update_encoder=1, use_gpu=True, use_refmask=0, weight_decay=1e-06, weight_decay_cnn=1e-06, year='2017', youtube_dir='../../databases/YouTubeVOS/', zoom=0.7)
2020-05-12 08:23:25,353-{train.py:232}-INFO-init_dataloaders
2020-05-12 08:23:25,527-{youtubeVOS.py:84}-INFO-[dataset] phase read train; len of db seq 3000
2020-05-12 08:23:25,527-{youtubeVOS.py:103}-INFO-LMDB not found. This could affect the data loading time. It is recommended to use LMDB.
2020-05-12 08:23:25,527-{youtubeVOS.py:115}-INFO-no cache data found at data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_train.pkl; it will take a while to cache the data
2020-05-12 08:29:53,807-{youtubeVOS.py:121}-INFO-try to dump in data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_train.pkl
2020-05-12 08:29:56,467-{dataset.py:125}-INFO-+new_parts 200: 1.6958227157592773
2020-05-12 08:30:07,352-{dataset.py:125}-INFO-+new_parts 200: 12.146138191223145
2020-05-12 08:30:58,266-{youtubeVOS.py:125}-INFO-load lmdb 452.77
2020-05-12 08:31:22,048-{youtubeVOS.py:161}-INFO-filtered images out -> 444 for #vid 3000
2020-05-12 08:31:23,869-{youtubeVOS.py:253}-INFO-[init][data][youtube][load clips] load anno 25.58; cliplen 3| annotation clip 26261(skip 59)| videos 3000
2020-05-12 08:31:24,195-{youtubeVOS.py:265}-INFO-load keys 0.33
2020-05-12 08:31:24,196-{train.py:104}-INFO-INPUT shape: 255 448
2020-05-12 08:31:24,316-{dataset.py:119}-INFO-[trainval] loading offline from experiments/proposals/coco81/inference/youtubevos_val200_meta/asdict_50/pred_DICT.pth; Nf ['experiments/proposals/coco81/inference/youtubevos_val200_meta/asdict_50/pred_DICT.pth']
2020-05-12 08:31:36,170-{dataset.py:125}-INFO-+new_parts 200: 11.853206157684326
2020-05-12 08:31:36,178-{dataset.py:133}-INFO-load offline use 11.86 | len 200
2020-05-12 08:31:36,180-{youtubeVOS.py:84}-INFO-[dataset] phase read trainval; len of db seq 200
2020-05-12 08:31:36,196-{youtubeVOS.py:103}-INFO-LMDB not found. This could affect the data loading time. It is recommended to use LMDB.
2020-05-12 08:31:36,196-{youtubeVOS.py:115}-INFO-no cache data found at data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_trainval.pkl; it will take a while to cache the data
2020-05-12 08:31:57,003-{youtubeVOS.py:121}-INFO-try to dump in data/ytb_vos/splits_813_3k_trainvaltest/dmm_cached_trainval.pkl
2020-05-12 08:31:57,383-{youtubeVOS.py:125}-INFO-load lmdb 21.20
2020-05-12 08:31:57,399-{youtubeVOS.py:161}-INFO-filtered images out -> 0 for #vid 200
2020-05-12 08:31:57,421-{youtubeVOS.py:253}-INFO-[init][data][youtube][load clips] load anno 0.04; cliplen 3| annotation clip 800(skip 6)| videos 200
2020-05-12 08:31:57,424-{youtubeVOS.py:265}-INFO-load keys 0.00
2020-05-12 08:31:57,425-{train.py:104}-INFO-INPUT shape: 255 448
2020-05-12 08:31:57,425-{train.py:237}-INFO-dataloader 512.072
2020-05-12 08:31:57,427-{train.py:258}-INFO-epoch 0 - trainval;
2020-05-12 08:31:57,427-{train.py:260}-INFO--- loss weight loss_weight_match: 1.0 loss_weight_iouraw 1.0;
Traceback (most recent call last):
  File "train.py", line 413, in <module>
    trainIters(args)
  File "train.py", line 285, in trainIters
    loss, losses = trainer(batch_idx, inputs, imgs_names, targets, seq_name, starting_frame, split, args, proposals)
  File "/home/csy/anaconda3/envs/dmm/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/csy/anaconda3/envs/dmm/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/csy/anaconda3/envs/dmm/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/csy/anaconda3/envs/dmm/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/csy/anaconda3/envs/dmm/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/csy/anaconda3/envs/dmm/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/csy/Experiments/DMM_Net/dmm/modules/trainer.py", line 112, in forward
    CHECKEQ(len(proposal_cur), x.shape[0])
  File "/home/csy/Experiments/DMM_Net/dmm/utils/checker.py", line 27, in CHECKEQ
    assert(a == b), 'get {} {}'.format(a, b)
AssertionError: get 2 1
len(proposal_cur): 2 x.shape[0]: 1
len(proposal_cur): 2 x.shape[0]: 1
```

ZENGXH commented 4 years ago

Are you using 3 GPUs, but with distributed training disabled?
Could you try with distributed training enabled, as here: https://github.com/ZENGXH/DMM_Net/blob/a6308688cbcf411db9072aa68efbe485dde02a9b/scripts/train/train_101.sh#L28, with the distributed flag turned on? Distributed training is faster for multi-GPU training, even on a single machine; a sketch of the launch command is below.
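For example (a hypothetical invocation pieced together from the `ngpus`, `distributed`, and `local_rank` fields visible in your Namespace dump; the authoritative command is the linked line of `scripts/train/train_101.sh`):

```sh
# assumes train.py parses --distributed and --local_rank, as the Namespace dump suggests;
# append the remaining flags exactly as train_101.sh sets them
python -m torch.distributed.launch --nproc_per_node=3 train.py --distributed 1
```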

I didn't test the non-distributed version before. I suspect the problem is caused by the different behavior of DataParallel and DistributedDataParallel; see the sketch below. If you really need to use DataParallel, I will try my best to help if you share your training scripts.
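Here is a minimal toy sketch (not the DMM_Net code; the `Toy` module and its inputs are made up, and it assumes at least 2 visible GPUs) of why `DataParallel` can trip exactly this check: it scatters tensor arguments along dim 0, but a plain Python list of non-tensor items is replicated whole to every replica, so the per-replica batch size and the list length diverge.

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def forward(self, x, proposals):
        # Inside each replica: x is a slice of the batch, proposals is not.
        assert len(proposals) == x.shape[0], 'get {} {}'.format(len(proposals), x.shape[0])
        return x

model = nn.DataParallel(Toy().cuda(), device_ids=[0, 1])
x = torch.randn(2, 3).cuda()                   # batch of 2, split to 1 per GPU
proposals = ['props_frame0', 'props_frame1']   # non-tensor list, seen whole by each replica
model(x, proposals)                            # AssertionError: get 2 1
```

DistributedDataParallel avoids this because each process loads its own batch, so the tensor and the proposal list stay aligned within a process.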

Thanks,

siyuch-fdu commented 4 years ago

I really appreciate your kind help. I tried distributed training before, but it did not work. Fortunately, distributed training is working well on 2 GPUs today, and I think I can make it work on 3 GPUs later.

Besides, could you show me how to use LMDB? I have installed LMDB, but the code keeps reporting "LMDB not found", as below. This keeps the GPU and CPU utilization very low.

2020-05-13 07:43:55,403-{youtubeVOS.py:109}-INFO-LMDB not found. This could affect the data loading time. It is recommended to use LMDB.

I have checked for the "lmdb_env_annot_dir", but I didn't find it in my YouTubeVOS dataset directory.

Could you please help me figure this problem out? Thank you!

ZENGXH commented 4 years ago

I didn't use the LMDB either. I think not using LMDB should be fine, because the loaded data will be cached as a .pkl anyway.
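For completeness, if you do want to build the LMDB databases yourself, here is a generic sketch using the `lmdb` Python package. Note this is not DMM_Net's exact loader contract: the output directory name and the choice to key entries by relative file path are assumptions for illustration.

```python
import lmdb
import os

def build_lmdb(src_dir, lmdb_dir, map_size=1 << 40):
    """Pack every file under src_dir into an LMDB, keyed by relative path."""
    env = lmdb.open(lmdb_dir, map_size=map_size)  # map_size is the max DB size (sparse on Linux)
    with env.begin(write=True) as txn:            # single write transaction, committed on exit
        for root, _, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                key = os.path.relpath(path, src_dir).encode()
                with open(path, 'rb') as f:
                    txn.put(key, f.read())        # store the raw file bytes
    env.close()

# Hypothetical usage; adjust paths to wherever youtubeVOS.py probes for the LMDB:
# build_lmdb('../../databases/YouTubeVOS/train/Annotations', 'lmdb_annot')
```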