miguelvr opened this issue 5 years ago
Don't you need to unwrap the model from DistributedDataParallel before feeding it to inference? By that I mean passing model.module instead of model?
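Something along these lines, as a minimal sketch (not code from the repo):

```python
import torch.nn as nn

def unwrap_ddp(model):
    # DistributedDataParallel keeps the original network on .module;
    # hand that to the inference/evaluation code instead of the wrapper.
    if isinstance(model, nn.parallel.DistributedDataParallel):
        return model.module
    return model
```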
That doesn't make a difference. Same behaviour
I don't actually know what else could be the problem; I'd need to check it out to identify where it hangs.
Can you maybe attach gdb and print the stack trace where it hangs? Run
python tools/train_net.py --config-file /path/to/config
and once it hangs, attach gdb with
gdb attach <pid>
then run
thread apply all bt
and paste the results?
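As a complement to the native backtraces, something like this near the top of tools/train_net.py can also help (my own suggestion, not part of the repo): it prints the PID so you attach gdb to the right process, and lets SIGUSR1 dump the Python-level stacks of all threads.

```python
import faulthandler
import os
import signal

print(f"train_net.py running as PID {os.getpid()}")      # PID to attach gdb to
faulthandler.register(signal.SIGUSR1, all_threads=True)  # `kill -USR1 <pid>` dumps Python stacks
```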
@fmassa Hi! I am hitting the same problem. I followed your instructions to attach gdb and ran the commands above. Here are my results:
Thread 7 (Thread 0x7f6033568700 (LWP 3350)):
#0 0x00007f6087f6803f in accept4 () from /lib64/libc.so.6
#1 0x00007f6073b644a6 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#2 0x00007f6073b58a3d in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3 0x00007f6073b65110 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4 0x00007f608823ce25 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f6087f66bad in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f6032d67700 (LWP 3351)):
#0 0x00007f6087f5bf0d in poll () from /lib64/libc.so.6
#1 0x00007f60712478e9 in c10d::TCPStoreDaemon::run() ()
from /home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#2 0x00007f6071e61dc0 in ?? ()
from /home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/../../../libstdc++.so.6
#3 0x00007f608823ce25 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f6087f66bad in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f602a6fa700 (LWP 3512)):
#0 0x00007f6087f5bf0d in poll () from /lib64/libc.so.6
#1 0x00007f6073b6369b in ?? () from /usr/lib64/nvidia/libcuda.so.1
Could you please provide any suggestions?
@BobZhangHT we might need the full log for that. Does it still hang on single-machine training?
@fmassa Thanks for your reply. I modified my code and it has not hung since then. I actually ran several rounds successfully, but then suddenly got stuck on a new problem with distributed training. Here is the log; I also updated my PyTorch to the latest version, but that did not help.
File "tools/train_net.py", line 230, in <module>
main()
File "tools/train_net.py", line 223, in main
model = train(cfg, args.local_rank, args.distributed)
File "tools/train_net.py", line 121, in train
masksgt,)# add 2019/01/11
File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 106, in do_train
loss_dict = model(images, targets)
File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 364, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
proposals, proposal_losses = self.rpn(images, features, targets)
File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 100, in forward
return self._forward_train(anchors, objectness, rpn_box_regression, targets)
File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 116, in _forward_train
anchors, objectness, rpn_box_regression, targets
File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/inference.py", line 138, in forward
sampled_boxes.append(self.forward_for_single_feature_map(a, o, b))
File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/inference.py", line 113, in forward_for_single_feature_map
boxlist = remove_small_boxes(boxlist, self.min_size)
File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/structures/boxlist_ops.py", line 46, in remove_small_boxes
(ws >= min_size) & (hs >= min_size)
RuntimeError: copy_if failed to synchronize: device-side assert triggered
terminate called without an active exception
@fmassa Hi. I additionally tracked the training at every iteration and found that it can actually train for the first 7 to 9 iterations, then breaks with the following error. It seems that some index is out of bounds on the CUDA side, but I cannot figure out which one. Here is part of the error output; since the messages follow a repeated pattern, I only include a small fraction.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [76,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [77,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [78,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [79,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [52,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [53,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [54,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [55,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [56,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [57,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [58,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [59,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [60,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [61,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [62,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [63,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [8,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
....
Traceback (most recent call last):
File "tools/train_net.py", line 237, in <module>
main()
File "tools/train_net.py", line 230, in main
model = train(cfg, args.local_rank, args.distributed)
File "tools/train_net.py", line 138, in train
arguments,)
File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 109, in do_train
loss_dict = model(images, targets)
File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 364, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
proposals, proposal_losses = self.rpn(images, features, targets)
File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 100, in forward
return self._forward_train(anchors, objectness, rpn_box_regression, targets)
File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 116, in _forward_train
anchors, objectness, rpn_box_regression, targets
File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/inference.py", line 138, in forward
sampled_boxes.append(self.forward_for_single_feature_map(a, o, b))
File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/inference.py", line 113, in forward_for_single_feature_map
boxlist = remove_small_boxes(boxlist, self.min_size)
File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/structures/boxlist_ops.py", line 46, in remove_small_boxes
(ws >= min_size) & (hs >= min_size)
RuntimeError: copy_if failed to synchronize: device-side assert triggered
terminate called without an active exception
terminate called without an active exception
Update: I also found that this may occasionally happen even for single-GPU training.
I would say that one of your training examples contains an out-of-bounds index or something like that. Maybe one of your training examples has a GT index that is larger than NUM_CLASSES - 1?
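A general tip for narrowing this kind of thing down (not specific to this repo): run with synchronous CUDA launches, so the device-side assert is reported at the offending call instead of at a later synchronize.

```python
# Either launch the same training command as:
#   CUDA_LAUNCH_BLOCKING=1 python tools/train_net.py --config-file /path/to/config
# or set it in the script before any CUDA work happens:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```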
@fmassa Sincere thanks for your reply. After double-checking my code, I still cannot figure out where my training data's indices go out of bounds. Here is my code to generate COCO-format data. All ids start from 1 and there are 2 classes in total (including the background).
# imports used below (added here for completeness)
import numpy as np
import cv2
import matplotlib.pyplot as plt
from tqdm import tqdm
from pycocotools import mask as mask_utils

def coco_dict_gen(files, imagesFile, masksFile):
    img_lst = []
    anno_lst = []
    for i, file in enumerate(tqdm(files)):
        # image dict
        image_dict = {}
        # image file directory
        image_dir = imagesFile + file
        # load the image
        image = plt.imread(image_dir, 'RGB')[:, :, :3]
        # specify the image dict
        image_dict['file_name'] = image_dir
        image_dict['height'] = int(image.shape[0])
        image_dict['width'] = int(image.shape[1])
        image_dict['id'] = i + 1
        # append
        img_lst.append(image_dict)
        # annotation dict
        anno_dict = {}
        # load mask
        mask_dir = masksFile + file
        mask = plt.imread(mask_dir, 'RGB')[:, :, :3]
        # convert RGB mask into gray mask
        bimask = cv2.cvtColor(mask, cv2.COLOR_RGB2GRAY).astype(np.uint8)
        poly = bimask_to_polygon(bimask)  # helper defined elsewhere in my script
        bimask = np.asfortranarray(bimask, dtype=np.uint8)  # convert to fortran format
        rle_cprs = mask_utils.encode(bimask)
        # save dict
        anno_dict['image_id'] = i + 1
        anno_dict['id'] = i + 1
        anno_dict['category_id'] = 1
        anno_dict['iscrowd'] = 0
        anno_dict['segmentation'] = poly  # rle_uncprs
        anno_dict['area'] = float(mask_utils.area(rle_cprs))
        anno_dict['bbox'] = list(mask_utils.toBbox(rle_cprs))
        # append
        anno_lst.append(anno_dict)
    data_dict = {}
    data_dict['info'] = {}
    data_dict['licenses'] = []
    data_dict['images'] = img_lst
    data_dict['annotations'] = anno_lst
    data_dict['categories'] = [{'supercategory': 'Slice', 'id': 1, 'name': 'Lesion Slice'}]
    return data_dict
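A minimal usage sketch (the paths below are placeholders, not my actual directories):

```python
import json
import os

files = sorted(os.listdir("path/to/images/"))
data_dict = coco_dict_gen(files, "path/to/images/", "path/to/masks/")
with open("path/to/annotations.json", "w") as f:
    json.dump(data_dict, f)
```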
So, do you have two classes including background, or excluding background? I'd check that: if your indices go up to 10, then NUM_CLASSES should be 11.
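A quick sanity check you could run on the generated json (generic snippet; the file name is a placeholder):

```python
import json

NUM_CLASSES = 2  # background + 1 foreground class, as configured

with open("path/to/annotations.json") as f:
    coco = json.load(f)

max_cat = max(ann["category_id"] for ann in coco["annotations"])
print("max category_id:", max_cat)
assert max_cat <= NUM_CLASSES - 1, "category_id exceeds NUM_CLASSES - 1"
```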
Including background, actually. So I set the number of classes to 2 (1 for the object and 1 for the background). Besides, I found a very confusing fact. Under a single GPU, if I use the default setting of e2e_mask_rcnn_R_50_FPN_1x.yaml, that is, if I use
python tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 2 TEST.IMS_PER_BATCH 1
to train my model, I cannot avoid this error. But if I switch to
python tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 60000 SOLVER.STEPS "(30000, 40000)" TEST.IMS_PER_BATCH 1
or
python tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1
training works. I don't know why; it's so weird.
Oh, your model probably diverged during training, because you were using too large a learning rate for batch size 2.
Given that the problem doesn't seem to be related to distributed validation, would you mind opening a new issue if the error persists?
Thanks
@fmassa Thank you so much! After tuning the learning rate, it works! As for the distributed case, I used to follow the lr schedule in Detectron, where the LR is doubled for 2 GPUs (lr=0.005); it seems that this LR is still too large, so I turned it back to 0.0025.
@BobZhangHT @fmassa Hello, I use the following config for single-GPU Cityscapes instance segmentation training: SOLVER.IMS_PER_BATCH 1, SOLVER.BASE_LR 0.00125, Steps=32000.
Should I change the config as follows when I use two GPUs: SOLVER.IMS_PER_BATCH 2, SOLVER.BASE_LR 0.00125 (or 0.0025?), Steps=16000 (or 8000?)
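If the linear scaling rule from the Detectron-style schedules applies here (my understanding, please correct me if not), doubling the batch size would double the LR and halve the number of steps:

```python
# Sketch of the linear scaling rule (my own helper, not something from maskrcnn-benchmark):
def scale_schedule(base_lr, base_ims_per_batch, base_steps, new_ims_per_batch):
    factor = new_ims_per_batch / base_ims_per_batch
    return base_lr * factor, int(round(base_steps / factor))

print(scale_schedule(0.00125, 1, 32000, 2))  # -> (0.0025, 16000)
```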
thanks!
❓ Questions and Help
I'm working on a branch where I implemented validation inference at every checkpoint.
Everything was working fine until the changes from torch.deprecated.distributed to the new torch.distributed.
Now either the DataLoader breaks on one of the processes or, if I run the inference in the main process, it hangs there forever. Sample code:
Any workaround for this?
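Roughly, the pattern is the following (an illustrative sketch only, not the actual sample code from the branch; helper names are made up):

```python
import torch
import torch.distributed as dist

def validate_at_checkpoint(model, data_loader_val, is_main_process):
    # run the validation inference only on the main process
    if is_main_process:
        model.eval()
        with torch.no_grad():
            for images, _ in data_loader_val:
                model(images)
        model.train()
    # keep the other ranks in step while rank 0 runs inference
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
```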