facebookresearch / maskrcnn-benchmark

Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch.

distributed error encountered #318

Open txytju opened 5 years ago

txytju commented 5 years ago

❓ Questions and Help

I tried to use just P2-P4 of the FPN and only modified a few lines of code. The code works well on a single GPU, but when using more than one GPU, the error below is encountered.

Traceback (most recent call last):
  File "/root/txy1/mask-rcnn/maskrcnn-benchmark/tools/train_net.py", line 251, in <module>
    main()
  File "/root/txy1/mask-rcnn/maskrcnn-benchmark/tools/train_net.py", line 244, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "/root/txy1/mask-rcnn/maskrcnn-benchmark/tools/train_net.py", line 153, in train
    arguments,
  File "/root/txy1/mask-rcnn/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 81, in do_train
    losses.backward()
  File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 384, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 413, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]
Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7ffb97f8c180>, [[tensor([[[[0.]],

The main modification I made is in the forward function of fpn.py:

        # just use P2-P4 rather than P2-P5
        # use_P5 is bool, FPN outputs P2-P5 when use_P5==True and P2-P4 when False
        if not self.use_P5:
            results.pop()
fmassa commented 5 years ago

Hi,

I think you could achieve something like that by just removing the last element in https://github.com/facebookresearch/maskrcnn-benchmark/blob/f25c6cff92d32d92abe8965d68401004e90c8bee/configs/e2e_faster_rcnn_R_50_FPN_1x.yaml#L18 so that it becomes (0.25, 0.125, 0.0625). It might work out of the box, but I'm not 100% sure right now.

Can you try that first?
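For reference, a minimal sketch of what that change amounts to, done as a runtime override rather than editing the YAML; it assumes the standard yacs config of this repo and that MODEL.ROI_BOX_HEAD.POOLER_SCALES is the field behind the linked line:

# Hedged sketch: override POOLER_SCALES at runtime instead of editing the YAML file.
# Assumes the maskrcnn_benchmark yacs config and that the repo root is the working directory.
from maskrcnn_benchmark.config import cfg

cfg.merge_from_file("configs/e2e_faster_rcnn_R_50_FPN_1x.yaml")
cfg.merge_from_list(["MODEL.ROI_BOX_HEAD.POOLER_SCALES", (0.25, 0.125, 0.0625)])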

txytju commented 5 years ago

Should the length of POOLER_SCALES be the same as that of ANCHOR_STRIDE? And should both match the number of feature maps output by the FPN? Is that right?

fmassa commented 5 years ago

The length of ANCHOR_STRIDE should match the number of feature maps in the FPN. But I believe we can limit the number of pooled feature maps by just reducing POOLER_SCALES, as illustrated below.
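A hedged illustration of that relationship, using the usual values from this repo's FPN configs (the shorter POOLER_SCALES is the suggested reduction):

# Illustrative only; assumes the same yacs config object as above.
from maskrcnn_benchmark.config import cfg

cfg.MODEL.RPN.ANCHOR_STRIDE = (4, 8, 16, 32, 64)              # one stride per FPN level seen by the RPN
cfg.MODEL.ROI_BOX_HEAD.POOLER_SCALES = (0.25, 0.125, 0.0625)  # pool boxes only from the first three levels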

Let me know if this doesn't work, I might be missing something.

HOPEver1991 commented 5 years ago

I also met this problem; have you solved it?

fmassa commented 5 years ago

@HOPEver1991 what was the error message? And in what context did you see it? (what did you change in the implementation?)

HOPEver1991 commented 5 years ago

@HOPEver1991 what was the error message? And in what context did you see it? (what did you change in the implementation?)

Thank you for your reply !

I have a tensor output by one layer, and I need to apply several different convolutions to it. Similar to @txytju, the code works well on a single GPU but fails in the distributed environment. The error is as follows:

Traceback (most recent call last):
  File "tools/train_net.py", line 174, in <module>
    main()
  File "tools/train_net.py", line 167, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 76, in train
    arguments,
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/maskrcnn_benchmark/engine/trainer.py", line 76, in do_train
    losses.backward()
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 445, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/opt/conda/envs/python3.6/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 475, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:

  1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7f75aa3e0ae8>, [[tensor([[[[0.]],

fmassa commented 5 years ago

@HOPEver1991 maybe one of the GPUs has a different computation graph?

Having a minimum reproducible example would help a lot as well identifying the issue.

mikigom commented 5 years ago

@fmassa Hi, I encountered the same problem. I tried to use SENet as the backbone network, implemented on top of your repo.

When the whole model is trained on a single GPU, it works correctly. However, with more than one GPU, the same error that @txytju and @HOPEver1991 referred to is raised.

I'm not asking you to fix the problem, but if you need to figure out the cause of this issue, I'd be happy to share my current code with you. If you need it, please let me know.

fmassa commented 5 years ago

@mikigom sharing the code would be very helpful, but I might not have the time to dig too much into it in the near future unfortunately

Lausannen commented 5 years ago

Hi, I have met the same problem with distributed training; the same error that @txytju and @HOPEver1991 referred to is raised. I tried to use one node with multiple GPUs, but it failed on backward. I will try to provide a minimum reproducible example, but since I have changed a lot in this repository, it will take some time. I would appreciate it if you could provide some suggestions! Thanks!

fmassa commented 5 years ago

@Lausannen When you say it failed on backward, does this mean that it raised an error or was it stuck?

Lausannen commented 5 years ago

@fmassa Thank you! Sorry for my late reply. "It failed on backward" means that it raised an error; the error info was the same as txytju's: "TypeError: _queue_reduction(): incompatible function arguments."

While trying to solve the problem, I found something that may be helpful for you. In https://github.com/pytorch/pytorch/issues/13273 they discussed DDP support, and someone suggested NVIDIA's distributed module wrapper, apex.parallel.DistributedDataParallel. I gave it a try, and the code ran successfully. I think my model may also have some layers or parameters that are not used, but since I am new to deep learning, I am not sure about this. Hopefully my discovery can help you; if you need me to provide any other information, please let me know. Thank you again for your quick reply!

mikigom commented 5 years ago

After I saw @Lausannen's reply, I tried to remove all unused layers in my new backbone, and it completely solved my problem (referring to https://github.com/pytorch/pytorch/issues/13273). Thus, in my case, it is PyTorch's issue rather than this repo's issue. Thank you, @fmassa.

Lausannen commented 5 years ago

@mikigom Hi, thank you for your test and reply. If it is not too much trouble, can you tell me how to determine which layers are not used in a model?

fmassa commented 5 years ago

Awesome, good to know that this was the issue!

mikigom commented 5 years ago

@Lausannen I strongly recommend that you check the forward() of your nn.Module backbone. In ordinary usage, only the parameters used in forward() are the parameters that are actually exercised by the nn.Module. Compare all declared class variables with the variables actually used in forward().

nn.Module.parameters() and nn.Module.named_parameters() return all parameters registered on the nn.Module (whether they are used in forward() or not).
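If it helps, here is a rough, hedged sketch (not code from this repo) that uses forward hooks to list submodules that own parameters but are never called during one forward pass; model and sample_inputs are placeholders for your own model and a sample batch:

import torch.nn as nn

def find_unused_modules(model: nn.Module, *sample_inputs):
    fired = set()
    handles = []
    for name, module in model.named_modules():
        # record the module name whenever its forward() runs
        handles.append(module.register_forward_hook(
            lambda mod, inp, out, name=name: fired.add(name)))
    model(*sample_inputs)  # one forward pass with a sample batch
    for h in handles:
        h.remove()
    # modules that directly own parameters but never fired are candidates for removal
    return [name for name, module in model.named_modules()
            if list(module.parameters(recurse=False)) and name not in fired]

Note that this only catches modules whose forward() is never invoked; parameters used purely through the functional API would need the gradient check described further down the thread.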

Lausannen commented 5 years ago

@mikigom Thank you for your reply ! I will check my code.

chengyangfu commented 5 years ago

I also met the same problem, and the solution is to remove the unused parameters.

Try adding the following lines to the code:

for name, param in model.named_parameters():
    print(name, param.grad is not None)

After backward, if a parameter does not have a grad, it means that parameter is either frozen or not used in the forward pass.

fmassa commented 5 years ago

Thanks for the comment @chengyangfu !

This is indeed a problem, and apparently one potential solution is also to switch to apex DDP, as discussed in https://github.com/pytorch/pytorch/issues/13273

xllau commented 5 years ago

I have also met this problem, and I am trying to reconfigure the environment, but it does not work. Thanks all. I have another question: can anyone provide a new version of the multi-GPU training code without the deprecated code?

fmassa commented 5 years ago

@xllau the new version of the codebase uses the new distributed backend of PyTorch

xllau commented 5 years ago

Hi, I have found a solution. I installed the PyTorch nightly version pytorch-nightly 1.0.0.dev20190207 py3.6_cuda9.0.176_cudnn7.4.2_0, and it does not work. I also tried Python 3.6.1, 3.6.3, 3.6.5, and 3.7.x; none of them work. Finally, I arrived at the following config, and it works:

python 3.6.8 h0371630_0 defaults
pytorch-nightly 1.0.0.dev20190128 py3.6_cuda9.0.176_cudnn7.4.1_0 pytorch

I have a full conda-env config at this link: https://github.com/xllau/maskrcnn-benchmark/blob/master/conda_mb.yaml. Download it and run conda create --file ./conda_mb.yaml, and the environment will be installed automatically.

fmassa commented 5 years ago

@xllau so with newer versions of the pytorch nightly it doesn't work, but with a more ancient one it works, is that right?

xllau commented 5 years ago

@xllau so with newer versions of the pytorch nightly it doesn't work, but with a more ancient one it works, is that right?

Yes, the version control is something nasty!

fmassa commented 5 years ago

@xllau and the error you get with a recent PyTorch is exactly the same one as in the description of this issue?

moinnadeem commented 5 years ago

Does anyone know the performance impact of figuring out which gradients are zero, and setting those to be not trainable?

The problem is that I do multi-task training, so the tasks that aren't being trained at the time are unused parts of the model. Is this an acceptable patch?

densechen commented 5 years ago

I met the same issue. By removing the unused layers, everything works well. However, I need to train the model with these layers in some epochs and skip them in other epochs to get a better trained model. Is there a good way to remove and add the layers dynamically in code? Thanks in advance.

Lausannen commented 5 years ago

@LittleLampChen I recommend using NVIDIA Apex to wrap your model, setting delay_allreduce=True. In this mode, Apex collects the variables whose gradients need to be computed after the epoch has finished, so it can adjust the compute graph each epoch. I think this can help with your situation.
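A minimal, hedged sketch of that wrapping, assuming apex is installed and the usual torch.distributed environment variables are set (the Linear layer just stands in for the real detector):

import torch
from apex.parallel import DistributedDataParallel as ApexDDP

torch.distributed.init_process_group(backend="nccl", init_method="env://")
model = torch.nn.Linear(10, 10).cuda()   # placeholder for the real model
# delay_allreduce=True defers the all-reduce until backward has finished, so the
# reduction only covers gradients that were actually produced this iteration
model = ApexDDP(model, delay_allreduce=True)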

densechen commented 5 years ago

@Lausannen Thank you very much! I will have a try.

chengyangfu commented 5 years ago

Hi @LittleLampChen, another way is to multiply the losses you don't want by 0.
This is not the best solution, but it works well enough for now. So first, calculate all the losses (I assume you run some multi-task training), then multiply the losses you don't need for this iteration by 0.
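A hedged sketch of what that looks like; loss_dict and active_tasks are placeholder names for your own losses and task schedule:

import torch

# placeholder losses standing in for the real multi-task loss dict
loss_dict = {"box": torch.rand(1, requires_grad=True).sum(),
             "mask": torch.rand(1, requires_grad=True).sum()}
active_tasks = {"box"}  # tasks actually trained this iteration

# multiplying (rather than dropping) the inactive losses keeps their parameters in the
# autograd graph, so DDP still sees a (zero) gradient for all of them
total_loss = sum(loss if task in active_tasks else 0.0 * loss
                 for task, loss in loss_dict.items())
total_loss.backward()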

densechen commented 5 years ago

@chengyangfu This may be a better way.

linhuaiyuan commented 5 years ago

@fmassa I also meet same problem,can you help me,thank you. Traceback (most recent call last): File "tools/train_net.py", line 174, in main() File "tools/train_net.py", line 167, in main model = train(cfg, args.local_rank, args.distributed) File "tools/train_net.py", line 60, in train start_iter=arguments["iteration"], File "C:\Users\Caesar\anaconda\envs\pytorch\lib\site-packages\maskrcnn_benchmark-0.1-py3.6-win-amd64.egg\maskrcnn_benchmark\data\build.py", line 154, in make_data_loader datasets = build_dataset(dataset_list, transforms, DatasetCatalog, is_train) File "C:\Users\Caesar\anaconda\envs\pytorch\lib\site-packages\maskrcnn_benchmark-0.1-py3.6-win-amd64.egg\maskrcnn_benchmark\data\build.py", line 44, in build_dataset dataset = factory(**args) File "C:\Users\Caesar\anaconda\envs\pytorch\lib\site-packages\maskrcnn_benchmark-0.1-py3.6-win-amd64.egg\maskrcnn_benchmark\data\dataset \coco.py", line 43, in init super(COCODataset, self).init(root, ann_file) File "C:\Users\Caesar\anaconda\envs\pytorch\lib\site-packages\torchvision\datasets\coco.py", line 97, in init self.coco = COC(annFile) File "C:\Users\Caesar\anaconda\envs\pytorch\lib\site-packages\pycocotools\coco.py", line 85, in init dataset = json.load(open(annotation_file, 'r')) FileNotFoundError: [Errno 2] No such file or directory: 'datasets\coco/annotations/instances_train2017.json'

chenjoya commented 5 years ago

Hi @LittleLampChen, another way is to multiply the losses you don't want by 0. This is not the best solution, but it works well enough for now. So first, calculate all the losses (I assume you run some multi-task training), then multiply the losses you don't need for this iteration by 0.

Yeah. Also, we can return a zeroed loss, e.g. return loss.zero_()

samson-wang commented 5 years ago

@fmassa I've implemented multi-GPU training in my project using only the torch.distributed.all_reduce function. In some cases, even when tensor.requires_grad is True, tensor.grad is None. The all_reduce should not be applied to such tensors.

I think this is what leads to the error above.

TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

The List[List[at::Tensor]] requirement breaks because a None grad is involved. So should torch/nn/parallel/distributed.py handle the None-grad case?
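For illustration, a hedged sketch of the manual gradient averaging I am describing, which simply skips parameters whose .grad is None (frozen or unused in this forward pass); it assumes the process group is already initialized:

import torch.distributed as dist

def average_gradients(model):
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            # frozen or unused parameter: nothing to reduce on this rank
            continue
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= world_size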