txytju opened this issue 5 years ago
Hi,
I think you could achieve something like that by just removing the last element of POOLER_SCALES in https://github.com/facebookresearch/maskrcnn-benchmark/blob/f25c6cff92d32d92abe8965d68401004e90c8bee/configs/e2e_faster_rcnn_R_50_FPN_1x.yaml#L18 so that it becomes (0.25, 0.125, 0.0625).
It might work out of the box, but I'm not 100% sure now.
Can you try that first?
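For reference, the relevant part of the config would then look roughly like this (section layout assumed from the default FPN config, so double-check against the actual file):

```yaml
MODEL:
  ROI_BOX_HEAD:
    # Original: (0.25, 0.125, 0.0625, 0.03125) -- one scale per pooled FPN level.
    # Dropping the last entry keeps only three pooled levels.
    POOLER_SCALES: (0.25, 0.125, 0.0625)
```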
Should the length of POOLER_SCALES be the same as that of ANCHOR_STRIDE? And should both be the same as the number of feature maps output by the FPN? Is that right?
The length of ANCHOR_STRIDE should be the same as the number of feature maps in the FPN, but we can limit the number of pooled feature maps by just reducing POOLER_SCALES, I believe.
Let me know if this doesn't work, I might be missing something.
I also met this problem. Have you solved it?
@HOPEver1991 what was the error message? And in what context did you see it? (what did you change in the implementation?)
Thank you for your reply!
I have the output tensor of a layer, and I need to apply several different convolutions to it. Similar to @txytju, the code works well on a single GPU but fails in the distributed environment. The error is as follows:
Traceback (most recent call last):
File "tools/train_net.py", line 174, in
Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7f75aa3e0ae8>, [[tensor([[[[0.]],
@HOPEver1991 maybe one of the GPUs has a different computation graph?
Having a minimal reproducible example would also help a lot in identifying the issue.
@fmassa Hi, I encountered the same problem. I tried to use SENet as the backbone network and, based on your repo, got it implemented.
When the whole model is trained on a single GPU, it works correctly. However, with more than one GPU, the same error that @txytju and @HOPEver1991 referred to is raised.
I'm not asking you to fix the problem, but if you want to figure out the cause of this issue, I'd be happy to share my current code with you. If you need it, please let me know.
@mikigom sharing the code would be very helpful, but I might not have the time to dig too much into it in the near future unfortunately
Hi, I have met the same problem with distributed training. The same error that @txytju and @HOPEver1991 referred to is raised. I tried to use one node with multiple GPUs, but it failed on backward. I will try to provide a minimal reproducible example, but since I have changed a lot in this repository, it will take some time. I would appreciate it if you could provide some suggestions! Thanks!
@Lausannen When you say it failed on backward, does this mean that it raised an error or was it stuck?
@fmassa Thank you! Sorry for my late reply. "It failed on backward" means that it raised an error; the error info was the same as txytju's "TypeError: _queue_reduction(): incompatible function arguments."
While trying to solve the problem, I found something that may be helpful for you. In https://github.com/pytorch/pytorch/issues/13273 they discussed DDP support, and someone suggested NVIDIA's distributed module wrapper, apex.parallel.DistributedDataParallel. I gave it a try, and the code ran successfully. I think my model may also have some layers or parameters that are not used; since I am new to deep learning, I am not sure about this. Hopefully my findings help you; if you need me to provide any other information, please let me know. Thank you for your quick reply again!
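In case it helps, the swap itself is just a one-line change where the model gets wrapped (a sketch only; it assumes Apex is installed and the process group is already initialized, as tools/train_net.py does):

```python
import torch.nn as nn
from apex.parallel import DistributedDataParallel as ApexDDP

model = nn.Linear(4, 4).cuda()   # toy stand-in for the detector; use your own model
# Instead of torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], ...):
# Apex's wrapper assumes one GPU per process and that the model is already on that device.
model = ApexDDP(model)
```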
After I saw @Lausannen's reply, I tried to remove all unused layers in my new backbone and it totally solved my problem (see https://github.com/pytorch/pytorch/issues/13273). Thus, for my case, it is PyTorch's issue rather than this repo's issue. Thank you, @fmassa.
@mikigom Hi, thank you for your test and reply. If it is not too much trouble, could you tell me how to determine which layers are not used in a model?
Awesome, good to know that this was the issue!
@Lausannen I strongly recommend that you check the forward() of your nn.Module backbone. In ordinary usage, only the parameters used in forward() are the parameters that are actually exercised by the nn.Module. Compare all declared class variables with the variables used in forward().
nn.Module.parameters() or nn.Module.named_parameters() returns all parameters declared in the nn.Module (whether they are used in forward() or not).
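As a tiny illustration (toy example, not from the repo): a layer that is declared but never called in forward() still shows up in named_parameters():

```python
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)   # declared, but forward() never calls it

    def forward(self, x):
        return self.used(x)

model = Toy()
print([name for name, _ in model.named_parameters()])
# ['used.weight', 'used.bias', 'unused.weight', 'unused.bias']
```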
@mikigom Thank you for your reply! I will check my code.
I also met the same problem, and the solution is to remove the unused parameters.
Try adding the following lines to the code.
for name, param in model.named_parameters():
    print(name, param, param.grad is not None)
After backward, if a parameter does not contain grad, it means the parameter is either frozen or not used in the forward pass.
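A quick self-contained way to see what this prints (toy model, not the benchmark code):

```python
import torch
import torch.nn as nn

model = nn.ModuleDict({"used": nn.Linear(4, 4), "unused": nn.Linear(4, 4)})
loss = model["used"](torch.randn(2, 4)).sum()   # "unused" never participates
loss.backward()
for name, param in model.named_parameters():
    print(name, param.grad is not None)          # the unused.* lines print False
```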
Thanks for the comment @chengyangfu!
This is indeed a problem, and apparently one potential solution is also to switch to apex DDP, as discussed in https://github.com/pytorch/pytorch/issues/13273
I have also met this problem, and I am trying to re-configure the environment, but it does not work. Thanks, all. I have another question: can anyone provide a new version of the multi-GPU training code without the deprecated code?
@xllau the new version of the codebase uses the new distributed backend of PyTorch.
Hi, I have found a solution. I installed the PyTorch nightly with version pytorch-nightly 1.0.0.dev20190207 py3.6_cuda9.0.176_cudnn7.4.2_0, and it does not work. I have also tried Python 3.6.1, 3.6.3, 3.6.5, and 3.7.x; none of them work. Finally, I came to the following configuration, which works: python 3.6.8 h0371630_0 defaults, pytorch-nightly 1.0.0.dev20190128 py3.6_cuda9.0.176_cudnn7.4.1_0 pytorch. I have a full conda environment config at this link: https://github.com/xllau/maskrcnn-benchmark/blob/master/conda_mb.yaml . Download it and run conda create --file ./conda_mb.yaml, and it will be installed automatically.
@xllau so with newer versions of the pytorch nightly it doesn't work, but with a more ancient one it works, is that right?
Yes, the version compatibility issue is something nasty!
@xllau and the error you get with a recent PyTorch is exactly the same one as in the description of this issue?
Does anyone know the performance impact of figuring out which gradients are zero, and setting those to be not trainable?
The problem is that I do multi-task training, so the tasks that aren't being trained at the time are unused parts of the model. Is this an acceptable patch?
I met the same issue. After removing the unused layers, everything works well. However, I need to train the model with these layers in some epochs and skip them in other epochs to get a better-trained model. Is there a better coding method to remove and add the layers dynamically? Thanks in advance.
@LittleLampChen I recommend using NVIDIA Apex to wrap your model and setting delay_allreduce=True. In this mode, Apex collects the variables whose gradients need to be reduced only after the backward pass has finished, so it can adjust to a compute graph that changes from epoch to epoch. I think this can help with your situation.
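For reference, the wrapping looks roughly like this (a sketch, assuming Apex is installed and torch.distributed is already initialized):

```python
import torch.nn as nn
from apex.parallel import DistributedDataParallel as ApexDDP

model = nn.Linear(4, 4).cuda()   # toy model; use your own here
# delay_allreduce=True waits until the whole backward pass has finished before
# reducing gradients, so parameters that go unused in a given pass are tolerated.
model = ApexDDP(model, delay_allreduce=True)
```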
@Lausannen Thank you very much! I will have a try.
Hi @LittleLampChen,
Another way is to multiply the losses you don't want by 0.
This is not the best solution, but it works well for now. First, calculate all the losses (I assume you run some multi-task training), then multiply the losses you don't need for this iteration by 0.
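A minimal sketch of what I mean (toy losses, just to show the pattern):

```python
import torch

w = torch.nn.Parameter(torch.randn(3))
losses = {"task_a": (w ** 2).sum(), "task_b": w.abs().sum()}   # toy per-task losses
active = {"task_a"}                                            # tasks trained this iteration
# Multiply the inactive losses by 0 instead of dropping them, so every parameter
# still appears in the autograd graph that DDP sees.
total = sum(loss if name in active else 0.0 * loss for name, loss in losses.items())
total.backward()
```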
@chengyangfu This may be a better way.
@fmassa I also meet the same problem. Can you help me? Thank you.
Traceback (most recent call last):
File "tools/train_net.py", line 174, in
Hi @LittleLampChen , Another way is to multiply 0 to the loss you don't want. This is not the best solution but temporarily it works well now. So, first, you need to calculate all the losses(I assume you run some multitasks training.). Then multiply 0 to the losses you don't need for this iteration.
Yeah. Also, we can return a zeroed loss, e.g. return loss.zero_().
@fmassa I've implemented multi-GPU training code in my project using only the torch.distributed.all_reduce function. In some cases, a tensor has requires_grad set to True but its grad is None. all_reduce should not be applied to such tensors; I think that is what leads to the error above.
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]
The List[List[at::Tensor]] requirement breaks because a None grad gets involved. So should the torch/nn/parallel/distributed.py package handle the None grad case?
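For now I guard it like this (a sketch; it assumes the default process group is already initialized and that gradients are averaged manually):

```python
import torch.distributed as dist

def allreduce_gradients(model):
    # Skip parameters whose .grad is None (frozen, or unused in this forward pass);
    # reducing a missing gradient is what triggers the type error above.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.requires_grad and param.grad is not None:
            dist.all_reduce(param.grad.data)      # defaults to SUM
            param.grad.data.div_(world_size)      # average across processes
```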
❓ Questions and Help
I tried to use just P2-P4 of the FPN and modified only a few lines of code. The code works well on a single GPU, but when using more than one GPU, the error below is encountered.
The main modification I made is in the forward function of fpn.py.
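Roughly, the change has this shape (a sketch, not the exact fpn.py diff): only the first three pyramid levels are returned, while the deeper blocks stay registered on the module.

```python
import torch.nn as nn

class TruncatedFPN(nn.Module):
    # Wraps a full FPN but only returns its first `num_levels` outputs.
    def __init__(self, full_fpn, num_levels=3):
        super().__init__()
        self.fpn = full_fpn           # all levels remain registered as parameters
        self.num_levels = num_levels

    def forward(self, x):
        outputs = self.fpn(x)
        # Drop the deeper levels; their parameters never receive gradients,
        # so their .grad stays None after backward.
        return outputs[: self.num_levels]
```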