WongKinYiu / yolor

Implementation of paper - You Only Learn One Representation: Unified Network for Multiple Tasks (https://arxiv.org/abs/2105.04206)
GNU General Public License v3.0

ONNX export failed due to invalid output sizes in nn.Upsample #123

Closed wiekern closed 2 years ago

wiekern commented 2 years ago

I get an error when I attempt to convert the torch model to ONNX format. The main error message is shown below:

    result = self._slow_forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 860, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/yolor/models/models.py", line 546, in forward
    return self.forward_once(x)
  File "/yolor/models/models.py", line 607, in forward_once
    x = module(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 887, in _call_impl
    result = self._slow_forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 860, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/upsampling.py", line 141, in forward
    return F.interpolate(input, self.size, self.scale_factor, self.mode, self.align_corners)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 3532, in interpolate
    return torch._C._nn.upsample_nearest2d(input, output_size, scale_factors)
RuntimeError: Input and output sizes should be greater than 0, but got input (H: 10, W: 10) output (H: 0, W: 0)

This error comes from the upsample operator in PyTorch, which is constructed here in models/models.py:

    elif mdef['type'] == 'upsample':
        if ONNX_EXPORT:  # explicitly state size, avoid scale_factor
            print(f"upsample: {mdef['stride']}, yolo_index: {yolo_index}")
            g = (yolo_index + 1) * 2 / 32  # gain
            modules = nn.Upsample(size=tuple(int(x * g) for x in img_size))  # img_size = (320, 192)
        else:
            modules = nn.Upsample(scale_factor=mdef['stride'])
I noticed the flag ONNX_EXPORT in models/models.py, but I have no clue how to use it properly. In my case, the flag is False by default during training, and I have a trained YOLOR_P6 model with a single class instead of the 80 COCO classes. To export to ONNX, I set the flag to True, specified the weights file of the trained model, and started the conversion, which produces the error above. If you look into the source code, the zero output size is produced by the upsample gain, namely the variable g: yolo_index is always -1 when the upsample layer is built, so g = (-1 + 1) * 2 / 32 = 0. I printed mdef['stride'], getting 2, and hardcoded g = 2 to avoid the g = 0 case; however, another error appears (a size mismatch due to a wrong output size; my input image size is 640x640):

  File "/yolor/models/models.py", line 547, in forward
    return self.forward_once(x)
  File "/yolor/models/models.py", line 598, in forward_once
    x = module(x, out)  # WeightedFeatureFusion(), FeatureConcat()
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 887, in _call_impl
    result = self._slow_forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 860, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/yolor/utils/layers.py", line 69, in forward
    return torch.cat([outputs[i] for i in self.layers], 1) if self.multiple else outputs[self.layers[0]]
RuntimeError: Sizes of tensors must match except in dimension 1. Got 20 and 1280 in dimension 2 (The offending index is 1)
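
Working through the arithmetic makes the mismatch plausible (an illustrative sketch only; the 640 input size and the stride-32 feature map are from my setup):

    # With g hardcoded to 2 and img_size = (640, 640), the upsample layer
    # is built with an explicit output size of
    size = tuple(int(x * 2) for x in (640, 640))  # -> (1280, 1280)
    # while the feature map it is concatenated with at stride 32 is only
    # 640 // 32 = 20 pixels per side, hence "Got 20 and 1280 in dimension 2".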

This tells me that I am configuring the output size incorrectly, so could anyone help me solve it? It would be very helpful if the tutorial steps were listed, many thanks!

The conversion script I applied is https://github.com/NNDam/yolor/blob/main/convert_to_onnx.py. Regarding the related defect of upsample/interpolate in PyTorch, I found a fix request at https://github.com/pytorch/pytorch/pull/18875.
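
As a possible direction, here is a minimal sketch of a workaround (hypothetical and untested; ExplicitUpsample is my own name, not part of this repo): derive the output size from the incoming feature map's shape at forward time, so the layer does not depend on yolo_index at all. For a fixed export resolution, the traced sizes become constants in the ONNX graph.

    import torch.nn as nn
    import torch.nn.functional as F

    class ExplicitUpsample(nn.Module):
        # Hypothetical replacement for the ONNX_EXPORT branch of 'upsample':
        # compute the output size from the input itself instead of a gain g.
        def __init__(self, factor):
            super().__init__()
            self.factor = int(factor)  # mdef['stride'], e.g. 2

        def forward(self, x):
            h, w = int(x.shape[-2]), int(x.shape[-1])
            return F.interpolate(x, size=(h * self.factor, w * self.factor),
                                 mode='nearest')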

wiekern commented 2 years ago

If I set ONNX_EXPORT=True and then start training, the same runtime error of invalid output sizes occurs:

RuntimeError: input and output sizes should be greater than 0, but got input (H: 10, W: 10) output (H: 0, W: 0)
wiekern commented 2 years ago

With ONNX_EXPORT=False in models/models.py, the conversion succeeds by running https://github.com/NNDam/yolor/blob/main/convert_to_onnx.py, in which you need to specify your custom arguments for weights, cfg, output, and max_size. The environment is the docker image nvcr.io/nvidia/pytorch:21.03-py3. Steps:

  1. Prepare your model.
  2. Install Docker under Linux.
  3. sudo docker pull nvcr.io/nvidia/pytorch:21.03-py3
  4. sudo docker run -it -v /host/dir/to/be/mounted:/docker/dir/you/can/specify --gpus=1 --rm --net=host nvcr.io/nvidia/pytorch:21.03-py3 (please adjust the -v parameter to your needs; it mounts a local host dir into the docker container, where your conversion script and model file can be located)
  5. Enter the mounted dir and run python convert_to_onnx.py (an example invocation is sketched below).
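
For reference, an example invocation might look like this (the flag names follow the arguments listed above, but check the argparse section of convert_to_onnx.py for the exact spelling; the file names are placeholders):

    python convert_to_onnx.py \
        --weights yolor_p6.pt \
        --cfg cfg/yolor_p6.cfg \
        --output yolor_p6.onnx \
        --max_size 640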