Object detection with multi-level representations generated from deep high-resolution representation learning (HRNetV2h). This is an official implementation for our TPAMI paper "Deep High-Resolution Representation Learning for Visual Recognition". https://arxiv.org/abs/1908.07919
Apache License 2.0
Upsample: differences in paper from implementation #45
In `_make_fuse_layers`, the upsampling is done after the 1×1 convolution. However, in the paper the upsampling is done before:

> If x > r, f_{xr}(R) upsamples the input representation R through bilinear upsampling followed by a 1 × 1 convolution for aligning the number of channels.

Moreover, the paper uses bilinear upsampling, while the implementation uses `mode='nearest'`.
Is there any reason for these two differences?
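On the ordering question, it may help to note that a 1×1 convolution mixes channels pointwise while upsampling mixes only spatial positions, so the two operations commute mathematically: either order yields the same output (up to floating point), but applying the convolution at the lower resolution, as the implementation does, costs fewer FLOPs. A minimal numpy sketch, where `conv1x1` and `upsample_nearest` are illustrative helpers (not functions from this repo):

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W); w: (C_out, C_in) -- pointwise channel mixing,
    # i.e. a bias-free 1x1 convolution
    return np.einsum('oc,chw->ohw', w, x)

def upsample_nearest(x, scale):
    # nearest-neighbor upsampling: repeat each pixel along both spatial axes
    return x.repeat(scale, axis=1).repeat(scale, axis=2)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 2, 2))   # toy low-resolution feature map
w = rng.standard_normal((8, 4))      # toy 1x1 conv weights

a = upsample_nearest(conv1x1(x, w), 2)   # implementation order: conv, then upsample
b = conv1x1(upsample_nearest(x, 2), w)   # paper order: upsample, then conv
print(np.allclose(a, b))  # → True: the two orders commute
```

The same commutation argument applies to bilinear upsampling, since it is also a linear, channel-independent operation; the nearest/bilinear choice changes the output values themselves, not the ordering question.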