HRNet / HRNet-Semantic-Segmentation

This is the official implementation of semantic segmentation for HRNet: https://arxiv.org/abs/1908.07919. The OCR approach is rephrased as Segmentation Transformer: https://arxiv.org/abs/1909.11065.

Only last module in stage should have multi-scale output #115

Open visatish opened 4 years ago

visatish commented 4 years ago

Hi,

I was looking through the codebase and noticed this comment stating that multi-scale outputs are only used for the final module in a stage. (To be honest, I am not 100% sure what multi-scale outputs are; it would be helpful if you could clarify whether they have any meaning in the context of the paper, or whether they are just a bookkeeping convenience in the implementation.) However, this does not seem to hold in the actual implementation: the multi_scale_output arg to _make_stage is always True, so this if-statement will always evaluate to False and reset_multi_scale_output will always be True, regardless of whether or not the module is the last one in the stage.

Could you clarify why this is the case and what exactly reset_multi_scale_output is doing? Maybe I am misunderstanding something.
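To make the question concrete, here is a condensed, pure-Python sketch of the control flow being asked about. The names (num_modules, multi_scale_output, reset_multi_scale_output) follow the repo, but the function body is a stand-in for illustration, not the real layer construction:

```python
# Condensed sketch of the flag logic inside HRNet's _make_stage.
# Only the boolean control flow is reproduced; layer building is omitted.
def make_stage_flags(num_modules, multi_scale_output=True):
    flags = []
    for i in range(num_modules):
        # The code comment says only the last module of a stage should
        # have multi-scale outputs...
        if not multi_scale_output and i == num_modules - 1:
            reset_multi_scale_output = False
        else:
            # ...but with multi_scale_output=True (the segmentation
            # default), this branch is always taken, so every module keeps
            # multi-scale outputs -- the behavior the question observes.
            reset_multi_scale_output = True
        flags.append(reset_multi_scale_output)
    return flags


# With the default, every module gets multi-scale output:
assert make_stage_flags(3, multi_scale_output=True) == [True, True, True]
# Only with multi_scale_output=False does the last module differ:
assert make_stage_flags(3, multi_scale_output=False) == [True, True, False]
```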

Thanks, Vishal

visatish commented 4 years ago

Also, another implementation question: in the paper you state that the fusion from lower resolution to higher resolution is accomplished with bilinear upsampling followed by a 1x1 convolution. However, in the actual implementation the order seems to be reversed, as seen here. Can you clarify this?

sunke123 commented 4 years ago

multi_scale_output = False means only the high-resolution output is returned (adopted in pose estimation). multi_scale_output = True means all four outputs are returned (adopted in segmentation).

We apply the 1x1 convolution before bilinear upsampling, which reduces the computational complexity.
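The saving is easy to see with a back-of-the-envelope multiply-add count for the two orderings. The shapes below (a 2x upsample between adjacent HRNet branches) are illustrative, not taken from a specific config:

```python
# Multiply-add count of a 1x1 convolution: c_in * c_out per output pixel.
def conv1x1_macs(c_in, c_out, h, w):
    return c_in * c_out * h * w


# Illustrative fuse step: 72 -> 18 channels, 32x32 map, 2x bilinear upsample.
c_in, c_out, h, w, scale = 72, 18, 32, 32, 2

# Order used in the code: convolve at low resolution, then upsample.
conv_then_upsample = conv1x1_macs(c_in, c_out, h, w)
# Order stated in the paper text: upsample first, then convolve.
upsample_then_conv = conv1x1_macs(c_in, c_out, h * scale, w * scale)

# Upsampling first multiplies the conv cost by scale**2 (4x here), which is
# why the implementation convolves before upsampling.
assert upsample_then_conv == conv_then_upsample * scale ** 2
```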

visatish commented 4 years ago

Interesting, okay. I was a bit confused because even when multi_scale_output = False, you still combine representations from the other branches in the fuse layer for the single output that is returned. It seems the only real difference is one less layer of "mixing": the other outputs are discarded rather than combined further down the road. Do you find that this extra layer makes a significant difference in performance for each task?

mucunwuxian commented 4 years ago

Thank you very much for your great job!

Is the following understanding correct?

multi_scale_output = False  :  HRNetV1
multi_scale_output = True   :  HRNetV2

reference: image
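The mapping above can be sketched at the shape level, assuming the usual four HRNet branches with resolution halving between them (channel widths below are the HRNet-W18 values, used here only for illustration):

```python
# Shape-level sketch of the two output modes. Tensors are represented only
# by their (channels, height, width) shapes.
def stage_outputs(branch_shapes, multi_scale_output):
    if multi_scale_output:
        # HRNetV2-style: return all four resolutions (segmentation).
        return branch_shapes
    # HRNetV1-style: return only the highest-resolution branch (pose).
    return branch_shapes[:1]


branches = [(18, 64, 64), (36, 32, 32), (72, 16, 16), (144, 8, 8)]
assert stage_outputs(branches, True) == branches           # HRNetV2
assert stage_outputs(branches, False) == [(18, 64, 64)]    # HRNetV1
```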

sunke123 commented 4 years ago

@mucunwuxian Yes.

sunke123 commented 4 years ago

@visatish For pose estimation, we find that combining the 4 representations is slightly better than using only the high-resolution representation (a 0.2-point gain on COCO val).

For semantic segmentation and object detection, combining 4 representations is helpful for handling diverse object scales and more categories.
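For segmentation, "combining the 4 representations" means upsampling the lower-resolution maps to the highest resolution and concatenating along channels, so the head sees all scales at once. A shape-only sketch, again using the illustrative HRNet-W18 widths:

```python
# Sketch of the HRNetV2-style concatenation head, at the shape level only.
def concat_head_shape(branch_shapes):
    # All maps are bilinearly upsampled to the highest resolution...
    target_h, target_w = branch_shapes[0][1], branch_shapes[0][2]
    # ...so concatenation just sums the channel dimensions.
    total_c = sum(c for c, _, _ in branch_shapes)
    return (total_c, target_h, target_w)


branches = [(18, 64, 64), (36, 32, 32), (72, 16, 16), (144, 8, 8)]
# 18 + 36 + 72 + 144 = 270 channels at the highest resolution.
assert concat_head_shape(branches) == (270, 64, 64)
```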

mucunwuxian commented 4 years ago

@sunke123 Thank you for your reply!

I understand now. It's very helpful! 💡