HRNet / HRNet-Semantic-Segmentation

The OCR approach is rephrased as Segmentation Transformer: https://arxiv.org/abs/1909.11065. This is an official implementation of semantic segmentation for HRNet. https://arxiv.org/abs/1908.07919

How is multi-scale fusion performed? #229

Open kadattack opened 3 years ago

kadattack commented 3 years ago

How exactly is multi-scale fusion performed when combining all the outputs from the different branches into one? I am asking about the process that happens AFTER the strided convolution and the upscaling are performed to bring them all to the same size. Does it do a simple element-wise sum of all the outputs into one? Or does it concatenate the outputs into different channels?

StuvX commented 3 years ago

You can see this in the forward pass of the HighResolutionNet module. After interpolation upsizing, the resulting arrays are concatenated and then passed through the last_layer submodule, which consists of:

```python
self.last_layer = nn.Sequential(
    nn.Conv2d(
        in_channels=last_inp_channels,
        out_channels=last_inp_channels,
        kernel_size=1,
        stride=1,
        padding=0),
    BatchNorm2d(last_inp_channels, momentum=BN_MOMENTUM),
    nn.ReLU(inplace=relu_inplace),
    nn.Conv2d(
        in_channels=last_inp_channels,
        out_channels=config["arch"]["num_classes"],
        kernel_size=extra["FINAL_CONV_KERNEL"],
        stride=1,
        padding=1 if extra["FINAL_CONV_KERNEL"] == 3 else 0)
)
```

There's a final interpolation to enforce that the output size = input size.
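To make the concatenate-then-head flow above concrete, here is a minimal, self-contained PyTorch sketch. The channel counts, number of classes, and input size are illustrative placeholders, not HRNet's actual configuration; the real code also reads these from its config.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative values, not HRNet's actual config.
branch_channels = [18, 36, 72, 144]       # per-branch channel counts (example)
num_classes = 19                          # e.g. Cityscapes
last_inp_channels = sum(branch_channels)  # 270 channels after concatenation

# A head in the same spirit as last_layer: 1x1 conv + BN + ReLU,
# then a conv projecting to num_classes.
last_layer = nn.Sequential(
    nn.Conv2d(last_inp_channels, last_inp_channels, kernel_size=1),
    nn.BatchNorm2d(last_inp_channels),
    nn.ReLU(inplace=True),
    nn.Conv2d(last_inp_channels, num_classes, kernel_size=1),
)

# Four branch outputs at successively halved resolutions for a 64x64 input.
x = [torch.randn(1, c, 64 // 2**i, 64 // 2**i)
     for i, c in enumerate(branch_channels)]

# Upsample every branch to the highest resolution, then concatenate channels.
h, w = x[0].shape[2:]
x_up = [x[0]] + [F.interpolate(t, size=(h, w), mode="bilinear",
                               align_corners=False) for t in x[1:]]
fused = torch.cat(x_up, dim=1)   # (1, 270, 64, 64)
logits = last_layer(fused)       # (1, 19, 64, 64)

# Final interpolation back to the input size (already 64x64 here,
# shown for completeness).
out = F.interpolate(logits, size=(64, 64), mode="bilinear",
                    align_corners=False)
print(out.shape)  # torch.Size([1, 19, 64, 64])
```

The key point for the question: at this final head the branches are concatenated along the channel dimension, which is why last_inp_channels is the sum of the per-branch channel counts.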

kadattack commented 3 years ago

> You can see this in the forward pass of the HighResolutionNet module. After interpolation upsizing the resulting arrays are concatenated, and then passed through the last_layer submodule that consists of:
>
> ```python
> self.last_layer = nn.Sequential(
>     nn.Conv2d(
>         in_channels=last_inp_channels,
>         out_channels=last_inp_channels,
>         kernel_size=1,
>         stride=1,
>         padding=0),
>     BatchNorm2d(last_inp_channels, momentum=BN_MOMENTUM),
>     nn.ReLU(inplace=relu_inplace),
>     nn.Conv2d(
>         in_channels=last_inp_channels,
>         out_channels=config["arch"]["num_classes"],
>         kernel_size=extra["FINAL_CONV_KERNEL"],
>         stride=1,
>         padding=1 if extra["FINAL_CONV_KERNEL"] == 3 else 0)
> )
> ```
>
> There's a final interpolation to enforce that the output size = input size.

I'm very new to AI and PyTorch, but isn't this the code for the final output of the whole HRNet? I'm not sure we are thinking about the same thing. Just to reconfirm, I'm talking about the merge process that happens throughout the whole net.

From my understanding this is done in the function https://github.com/HRNet/HRNet-Semantic-Segmentation/blob/f9fb1ba66ff8aea29d833b885f08df64e62c2b23/lib/models/hrnet.py#L207, however I'm still not good enough to understand what happens at the end of the forward() function https://github.com/HRNet/HRNet-Semantic-Segmentation/blob/f9fb1ba66ff8aea29d833b885f08df64e62c2b23/lib/models/hrnet.py#L277 It looks like it's adding the layers up with addition? Am I looking at the wrong part of the code?
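For what it's worth, the repeated `y = y + ...` pattern that loop uses is indeed an element-wise sum: after each branch has been mapped to the target branch's resolution and channel count, the aligned tensors are added together rather than concatenated. A stripped-down sketch of that pattern (my own simplification, not the repo's code; `fuse_to_branch` and `align` are hypothetical stand-ins for the fuse_layers machinery, and the alignment here is just bilinear resizing for illustration):

```python
import torch
import torch.nn.functional as F

def fuse_to_branch(i, branch_outputs, align):
    """Fuse all branches into branch i's resolution by element-wise summation.

    `align(i, j, x)` stands in for fuse_layers[i][j]: any op that maps
    branch j's output to branch i's channels and spatial size (in the real
    model, strided convs going down, upsample + 1x1 conv going up).
    """
    y = branch_outputs[i]
    for j, x_j in enumerate(branch_outputs):
        if j != i:
            y = y + align(i, j, x_j)  # element-wise sum, one branch at a time
    return y

# Toy example: two branches with equal channel counts, where "aligning"
# just means bilinear resizing to branch i's spatial size.
sizes = [(32, 32), (16, 16)]
branches = [torch.randn(1, 8, *s) for s in sizes]

def align(i, j, x):
    return F.interpolate(x, size=sizes[i], mode="bilinear",
                         align_corners=False)

fused_hi = fuse_to_branch(0, branches, align)
print(fused_hi.shape)  # torch.Size([1, 8, 32, 32])
```

So the two mechanisms coexist in HRNet-style models: summation when exchanging information between branches inside the network, concatenation once at the final segmentation head.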