eldar / pose-tensorflow

Human Pose estimation with TensorFlow framework
GNU Lesser General Public License v3.0

Conv_3 bank? #100

Open jbohnslav opened 5 years ago

jbohnslav commented 5 years ago

Hi there,

Thanks for the great work. I asked this question over at DeepLabCut, but I figured I should ask it again here.

I'm not used to TensorFlow, so it could be due to my inability to read tf code, but I'm confused about the prediction layers in the model. In both the DeeperCut paper and DeepLabCut paper, the authors describe using a ResNet base followed by 2x upsampling with deconvolution layers. Then, the authors "connect the final output to the output of the conv3 bank."

In the code, features are extracted with the net_funcs imported from tf.slim: resnet_v1.resnet_v1_50 and resnet_v1.resnet_v1_101. Because of the atrous convolutions and the lack of global average pooling (etc.), I think the features should be of shape (N, H/16, W/16, 2048).
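To make the shape estimate concrete, here is a minimal sketch of the stride bookkeeping in plain Python. It is not code from the repository: `apply_output_stride` and `feature_shape` are hypothetical helpers, ResNet is simplified to five stride-2 reductions (nominal stride 32), and the 320x480 input is an arbitrary example. The idea, as far as I understand it, is that once the requested output stride is reached, later stages keep the resolution and use dilation instead.

```python
def apply_output_stride(stage_strides, output_stride):
    """Return (plan, total_stride, final_rate) for a stack of strided stages.

    plan holds one (stride, dilation_rate) pair per stage.
    """
    current_stride = 1
    rate = 1
    plan = []
    for stride in stage_strides:
        if current_stride == output_stride:
            # Stride budget is used up: run this stage at stride 1 and
            # grow the dilation rate for the layers that follow instead.
            plan.append((1, rate))
            rate *= stride
        else:
            plan.append((stride, 1))
            current_stride *= stride
    return plan, current_stride, rate


def feature_shape(n, height, width, total_stride, channels=2048):
    """Shape of the final feature map, using ceiling division for H and W."""
    return (n, -(-height // total_stride), -(-width // total_stride), channels)


# Simplification: model ResNet as five stride-2 reductions (nominal stride 32).
plan, total_stride, final_rate = apply_output_stride([2, 2, 2, 2, 2], output_stride=16)
print(plan, total_stride, final_rate)
print(feature_shape(1, 320, 480, total_stride))
```

Under these assumptions the last reduction is replaced by dilation, so the total stride stays at 16 and the channel count is that of the final bottleneck (2048 for ResNet-50 and ResNet-101).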

They are then, I believe, passed to the following prediction layer:

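import tensorflow as tf
import tensorflow.contrib.slim as slim


def prediction_layer(cfg, input, name, num_outputs):
    # No activation or normalization: the layer emits raw prediction maps.
    with slim.arg_scope([slim.conv2d, slim.conv2d_transpose], padding='SAME',
                        activation_fn=None, normalizer_fn=None,
                        weights_regularizer=slim.l2_regularizer(cfg.weight_decay)):
        with tf.variable_scope(name):
            # A single stride-2 transposed convolution (2x upsampling).
            pred = slim.conv2d_transpose(input, num_outputs,
                                         kernel_size=[3, 3], stride=2,
                                         scope='block4')
            return pred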

This just means that the output of the ResNet is passed into a deconvolution layer, without any connection to conv3. Did I miss it somewhere?
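For reference, the spatial effect of that single deconvolution can be sketched in plain Python. This is only shape arithmetic, not the repository's code: `deconv_output_length` and `prediction_layer_shape` are hypothetical helpers, and `num_outputs=14` is an arbitrary example value. It relies on the usual rule that a transposed convolution with 'SAME' padding multiplies the spatial size by the stride.

```python
def deconv_output_length(input_length, kernel_size, stride, padding):
    """Output length of a transposed convolution along one spatial axis."""
    if padding == "SAME":
        return input_length * stride
    if padding == "VALID":
        return input_length * stride + max(kernel_size - stride, 0)
    raise ValueError("unsupported padding: %r" % (padding,))


def prediction_layer_shape(features, num_outputs, kernel_size=3, stride=2,
                           padding="SAME"):
    """Shape produced by one transposed convolution on NHWC features."""
    n, height, width, _ = features
    out_h = deconv_output_length(height, kernel_size, stride, padding)
    out_w = deconv_output_length(width, kernel_size, stride, padding)
    return (n, out_h, out_w, num_outputs)


# Features at 1/16 of a 320x480 input; 14 output maps is an arbitrary choice.
print(prediction_layer_shape((1, 20, 30, 2048), num_outputs=14))
```

With the settings quoted above (3x3 kernel, stride 2, 'SAME' padding), features at 1/16 of the input resolution come out at 1/8, and only the ResNet output feeds into the result.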

Both papers use the phrasing "connected to", so I'm not sure whether the connection is supposed to be concatenate + conv2d or addition, or whether it is made to the upsampled features or to the original features. I expected the prediction layer to look something like this (in pseudocode):

upsampled_features = conv2d_transpose(features)
outputs = conv2d(concatenate(upsampled_features, conv3))
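To spell out the two readings of "connected to", here is a shape-only sketch in plain Python. Everything in it is an assumption made for illustration: the function names are hypothetical, "conv3" is taken to mean a feature map at 1/8 resolution with 512 channels (as in the original ResNet layout), and neither variant is claimed to be what the papers or the repository actually do.

```python
def concat_then_conv_shapes(upsampled, conv3, num_outputs):
    """Option A: concatenate along channels, then a conv2d maps to num_outputs."""
    if upsampled[:3] != conv3[:3]:
        raise ValueError("batch/spatial dims differ: %r vs %r" % (upsampled, conv3))
    n, height, width, channels = upsampled
    concatenated = (n, height, width, channels + conv3[3])
    return concatenated, (n, height, width, num_outputs)


def project_then_add_shapes(upsampled, conv3):
    """Option B: a 1x1 conv projects conv3 to matching channels, then add."""
    if upsampled[:3] != conv3[:3]:
        raise ValueError("batch/spatial dims differ: %r vs %r" % (upsampled, conv3))
    projected = conv3[:3] + (upsampled[3],)
    return projected, upsampled


# Assumed shapes for a 320x480 input: upsampled predictions and conv3 at 1/8.
upsampled = (1, 40, 60, 14)
conv3 = (1, 40, 60, 512)
print(concat_then_conv_shapes(upsampled, conv3, num_outputs=14))
print(project_then_add_shapes(upsampled, conv3))
```

Under these assumptions, either option only lines up spatially with the upsampled features (1/8 resolution); passing the original 1/16 features raises a `ValueError`, which is one reason the wording in the papers is ambiguous without the code.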