eveningdong / DeepLabV3-Tensorflow

Reimplementation of DeepLabV3

Multigrid block misunderstanding. #13

Closed howard-mahe closed 6 years ago

howard-mahe commented 6 years ago

Hi Nanqing,

First, thanks a lot for your implementation, this is a great piece of work!

I feel like you misunderstood the Multigrid block of the DeepLabV3 network. You create a bottleneck_hdc unit that does:

conv1: conv 1x1, stride=1, rate=rate*multi_grid[0]
conv2: conv 3x3, stride=stride, rate=rate*multi_grid[1]
conv3: conv 1x1, stride=1, rate=rate*multi_grid[2]

and then you repeat the bottleneck_hdc unit 3 times. In a ResNet bottleneck, conv1 is a decreasing projection, conv2 is a 3x3 convolution where the dilation is supposed to happen, and conv3 is an increasing projection. Please note that the projections are 1x1 convolutions, for which dilation_rate doesn't have any effect.
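
For reference, here is a minimal sketch of a standard bottleneck_v1 in the TF-Slim style (simplified: the real resnet_v1.bottleneck also handles identity shortcuts, batch norm, and same-padding), showing that only the middle 3x3 conv receives the dilation rate:

import tensorflow as tf
slim = tf.contrib.slim

def bottleneck(inputs, depth, depth_bottleneck, stride, rate=1, scope=None):
  with tf.variable_scope(scope, 'bottleneck_v1', [inputs]):
    # projection shortcut (the real unit uses an identity shortcut when shapes match)
    shortcut = slim.conv2d(inputs, depth, [1, 1], stride=stride,
                           activation_fn=None, scope='shortcut')
    # conv1: 1x1 decreasing projection; `rate` would have no effect here
    residual = slim.conv2d(inputs, depth_bottleneck, [1, 1], stride=1, scope='conv1')
    # conv2: the only place where dilation should happen
    # (stride > 1 and rate > 1 cannot be combined; the atrous blocks use stride=1)
    residual = slim.conv2d(residual, depth_bottleneck, [3, 3], stride=stride,
                           rate=rate, scope='conv2')
    # conv3: 1x1 increasing projection, again unaffected by `rate`
    residual = slim.conv2d(residual, depth, [1, 1], stride=1,
                           activation_fn=None, scope='conv3')
    return tf.nn.relu(shortcut + residual)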

What is described in DeepLabV3 for block n, n={4,5,6,7}, is a succession of 3 standard bottleneck_v1 units whose dilation rates are rate*multi_grid[i], with multi_grid=(1,2,1), for each of the 3 units respectively. The corrected code should be:

multi_grid = (1, 2, 1)
D = 512
for r in range(4):
  with tf.variable_scope('block%d' % (r + 4), values=[net]):
    rate = 2 ** (r + 1)  # with output_stride=16: rate = 2, 4, 8, 16 for block4..block7
    for i in range(3):
      with tf.variable_scope('unit_%d' % (i + 1), values=[net]):
        # the dilation applies only inside the unit's 3x3 conv
        net = bottleneck(net, D * 4, D, stride=1, rate=rate * multi_grid[i])

John1231983 commented 6 years ago

Great. I totally agree with you about that. You will have to wait for the author's reply because he may be busy at the moment. To achieve the performance the paper reports, the code needs some changes, including training batch normalization and fine-tuning from a pre-trained model (sketched below); otherwise, you cannot reach the target performance. I used a pre-trained model and got about 74% (the paper reports 75.78%). I am still checking what causes my 1.8% gap.
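
A hedged sketch, in TF-Slim terms, of the two changes mentioned above (all names and values here are illustrative, not the repo's actual code):

import tensorflow as tf
slim = tf.contrib.slim

images = tf.placeholder(tf.float32, [None, 321, 321, 3])

# 1) actually train batch normalization: is_training=True plus its update ops
with slim.arg_scope([slim.batch_norm], is_training=True):
  net = slim.conv2d(images, 64, [3, 3], normalizer_fn=slim.batch_norm)

loss = tf.reduce_mean(net)  # stand-in loss, for illustration only
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
  train_op = tf.train.MomentumOptimizer(0.007, 0.9).minimize(loss)

# 2) fine-tune from a pre-trained checkpoint (path is illustrative)
init_fn = slim.assign_from_checkpoint_fn('resnet_v1_101.ckpt',
                                         slim.get_model_variables(),
                                         ignore_missing_vars=True)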

John1231983 commented 6 years ago

@howard-mahe: I found some other bugs:

  1. The image order must be RGB instead of BGR, because TF-Slim used the RGB order to obtain the pre-trained model (L225):

  channels = tf.split(axis=2, num_or_size_splits=num_channels, value=image)
  for i in range(num_channels):
    channels[i] -= means[i]
  return tf.concat(axis=2, values=channels)

  2. The IMAGE_MEAN must be the ImageNet mean, because the model was trained on ImageNet:

_R_MEAN = 123.68
_G_MEAN = 116.78
_B_MEAN = 103.94

We should use the same pre-processing as TF-Slim did; a sketch follows below. Correct me if I'm wrong.
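
Putting the two points together, a minimal sketch of the intended preprocessing (the path is illustrative; the mean subtraction mirrors TF-Slim's vgg_preprocessing):

import tensorflow as tf

_R_MEAN, _G_MEAN, _B_MEAN = 123.68, 116.78, 103.94

image_path = 'VOC2012/JPEGImages/2007_000032.jpg'  # illustrative path
raw = tf.gfile.FastGFile(image_path, 'rb').read()
image = tf.to_float(tf.image.decode_jpeg(raw, channels=3))  # decoded as RGB
channels = tf.split(axis=2, num_or_size_splits=3, value=image)
for i, mean in enumerate([_R_MEAN, _G_MEAN, _B_MEAN]):  # RGB-ordered means
  channels[i] -= mean
image = tf.concat(axis=2, values=channels)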

howard-mahe commented 6 years ago

  1. Yes. The RGB<->BGR conversion must be removed in voc.py at l.31 and l.38 because: a) tf.gfile.FastGFile.read() (in convert_voc12.py at l.62) loads RGB images, and b) TF-Slim's models have also been trained with RGB images. It's a common mistake among TF-Slim users, since Caffe is based on OpenCV, which loads BGR images.
  2. No. The objective of the preprocessing step is to normalize the input images. I agree that the preprocessing has to produce normalized input images similar to the ones used in pre-training, but employing VOC12's IMAGE_MEAN for VOC12 input images is right: it produces zero-centered input images, which wouldn't be the case with ImageNet's IMAGE_MEAN. (A sketch of computing such a dataset mean follows below.)
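
A hypothetical sketch of computing the fine-tuning dataset's per-channel mean (the helper and variable names are mine, not from the repo):

import numpy as np
from PIL import Image

def dataset_mean(image_paths):
  total = np.zeros(3, dtype=np.float64)
  count = 0
  for path in image_paths:
    img = np.asarray(Image.open(path).convert('RGB'), dtype=np.float64)
    total += img.reshape(-1, 3).sum(axis=0)
    count += img.shape[0] * img.shape[1]
  return total / count  # RGB-ordered per-channel means
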
John1231983 commented 6 years ago

IMAGE_MEAN is not the mean of the VOC dataset; it is the ImageNet mean that the VGG model was trained with. Users keep using it for historical reasons: I have read some threads where this is mentioned, and you can find it in DeepLab v2 or Faster R-CNN.
howard-mahe commented 6 years ago

Ok, thank you for the details! But why use (an alternative) ImageNet mean instead of the VOC mean?

John1231983 commented 6 years ago

Because TF-Slim used the ImageNet mean in the preprocessing to obtain the pre-trained model.

howard-mahe commented 6 years ago

This brings us back to a common question: during fine-tuning, should we use the fine-tuning dataset's image mean or the pre-training dataset's image mean? As I said, I believe (and the FCN authors do too) that the role of preprocessing is to zero-center the input images, so I vote for using the fine-tuning dataset's image mean.

Are you sure (103.94, 116.78, 123.68) is not the VOC11 mean? It looks like it is, according to the FCN repo, which uses the nyud/pascalcontext/siftflow means when fine-tuning on those datasets and also uses (103.94, 116.78, 123.68) as the mean for VOC fine-tuning (link).

John1231983 commented 6 years ago

FCN used a VGG pre-trained model that was trained with that image mean, so you see that value in the FCN case. However, for PASCAL VOC it may differ. Btw, I have tested with two different image means and the performance is not much different, so we can use either image mean; only the image order matters. Let me know if you find any other bugs in this code. I am still 1.7% below the paper's reported result.

eveningdong commented 6 years ago

@howard-mahe @John1231983 Thanks for your suggestions, I will take some time to look at it. I was working on something else the past two months.

eveningdong commented 6 years ago

@howard-mahe Hi, Howard. I think you are right about the multi-grid implementation. Thanks for digging into my crappy code. I will update the code ASAP. Your explanation of the 1x1 projections inspires me a lot.

@John1231983 Hi, John. Yes, tf.gfile.FastGFile.read() reads images as RGB, while IMG_MEAN = np.array((104.00698793, 116.66876762, 122.67891434), dtype=np.float32) corresponds to BGR. I have seen literature that did things like what I did. I am not sure which mode Slim used to train on ImageNet, but after a long training run the network will eventually correct itself, since mean subtraction is only a preprocessing step, not a decisive thing. Checking the preprocessing for VGG input, I saw both preprocessing methods reported as workable.
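
If that BGR-ordered IMG_MEAN is kept while images are read as RGB, a minimal fix (sketched here as a suggestion, not the repo's actual code) is to reverse its channel order:

import numpy as np

IMG_MEAN_BGR = np.array((104.00698793, 116.66876762, 122.67891434), dtype=np.float32)
IMG_MEAN_RGB = IMG_MEAN_BGR[::-1]  # reversed to match RGB-ordered images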

I will rerun the program to update the results; the previous performance seems to have been a combination of luck and overfitting.

John1231983 commented 6 years ago

Happy to see you again. You can also look at my question "Training with Batch normalization". We also discuss some bugs in your code there that explain why your performance is higher than the paper's.

howard-mahe commented 6 years ago

@NanqingD Thanks for your feedback! I also believe that IMG_MEAN doesn't matter that much in the end. @John1231983 spotted the most important bug, about validation being performed on the train set... Great code anyway, and thank you for the upcoming updates!

bhack commented 6 years ago

Yes I think that https://github.com/NanqingD/DeepLabV3-Tensorflow/issues/11 is the most important one.

eveningdong commented 6 years ago

@howard-mahe Hi, sorry to bother you again. Do you mind checking my ASPP implementation? I found that others implement ASPP without a ReLU function. I am not sure about this because the authors didn't release the source code.
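
For reference, here is a minimal sketch of what I mean by ASPP (rates 6/12/18 for output_stride=16, per the paper; an illustration, not my actual code). Note that slim.conv2d applies ReLU by default, which is exactly the point in question; passing activation_fn=None would drop it:

import tensorflow as tf
slim = tf.contrib.slim

def aspp(net, depth=256):
  branches = [slim.conv2d(net, depth, [1, 1], scope='aspp_1x1')]
  for rate in [6, 12, 18]:
    branches.append(slim.conv2d(net, depth, [3, 3], rate=rate,
                                scope='aspp_3x3_rate%d' % rate))
  # image-level feature branch
  pooled = tf.reduce_mean(net, [1, 2], keep_dims=True)
  pooled = slim.conv2d(pooled, depth, [1, 1], scope='aspp_image_level')
  pooled = tf.image.resize_bilinear(pooled, tf.shape(net)[1:3])
  branches.append(pooled)
  return slim.conv2d(tf.concat(branches, axis=3), depth, [1, 1],
                     scope='aspp_projection')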

bhack commented 6 years ago

If somebody is interested: https://github.com/tensorflow/models/blob/master/research/deeplab/README.md

John1231983 commented 6 years ago

@bhack: We have waited a long time for this. Thanks so much for the info.