abhaydoke09 / Bilinear-CNN-TensorFlow

This is an implementation of Bilinear CNN for fine-grained visual recognition using TensorFlow.

Where are two CNN in your bilinear CNN on tensorflow? #2

Open LogisticFreedom opened 7 years ago

LogisticFreedom commented 7 years ago

Thanks for your code for bilinear CNN, but I have a question. I have read the paper, and I think there are two CNNs in this model, but I can't find them in your code; I only found one VGG-16 network. Did I miss something? Could you explain it in your code? Thank you very much!

abhaydoke09 commented 7 years ago

Nice question @LogisticFreedom. I was also confused by this idea initially. There are two types of Bilinear CNNs:

  1. Symmetric - where the two networks used are identical, e.g. VGG16-VGG16, M-Net-M-Net, AlexNet-AlexNet.
  2. Asymmetric - where the two networks used are not identical, e.g. VGG16-M-Net, M-Net-AlexNet.

This particular implementation is of the symmetric type, VGG16-VGG16. Since the weight initializations for both VGG16 networks are the same, the weight updates will be the same for both, and both networks will have identical weights after every iteration. So instead of declaring two networks we can use a single network and halve the memory usage.

`self.phi_I = tf.einsum('ijkm,ijkn->imn', self.conv5_3, self.conv5_3)` — this line computes the outer product of the conv5_3 output with itself, which is equivalent to having two identical networks.

When you are implementing an asymmetric Bilinear CNN, for example VGG16-M-Net, you will need to define two separate network definitions because the weight initializations will differ between the networks. You will have one network definition for VGG16 and one for M-Net. Then take the outer product of the outputs of their final convolutional layers using `tf.einsum`.
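To make the two cases concrete, here is a minimal sketch of the pooling step alone (NumPy's `einsum` accepts the same subscript string as `tf.einsum`; the shapes below are assumptions: batch 2, a 7x7 spatial grid, 512 channels for VGG16 conv5_3, and a hypothetical 256-channel second network for the asymmetric case):

```python
import numpy as np

batch, h, w, c1, c2 = 2, 7, 7, 512, 256

feat_a = np.random.rand(batch, h, w, c1)  # e.g. VGG16 conv5_3 output
feat_b = np.random.rand(batch, h, w, c2)  # e.g. a second network's final conv output

# Symmetric bilinear pooling: outer product of a feature map with itself,
# summed over the spatial positions (indices j, k).
phi_sym = np.einsum('ijkm,ijkn->imn', feat_a, feat_a)   # shape (2, 512, 512)

# Asymmetric bilinear pooling: the same contraction over two different maps.
phi_asym = np.einsum('ijkm,ijkn->imn', feat_a, feat_b)  # shape (2, 512, 256)

print(phi_sym.shape, phi_asym.shape)
```

Note that in the symmetric case the pooled matrix is symmetric in its last two axes, which is why a single shared network suffices.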

Hope that answers your question. Please let me know if you need more information.

LogisticFreedom commented 7 years ago

@abhaydoke09 Thank you very much for your answer! I have another question: how can I use ResNet to build a bilinear CNN model? I use ResNet101; its output is 2048\*7\*7. I have tried it in Keras, but it doesn't work.

abhaydoke09 commented 7 years ago

What's the output size of the last convolutional layer?

ahmadmobeen commented 7 years ago

If the two identical CNNs have the same weight initialization, the same weight updates, and the same weights after every iteration, then what is the benefit of using a B-CNN instead of a normal CNN architecture? What am I missing?

abhaydoke09 commented 7 years ago

When we take the outer product of the last-layer outputs of these identical networks, we get a matrix of pairwise feature products at every spatial location. The combined form therefore captures pairwise feature interactions across locations. Take a look at slide 6 in http://people.cs.umass.edu/~smaji/presentations/BilinearModelsICCV2015oral.pdf
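A toy illustration of why the outer product adds information even with identical features (the numbers are invented for the example): at one spatial location, the outer product of the feature vector with itself contains every pairwise product `f[m] * f[n]`, i.e. the co-occurrence of every pair of channels at that spot, which a plain (first-order) pooled CNN never forms.

```python
import numpy as np

# A single spatial location with a 3-channel feature vector (made-up values).
f = np.array([1.0, 2.0, 0.5])

# Outer product: entry [m, n] is f[m] * f[n], the interaction of channel m
# with channel n at this location.
pairwise = np.outer(f, f)
print(pairwise)
# Entry [0, 1] is f[0] * f[1] = 2.0.
```

Summing these per-location matrices over the whole feature map gives exactly what the `tf.einsum` line in this repo computes.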

ahmadmobeen commented 7 years ago

According to slide 6, it makes sense that one feature extractor (in this case a CNN) extracts one kind of feature (e.g. part) and the second feature extractor (another CNN) extracts a different kind (e.g. color). But I think this can only be the case with asymmetric CNNs. In the case of symmetric CNNs, the same features are being extracted at each location. What I understand is that during the outer product one matrix is transposed, which is how we get cross products of features from different channels. Is that right?

YanWang2014 commented 6 years ago

Interesting discussion. In the paper *Improved Bilinear Pooling with CNNs*, the authors say symmetric B-CNNs are identical to Second-Order Pooling (O2P).
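That equivalence is easy to verify numerically (this is my reading of the remark, not code from this repo): reshape the feature map to a `(locations, channels)` matrix `X`; the `tf.einsum` bilinear pooling then equals `X^T X`, i.e. the sum over locations of the second-order statistic `x x^T` that O2P computes.

```python
import numpy as np

h, w, c = 7, 7, 512
feat = np.random.rand(1, h, w, c)  # one image's conv feature map

# Bilinear pooling as done in this repo (NumPy stand-in for tf.einsum).
via_einsum = np.einsum('ijkm,ijkn->imn', feat, feat)[0]

# Second-order pooling: rows of X are the per-location feature vectors.
X = feat.reshape(h * w, c)
via_o2p = X.T @ X  # sum over locations of x x^T

print(np.allclose(via_einsum, via_o2p))  # prints True: the two agree
```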

data-scientist-ml1 commented 6 years ago

Hello @abhaydoke09, wonderful work. I have the same question as @ahmadmobeen. Can you please explain how the same network, e.g. VGG16, extracts different features at the same location? Thanks :-)

JUSTDODoDo commented 5 years ago

Hello, I still have a problem. After running the second part of the pipeline, training finishes, but it seems the final model is never saved in the code. Why is the trained model not saved? Can you give me some details?