Wangt-CN / VC-R-CNN

[CVPR 2020] The official PyTorch implementation of "Visual Commonsense R-CNN"
MIT License
352 stars 61 forks

transformer refining #1

Closed fawazsammani closed 4 years ago

fawazsammani commented 4 years ago

Hello. Thanks for your great work! It's really a big contribution to the CV community. You've mentioned that the performance is worse when you feed your concatenated features (BU + VC) to the transformer refining model in AoANet directly. Have you tried refining first (running only BU features through the AoANet refiner), getting the refined features, and then concatenating those refined features with the VC features?

Wangt-CN commented 4 years ago

Hi, very glad about your interest in our work. It's a very good question! Actually, we have tried this scheme. The details are: 1) feed the Up-Down feature into the AoANet refiner (vector size: 2048 -> 1024); 2) concatenate our VC feature onto it (vector size: 1024 -> 2048); 3) since the change in vector size would affect the following architecture, to minimize the difference from the original AoA we just add an embedding layer on the concatenated feature (vector size: 2048 -> 1024).
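A minimal PyTorch sketch of that refine-then-concatenate pipeline might look like the following. The module names are illustrative (not from the repo), and the `refiner` here is just a linear stand-in for the AoANet self-attention refiner:

```python
import torch
import torch.nn as nn

class RefineThenConcat(nn.Module):
    """Sketch: refine BU features, concatenate VC features, re-embed to 1024-d."""
    def __init__(self, bu_dim=2048, vc_dim=1024, hidden=1024):
        super().__init__()
        # Stand-in for the AoANet refiner (the real one is a self-attention stack)
        self.refiner = nn.Sequential(nn.Linear(bu_dim, hidden), nn.ReLU())
        # Extra embedding layer so the downstream decoder still sees 1024-d features
        self.embed = nn.Sequential(nn.Linear(hidden + vc_dim, hidden), nn.ReLU())

    def forward(self, bu_feats, vc_feats):
        refined = self.refiner(bu_feats)                # (N, 36, 2048) -> (N, 36, 1024)
        fused = torch.cat([refined, vc_feats], dim=-1)  # -> (N, 36, 2048)
        return self.embed(fused)                        # -> (N, 36, 1024)

# Example with 36 region features per image
feats = RefineThenConcat()(torch.randn(2, 36, 2048), torch.randn(2, 36, 1024))
print(feats.shape)  # torch.Size([2, 36, 1024])
```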

However, the performance in our experiment was not good (CIDEr: 127.5, B4: 38.9). The probable reason is that, after the refine layer, the semantic information in the Up-Down feature may have a gap with our VC feature.

Moreover, we haven't tried other schemes. If you have some ideas or do some exploration, you are very welcome to discuss them with us. Let's pursue this great work together :)

fawazsammani commented 4 years ago

Thank you!

fawazsammani commented 4 years ago

Hi @Wangt-CN Sorry for opening this issue again, but I have another question. For the captioning tasks, you applied your features to LSTM-based models (Up-Down and AoANet). Even though AoANet borrows some parts from the Transformer, if you look at the decoder it is essentially an LSTM: it produces a single hidden state and passes it as the query to the encoder's multi-head attention (it gets rid of the self-attention in the Transformer decoder), and it still operates over several timesteps. In the last 3 days I've run your concatenated VC features through the vanilla Transformer model (with and without the encoder). With the encoder, I got the same score as using the BU features alone, which is around 1.157 CIDEr with beam size 3. Without the Transformer encoder, the score dropped to around 1.143 CIDEr with beam size 3. Note that the features are reduced to 512 dimensions before being input to the Transformer. So my question is: do the features you provided work well with the Transformer model, or only with LSTMs? Did you run any experiments on Transformers that you can share?
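For reference, the setup is roughly like the sketch below (illustrative only; the dimensions follow the description above, and the module names are my own): concatenate the BU and VC region features, project them to d_model = 512, and run them through a vanilla Transformer encoder before decoding.

```python
import torch
import torch.nn as nn

class ConcatProjectEncode(nn.Module):
    """Sketch: concat BU + VC features, project to 512-d, run a vanilla Transformer encoder."""
    def __init__(self, bu_dim=2048, vc_dim=1024, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(bu_dim + vc_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, bu_feats, vc_feats):
        x = torch.cat([bu_feats, vc_feats], dim=-1)  # (N, 36, 3072)
        x = self.proj(x)                             # (N, 36, 512)
        return self.encoder(x)                       # memory for the Transformer decoder

mem = ConcatProjectEncode()(torch.randn(2, 36, 2048), torch.randn(2, 36, 1024))
print(mem.shape)  # torch.Size([2, 36, 512])
```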

Wangt-CN commented 4 years ago

Hi, thanks for running these experiments and giving us the feedback! I will try to give my opinions from three aspects and hope they can help you:

  1. We did NOT run our feature on the vanilla Transformer model for image captioning; that is why in our paper we use a self-attention operation rather than a full Transformer. We think the underlying reason may be that a self-attention operation placed at the beginning of the model, close to the features, can "disturb" our VC.
  2. Yes, you are right, the AoANet decoder is essentially an LSTM structure with multi-head attention. Therefore, the probable reason is that our VC may not be suitable for the Transformer structure when fused by direct concatenation. Since the Transformer decoder is quite similar to the Transformer encoder, discarding the encoder does not seem to help.
  3. Moreover, does the performance of BU and BU+VC still stay the same without the Transformer encoder? It is actually strange that the performance stays the same after concatenation with our VC. The following tips may help improve performance:
    • Before inputting the features into the Transformer, you can try increasing the embedding size, for example to 1024 (512 may be a little small).
    • I am not sure about your current implementation of the Transformer without the encoder. You could try some other encoding structure rather than just an embedding (although I do not think this is a very good approach).
    • You can try some other feature fusion operation rather than just concatenation (see the sketch after this list). I think we should not blame the model (the Transformer structure), but rather reflect on how our feature is used. I will also try some other feature fusion methods in my spare time; if I get some results, I will tell you :)
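As one concrete example of a fusion operation other than concatenation, here is a small gated-fusion sketch. It is illustrative only; the gating design and names are assumptions, not something from the paper:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Sketch: project BU and VC to a shared size and mix them with a learned sigmoid gate."""
    def __init__(self, bu_dim=2048, vc_dim=1024, hidden=1024):
        super().__init__()
        self.bu_proj = nn.Linear(bu_dim, hidden)
        self.vc_proj = nn.Linear(vc_dim, hidden)
        self.gate = nn.Linear(2 * hidden, hidden)

    def forward(self, bu_feats, vc_feats):
        bu = self.bu_proj(bu_feats)  # (N, 36, 1024)
        vc = self.vc_proj(vc_feats)  # (N, 36, 1024)
        g = torch.sigmoid(self.gate(torch.cat([bu, vc], dim=-1)))
        return g * bu + (1.0 - g) * vc  # gated mix, same size as either projected input

fused = GatedFusion()(torch.randn(2, 36, 2048), torch.randn(2, 36, 1024))
print(fused.shape)  # torch.Size([2, 36, 1024])
```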
fawazsammani commented 4 years ago

Thank you very much for your detailed reply. I will investigate this further. Best regards