Hi, very glad about your interest in our work. It's a very good question! Actually, we have tried this scheme. The details are:
1) The Up-Down features go into the AoANet refiner (vector size: 2048 -> 1024).
2) They are then concatenated with our VC feature (vector size: 1024 -> 2048).
3) We find that the change in vector size would influence the following architecture, so to minimize the difference from the previous AoANet, we just add an embedding layer on the concatenated feature (vector size: 2048 -> 1024).
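A minimal PyTorch sketch may make those three steps concrete. The `refiner` module, the class name, and the exact shapes below are assumptions for illustration, not the authors' actual code:

```python
import torch
import torch.nn as nn

class RefineThenConcat(nn.Module):
    """Illustrative sketch of the scheme above (names and dims are assumptions)."""

    def __init__(self, refiner, vc_dim=1024, hidden_dim=1024):
        super().__init__()
        # 1) AoANet refiner maps Up-Down features 2048 -> 1024 (placeholder module).
        self.refiner = refiner
        # 3) Extra embedding layer maps the concatenation 2048 -> 1024 so the
        #    rest of the AoANet architecture is unchanged.
        self.fuse_embed = nn.Linear(hidden_dim + vc_dim, hidden_dim)

    def forward(self, bu_feats, vc_feats):
        # bu_feats: (batch, boxes, 2048); vc_feats: (batch, boxes, 1024)
        refined = self.refiner(bu_feats)                 # -> (batch, boxes, 1024)
        # 2) Concatenate the refined features with the VC features: 1024 -> 2048.
        fused = torch.cat([refined, vc_feats], dim=-1)   # -> (batch, boxes, 2048)
        return self.fuse_embed(fused)                    # -> (batch, boxes, 1024)
```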
However, the performance in our experiment is not good (CIDEr: 127.5, B4: 38.9). The probable reason may be that, after the refining layer, the semantic information in the Up-Down features has a gap with our VC.
Moreover, we haven't tried other schemes. If you have any ideas or explorations, you are very welcome to discuss them with us. Let's pursue this great work together :)
Thank you!
Hi @Wangt-CN Sorry for opening this issue again, but I have another question. For the captioning tasks, you applied your features to LSTM-based models (Up-Down and AoANet). Even though AoANet uses some parts of the Transformer, if you look at the decoder part, it is essentially an LSTM that produces a single hidden state and passes it as the query to the encoder multi-head attention (it gets rid of the self-attention in the Transformer decoder), and it still operates over several timesteps. In the last 3 days I've run your concatenated VC features through the vanilla Transformer model (with and without the encoder). With the encoder, I got the same score as using the BU features alone, which is around 1.157 CIDEr with beam size 3. Without the Transformer encoder, the score dropped to around 1.143 CIDEr with beam size 3. Note that the features are reduced to 512 before being input to the Transformer. So my question is: do the features you provided work well with the Transformer model, or only with LSTMs? Did you run any experiments on Transformers that you can share?
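For reference, a rough sketch of the setup described here, assuming the concatenated BU (2048) + VC (1024) region features are projected down to 512 before a vanilla Transformer encoder; the layer counts and names are guesses, not the exact configuration used:

```python
import torch
import torch.nn as nn

class ConcatFeatureTransformerEncoder(nn.Module):
    """Sketch: project concatenated BU + VC features to 512, then self-attend."""

    def __init__(self, concat_dim=2048 + 1024, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.input_proj = nn.Linear(concat_dim, d_model)   # reduce features to 512
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, concat_feats, use_encoder=True, key_padding_mask=None):
        # concat_feats: (batch, boxes, 3072) -> (batch, boxes, 512)
        x = self.input_proj(concat_feats)
        if not use_encoder:
            # "Without the Transformer encoder": the projected features go
            # straight to the captioning decoder.
            return x
        return self.encoder(x, src_key_padding_mask=key_padding_mask)
```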
Hi, thanks for your experiments and for giving us the feedback! I will try to give my opinions on three points and hope they can help you:
1) About the "self-attentional operation" rather than the "transformer": we thought the reason behind it may be that a self-attention operation at the very beginning of the model, close to the features, may "disturb" our VC.
2) About running "without the transformer encoder": actually, it's strange that the performance stays the same after concatenation with our VC. Maybe the following tips may help you improve the performance:
3) About the "transformer without encoder": maybe you can try some other encoding structure rather than just an embedding, I think (see the sketch below for one possibility). However, I don't think this is a very good method.

Thank you very much for your detailed reply. I will investigate further on this. Best regards
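Purely as an illustration of the "other encoding structure rather than just an embedding" suggested in point 3 above, one hypothetical option is a gated fusion block in place of the single linear layer; nothing below comes from the authors' code, and all names and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Hypothetical replacement for the single embedding layer over [refined; VC]."""

    def __init__(self, refined_dim=1024, vc_dim=1024, out_dim=1024):
        super().__init__()
        # Small MLP over the concatenated features instead of one linear layer.
        self.fuse = nn.Sequential(
            nn.Linear(refined_dim + vc_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )
        # Per-dimension gate controlling how much raw VC signal is mixed back in.
        self.gate = nn.Sequential(nn.Linear(refined_dim + vc_dim, out_dim),
                                  nn.Sigmoid())
        self.vc_proj = nn.Linear(vc_dim, out_dim)

    def forward(self, refined_feats, vc_feats):
        concat = torch.cat([refined_feats, vc_feats], dim=-1)
        g = self.gate(concat)
        return (1.0 - g) * self.fuse(concat) + g * self.vc_proj(vc_feats)
```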
Hello. Thanks for your great work! It's really a big contribution to the CV community. You've mentioned that the performance is worse when you feed your concatenated features (BU + VC) to the transformer refining model in AoANet directly. Have you tried refining first (running only BU features through the AoANet refiner), getting the refined features, and then concatenating those refined features with the VC features?
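For clarity, a short sketch of the two orderings this question contrasts; the `refiner` and `embed` modules are placeholders, and the second function is essentially the scheme described at the top of the thread:

```python
import torch

def concat_then_refine(refiner, embed, bu_feats, vc_feats):
    """Feed concatenated BU (2048) + VC (1024) features directly into the refiner."""
    fused = torch.cat([bu_feats, vc_feats], dim=-1)    # (batch, boxes, 3072)
    return refiner(embed(fused))                       # embed to model dim, then refine

def refine_then_concat(refiner, bu_feats, vc_feats):
    """Refine only the BU features first, then concatenate the result with VC."""
    refined = refiner(bu_feats)                        # (batch, boxes, 1024)
    return torch.cat([refined, vc_feats], dim=-1)      # (batch, boxes, 2048)
```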