WasifurRahman / BERT_multimodal_transformer


Is scaler very important for the performance? #14

Closed — liuwei1206 closed this issue 3 years ago

liuwei1206 commented 3 years ago

Hi, I found that you use a scaler when fusing the different features, but I didn't see any analysis of it. So I want to know: is the scaler crucial for the performance? [image: screenshot of the fusion code using the scaler]

WasifurRahman commented 3 years ago

I believe it is important for the performance, as it controls how much the lexical vector should be shifted in the presence of the other modalities.


matalvepu commented 3 years ago

That is a very interesting question. We did a hyperparameter search over this scaler, but we did not run any specific experiment to observe its effect. Our experience is that choosing the right value of alpha at the beginning helps you converge faster. Theoretically, the MAG network should be able to adjust its weights to produce an appropriate H_i after a certain number of iterations without the help of the scalar value. However, we put this constraint in to guide the network better: you do not want to shift the pretrained embedding too much, or it will lose its original meaning.
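A minimal sketch of the capped scaling described above (the names `scaled_shift`, `h_m`, and `beta_shift` are assumptions for illustration, not necessarily the repository's exact code):

```python
import torch

def scaled_shift(text_emb, h_m, beta_shift, eps=1e-6):
    """Sketch: add the nonverbal shift vector h_m to the pretrained text
    embedding Z_i, with the shift's magnitude capped relative to the text
    embedding's norm. beta_shift is the scaler discussed in this thread."""
    text_norm = text_emb.norm(p=2, dim=-1, keepdim=True)             # ||Z_i||
    shift_norm = h_m.norm(p=2, dim=-1, keepdim=True).clamp_min(eps)  # ||H_i||
    # Small beta_shift -> alpha near 0 (almost no shift, close to vanilla BERT);
    # large beta_shift -> alpha saturates at 1 (maximal allowed shift).
    alpha = torch.clamp(text_norm / shift_norm * beta_shift, max=1.0)
    return text_emb + alpha * h_m  # shifted embedding stays near Z_i
```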

liuwei1206 commented 3 years ago

> That is a very interesting question. We did a hyperparameter search over this scaler, but we did not run any specific experiment to observe its effect. Our experience is that choosing the right value of alpha at the beginning helps you converge faster. Theoretically, the MAG network should be able to adjust its weights to produce an appropriate H_i after a certain number of iterations without the help of the scalar value. However, we put this constraint in to guide the network better: you do not want to shift the pretrained embedding too much, or it will lose its original meaning.

So which value did you choose for the best performance? Intuitively, if the value is big, it may shift the pretrained network too much; if it is small, it may make no difference compared with the original pretrained network.
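To make this intuition concrete, a toy computation (hypothetical norm values) with the capped alpha from the sketch above:

```python
# Toy numbers showing the two extremes liuwei1206 describes.
z_norm, h_norm = 4.0, 10.0  # hypothetical ||Z_i|| and ||H_i||
for beta_shift in (0.01, 0.5, 3.0):
    alpha = min(z_norm / h_norm * beta_shift, 1.0)
    print(f"beta_shift={beta_shift}: alpha={alpha:.3f}")
# 0.01 -> alpha=0.004: shift vanishes, essentially vanilla BERT
# 0.5  -> alpha=0.200: moderate shift
# 3.0  -> alpha=1.000: cap engaged, maximal allowed shift
```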

liuwei1206 commented 3 years ago

I believe it deserves further analysis!

RE-N-Y commented 3 years ago

@liuwei1206 I will be closing this issue due to inactivity. If you would like to develop further analysis, feel free to re-open it any time!