Closed by liuwei1206 3 years ago
I believe it is important for performance, as it controls how much the lexical vector should be shifted in the presence of other modalities.
On Fri, Dec 25, 2020, 2:23 AM liuwei1206 notifications@github.com wrote:
Hi, I found that you use a scaler when fusing different features, but didn't see any analysis of it. So I want to know: is the scaler crucial for performance? [screenshot: https://user-images.githubusercontent.com/34615810/103125058-da018880-46c4-11eb-9743-381c069ceda4.png]
That is a very interesting question. We did a hyperparameter search on this scalar, but we did not run any specific experiment to isolate its effect. In our experience, choosing the right value of alpha at the beginning helps the model converge faster. Theoretically, the MAG network should be able to adjust its weights to produce an appropriate H_i after a certain number of iterations without the help of the scalar, but we added this constraint to guide the network better: you do not want to shift the pretrained embedding too much, or it will lose its original meaning.
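For illustration, here is a minimal sketch of a MAG-style scaled shift in NumPy. The function name `scaled_shift` and the exact form of the gate are assumptions for this example (the repository's implementation may compute alpha differently): the idea is that alpha caps the norm of the nonverbal shift H relative to the norm of the lexical embedding Z, with beta playing the role of the tunable scalar discussed here.

```python
import numpy as np

def scaled_shift(Z, H, beta=1.0, eps=1e-6):
    """Hypothetical sketch of a MAG-style gated shift.

    Z : lexical (pretrained) embeddings, shape (..., d)
    H : nonverbal shift vectors, shape (..., d)
    beta : the scalar hyperparameter discussed in this thread
    """
    # alpha bounds the shift: it scales H down whenever its norm would
    # exceed beta times the norm of Z, so the pretrained embedding is
    # never moved too far from its original position.
    z_norm = np.linalg.norm(Z, axis=-1, keepdims=True)
    h_norm = np.linalg.norm(H, axis=-1, keepdims=True)
    alpha = np.minimum(z_norm / (h_norm + eps) * beta, 1.0)
    return Z + alpha * H
```

With beta small, the output stays close to Z (the pretrained embedding dominates); with beta large, alpha saturates at 1 and the raw shift H is applied, which matches the intuition in the question below about the two extremes.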
So which value did you choose for the best performance? Intuitively, if the value is too large, it may shift the pretrained representations too much; if it is too small, it may make no difference compared with the original pretrained network.
I believe it deserves further analysis!
@liuwei1206 Will be closing the issue due to inactivity. If you would like to develop further analysis, feel free to re-open it any time!
Hi, I found that you use a scaler when fusing different features, but didn't see any analysis of it. So I want to know: is the scaler crucial for performance?