Open YoungLNB opened 2 years ago
The author answered this earlier in another issue:
I believe it has no impact on training. I multiply by 2 because I want to keep the total weight the same as in plain addition.
In the direct addition case, X + Y is actually 1 * X + 1 * Y, so the weights sum to 2. In the soft selection case, M(X+Y) * X + (1 - M(X+Y)) * Y, the weights sum to 1, so I multiply by 2 to keep the two the same. The only difference between 1 * X + 1 * Y and 2 * M(X+Y) * X + 2 * (1 - M(X+Y)) * Y is then the dynamic weight allocation; the sum of the weights stays the same.
xo = 2 * x * wei + 2 * residual * (1 - wei)
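
To make the weight-sum argument concrete, here is a minimal scalar sketch in plain Python. The helper names `sigmoid` and `soft_fuse` are hypothetical and are not taken from the repository, and the random draw only stands in for the attention output M(X+Y). It checks that the two weights always sum to 2, and that a weight of 0.5 reduces the fusion to plain addition.

```python
import math
import random


def sigmoid(z):
    # Squash a real number into (0, 1), like an attention weight.
    return 1.0 / (1.0 + math.exp(-z))


def soft_fuse(x, residual, wei):
    # Hypothetical scalar version of the fusion line above:
    # the weights 2 * wei and 2 * (1 - wei) always sum to 2.
    return 2 * x * wei + 2 * residual * (1 - wei)


random.seed(0)
x, residual = 3.0, -1.5

for _ in range(3):
    wei = sigmoid(random.gauss(0, 1))  # stand-in for M(X+Y), an assumption
    total_weight = 2 * wei + 2 * (1 - wei)
    assert abs(total_weight - 2.0) < 1e-12

# With wei = 0.5 both inputs get weight 1, i.e. plain addition.
assert soft_fuse(x, residual, 0.5) == x + residual
```

At the extremes, `wei = 1` gives `2 * x` and `wei = 0` gives `2 * residual`, so the factor of 2 keeps the output on the same scale as `x + residual` while the attention weight only shifts the balance between the two inputs.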