Closed by JananiKugaa 1 year ago
The direct replace operation is not differentiable, which means the attn_policy produced by our prediction modules would not receive any gradients during training. It is therefore unsuitable for training, because the prediction modules for token division would not learn at all. During inference, however, you can use any operation with identical functionality if you think it is better.
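To make the gradient argument concrete, here is a minimal PyTorch sketch, not the repository's actual code. It assumes "direct replace" means overwriting the attention logits of dropped tokens with `-inf` via `masked_fill`, and it compares that with multiplying the exponentiated logits by the policy. The function names, tensor shapes, and values are hypothetical and chosen only for illustration.

```python
import torch


def softmax_direct_replace(logits, policy):
    # Hypothetical "direct replace": overwrite dropped tokens with -inf.
    # The comparison yields a bool mask, so policy leaves the autograd graph.
    keep = policy > 0.5
    masked = logits.masked_fill(~keep, float("-inf"))
    return torch.softmax(masked, dim=-1)


def softmax_with_policy(logits, policy, eps=1e-6):
    # Hypothetical differentiable variant: scale exp(logits) by the policy,
    # then renormalize. policy stays in the graph through the multiplication.
    shifted = logits - logits.max(dim=-1, keepdim=True).values
    weights = torch.exp(shifted) * policy
    return weights / (weights.sum(dim=-1, keepdim=True) + eps)


def policy_grad(attn_fn):
    # Toy inputs (assumed for illustration): 3 tokens, the last one dropped.
    logits = torch.tensor(
        [[0.1, 0.4, -0.2], [0.3, -0.1, 0.2], [0.0, 0.5, 0.1]],
        requires_grad=True,
    )
    values = torch.tensor([[1.0], [2.0], [4.0]])
    policy = torch.tensor([1.0, 1.0, 0.0], requires_grad=True)

    loss = (attn_fn(logits, policy) @ values).sum()
    # allow_unused=True returns None when policy is not part of the graph.
    (grad,) = torch.autograd.grad(loss, policy, allow_unused=True)
    return grad


if __name__ == "__main__":
    print("direct replace:", policy_grad(softmax_direct_replace))
    print("with policy:   ", policy_grad(softmax_with_policy))
```

In this sketch the replace variant returns no gradient for the policy at all, while the multiplicative variant returns a non-zero one, which is the behavior the prediction modules need in order to learn. At inference time no gradient is required, so the two variants are interchangeable as long as they produce the same attention.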
For stable training