ChaozhongLiu closed this issue 4 days ago
Hi, in our implementation, zero does not share the same token as the pad; it is processed the same way as non-zero values. However, since the value is 0, the MLP's weight matrix contributes nothing, leaving only the bias vector as the output for the subsequent computation. Therefore, zero embeddings can be regarded as randomly initialized embeddings.
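A minimal sketch of this behavior (not the repo's actual code; `d_model` and `value_embed` are hypothetical names): when a scalar 0 passes through the first linear layer of a value-embedding MLP, the weight term vanishes and only the randomly initialized bias remains.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 4  # hypothetical embedding dimension

# Randomly initialized weight and bias of the MLP's first linear layer.
W = rng.normal(size=(1, d_model))
b = rng.normal(size=(d_model,))

def value_embed(x):
    """First layer of a hypothetical value-embedding MLP: x @ W + b."""
    return x @ W + b

# Input value 0: the weight contribution is 0 * W = 0, so the output
# is exactly the bias vector -- effectively a random embedding.
zero_emb = value_embed(np.zeros((1, 1)))
assert np.allclose(zero_emb, b)
```

So no special zero token is needed: the bias acts as the "randomly initialized embedding" for zero-valued inputs.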
Thanks! That makes sense.
Hi there,
Thanks for the awesome work! As I'm learning the pre-training details, I have one question that I could not resolve from the code currently available in this repo:
In Supplementary Note 8, "Zero value was directly converted into a randomly initialized embedding", but in the corresponding code, only MASK and PAD are specially treated, not the zero values.
So how was this actually implemented? I suppose zero shared the same token as pad? But I don't know what the `mask_token_id` and `pad_token_id` are. Would be great if you could provide any details! Thanks!