Closed DaehanKim closed 7 months ago
Hi there, thanks for your interest.
For pre-training from scratch, my suggestion is to initialize the cost matrix from some well-trained token embeddings. If those embeddings share the same vocabulary as your model, that is ideal: you just compute the distance between each pair of tokens to get the cost matrix. If the vocabularies are not the same, you can at least use the embeddings to initialize the overlapping part.
In theory, any method that produces meaningful distances between tokens can be used to initialize C.
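As a rough illustration (the cosine-distance choice, the partial-overlap fallback, and all names below are my own placeholders, not a fixed recipe), building C from pretrained embeddings could look like this:

```python
# Sketch only: build the cost matrix C from pretrained token embeddings
# using cosine distance. Names and the partial-overlap fallback are
# illustrative assumptions, not the repo's exact recipe.
import torch
import torch.nn.functional as F

def build_cost_matrix(embeddings: torch.Tensor) -> torch.Tensor:
    """embeddings: (vocab_size, dim) pretrained token embeddings."""
    normed = F.normalize(embeddings, dim=-1)
    # C[i, j] = 1 - cos(v_i, v_j); zero on the diagonal, larger for unrelated tokens.
    return (1.0 - normed @ normed.T).clamp(min=0.0)

def init_with_partial_overlap(vocab_size: int,
                              shared_ids: torch.Tensor,
                              shared_embeddings: torch.Tensor) -> torch.Tensor:
    """Fill the overlapping vocabulary block from embeddings, keep the rest uniform."""
    cost = torch.ones(vocab_size, vocab_size)
    cost.fill_diagonal_(0.0)
    sub = build_cost_matrix(shared_embeddings)
    cost[shared_ids.unsqueeze(1), shared_ids.unsqueeze(0)] = sub
    return cost
```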
I haven't personally tried pre-training from scratch. But according to some other researchers who have, the way they use EMO is to first pre-train with the normal MLE loss and then switch to EMO.
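To illustrate that schedule (this is only a sketch: the warm-up cutoff, the HF-style model interface, and the simplified transport-cost loss below are placeholders, not the exact EMO implementation):

```python
# Sketch only: warm up with standard MLE, then switch to an EMO-style loss.
# `expected_transport_cost` is my simplified stand-in: with a one-hot target,
# the earth mover's distance reduces to the expected cost of moving the
# predicted mass onto the target token. It is not the exact loss in this repo.
import torch
import torch.nn.functional as F

def expected_transport_cost(logits, targets, cost_matrix, ignore_index=-100):
    probs = logits.softmax(dim=-1)                    # (batch, seq, vocab)
    mask = targets.ne(ignore_index)
    safe_targets = targets.clamp(min=0)
    # Assumes a symmetric cost matrix; row y gives the cost of moving mass
    # from every vocabulary entry onto the target token y.
    target_costs = cost_matrix[safe_targets]          # (batch, seq, vocab)
    per_token = (probs * target_costs).sum(dim=-1)    # (batch, seq)
    return per_token[mask].mean()

def training_step(model, batch, cost_matrix, step, mle_steps=10_000):
    logits = model(batch["input_ids"]).logits         # (batch, seq, vocab)
    targets = batch["labels"]
    if step < mle_steps:
        # Stage 1: plain MLE pre-training.
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1), ignore_index=-100)
    # Stage 2: switch to the EMO-style objective.
    return expected_transport_cost(logits, targets, cost_matrix)
```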
Best regards.
Thank you for a detailed answer!
Hi, Thank you for sharing this work.
I wonder about the applicability of this method for pretraining from scratch. To do that, one needs to build a good initial cost matrix $\mathbb{C} = [C(v_i, v_j)]_{i,j}$. I can come up with a trivial uniform initialization for this. Is there a better way?
If you have tried this method for pretraining from scratch, it would be really helpful to know the results. Thanks!