DRSY / EMO

[ICLR 2024] EMO: Earth Mover Distance Optimization for Auto-Regressive Language Modeling (https://arxiv.org/abs/2310.04691)

Is there a good way to initialize cost matrix when pretraining from-scratch? #11

Closed DaehanKim closed 7 months ago

DaehanKim commented 8 months ago

Hi, thank you for sharing this work.

I wonder about the applicability of this method to pretraining from scratch. To do that, one needs to build a good initial cost matrix $\mathbb{C} = [C(v_i, v_j)]_{i,j}$. I can come up with a trivial uniform initialization, but is there a better way?

If you have ever tried this method for pretraining from scratch, it would be really helpful to know the results. Thanks!

DRSY commented 7 months ago

Hi there, thanks for your interest.

For pre-training from scratch, my suggestion is to initialize the cost matrix from some well-trained token embeddings. If those embeddings share the same vocabulary as your model, that's ideal: you just compute the distance between each pair of tokens to obtain the cost matrix. If the vocabularies differ, you can at least use the embeddings to initialize the entries for tokens in the intersection, and fall back to a uniform cost elsewhere, as in the sketch below.
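For illustration, here is a minimal PyTorch sketch of both cases. The cosine distance and the dict-style vocabularies are just one possible choice, and the function names are illustrative, not from this repo:

```python
import torch
import torch.nn.functional as F

def cost_from_embeddings(emb: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine distance over a (V, d) embedding table -> (V, V) cost matrix."""
    e = F.normalize(emb, dim=-1)
    return 1.0 - e @ e.T

def init_cost_with_intersection(model_vocab, donor_vocab, donor_emb, default=1.0):
    """Uniform cost everywhere, overwritten with embedding-based distances
    for tokens shared between the model's vocabulary and the donor's.

    model_vocab / donor_vocab: dicts mapping token string -> index.
    donor_emb: (V_donor, d) pretrained embedding table.
    """
    V = len(model_vocab)
    C = torch.full((V, V), default)
    shared = [(i, donor_vocab[t]) for t, i in model_vocab.items() if t in donor_vocab]
    if shared:
        mi, di = (torch.tensor(ix) for ix in zip(*shared))
        donor_C = cost_from_embeddings(donor_emb)
        C[mi[:, None], mi[None, :]] = donor_C[di[:, None], di[None, :]]
    C.fill_diagonal_(0.0)  # a token is always at zero distance from itself
    return C
```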

In theory, any method that can produce meaningful distance between tokens can be utilized to initialize C.

I didn't personally try pre-training from scratch. But according to some other researchers who tried, the way they use EMO is to first pretrain with the standard MLE loss and then switch to EMO.
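For illustration, a minimal sketch of that two-stage schedule, assuming a hypothetical `emo_loss` callable in place of whatever interface the actual EMO implementation exposes:

```python
import torch
import torch.nn.functional as F

def train_step(model, batch, optimizer, step, mle_warmup_steps, emo_loss, cost_matrix):
    """One optimizer step: plain cross-entropy during warm-up, EMO afterwards.

    `emo_loss` is a hypothetical placeholder for the repo's EMO objective;
    its real name and signature may differ.
    """
    logits = model(batch["input_ids"]).logits          # (B, T, V)
    flat_logits = logits.view(-1, logits.size(-1))     # (B*T, V)
    flat_labels = batch["labels"].view(-1)             # (B*T,)
    if step < mle_warmup_steps:
        loss = F.cross_entropy(flat_logits, flat_labels, ignore_index=-100)
    else:
        loss = emo_loss(flat_logits, flat_labels, cost_matrix)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```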

Best regards.

DaehanKim commented 7 months ago

Thank you for the detailed answer!