Dear authors: Thank you for your great work advancing the frontier of language model training.
Learning with a Wasserstein Loss (Frogner et al., NeurIPS 2015; arXiv preprint)
That paper presented the first use of the Wasserstein distance (i.e., the earth mover's distance, EMD) as a loss for supervised learning, and it considered the problem of learning to predict a non-negative measure over a finite set. Language models are essentially solving the same problem: predicting a non-negative measure (a distribution over a finite vocabulary) over a finite set.
In summary, Learning with a Wasserstein Loss used a similar method to solve a similar problem: the core idea and core technique are the same, and the problem is the same in principle.
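To make the shared setup concrete, here is a minimal sketch of a Wasserstein/EMD loss over a finite set, using the standard entropic-regularized (Sinkhorn) approximation in PyTorch. This is illustrative only and is not the algorithm of either paper; the function name and the hyperparameters `eps` and `n_iters` are my own choices.

```python
import torch

def sinkhorn_wasserstein_loss(pred, target, cost, eps=0.1, n_iters=50):
    """Entropic-regularized Wasserstein loss between two distributions
    over a finite set (illustrative sketch, not either paper's algorithm).

    pred, target: (n,) non-negative weights, each summing to 1
    cost:         (n, n) ground-metric cost between the n outcomes
    """
    K = torch.exp(-cost / eps)                    # Gibbs kernel
    u = torch.ones_like(pred)
    v = torch.ones_like(target)
    for _ in range(n_iters):                      # Sinkhorn fixed-point updates
        v = target / (K.t() @ u + 1e-30)
        u = pred / (K @ v + 1e-30)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)    # approximate transport plan
    return torch.sum(plan * cost)                 # transport cost = the loss

# Usage on a toy 5-symbol "vocabulary" with |i - j| as the ground metric:
n = 5
idx = torch.arange(n, dtype=torch.float)
cost = torch.abs(idx[:, None] - idx[None, :])
logits = torch.randn(n, requires_grad=True)
pred = torch.softmax(logits, dim=0)               # model's predicted measure
target = torch.tensor([0., 0., 1., 0., 0.])       # one-hot target measure
loss = sinkhorn_wasserstein_loss(pred, target, cost)
loss.backward()                                   # differentiable end to end
```

Unlike cross-entropy, this loss is sensitive to the ground metric: placing probability mass on outcomes close to the target (under `cost`) is penalized less than placing it on distant ones.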
Nonetheless, your work proposes a tractable and effective upper bound for EMD and verifies EMD's effectiveness in language-model fine-tuning, which is nontrivial and impressive.
Could you please cite Learning with a Wasserstein Loss in the camera-ready version of the paper? I believe that will help readers find related work.
(I originally wanted to post this as a Public Comment on OpenReview, but comments are now closed 😂)