0. Paper
@inproceedings{zhao-etal-2019-moverscore,
title = "{M}over{S}core: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance",
author = "Zhao, Wei and
Peyrard, Maxime and
Liu, Fei and
Gao, Yang and
Meyer, Christian M. and
Eger, Steffen",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-1053",
doi = "10.18653/v1/D19-1053",
pages = "563--578",
}
My literature review slide (in Japanese) [speakerdeck]
1. What is it?
The authors propose a text generation evaluation metric that combines contextualized embeddings (BERT/ELMo) with Word Mover's Distance.
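To make the idea concrete, here is a minimal sketch of my own (not the authors' implementation): embed both texts with a plain bert-base-uncased, take the last hidden layer, weight tokens uniformly, and compare the two bags of token vectors with Earth Mover's Distance via the POT library. The actual MoverScore additionally uses IDF token weights, n-gram embeddings, power-mean layer aggregation, and fine-tuned BERT.

import numpy as np
import torch
import ot  # POT: Python Optimal Transport (pip install pot)
from transformers import AutoModel, AutoTokenizer

# Hypothetical model choice for illustration; the paper fine-tunes BERT before use.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text: str) -> np.ndarray:
    """One contextualized vector per token (last hidden layer only, for brevity)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, hidden_dim)
    return hidden.numpy().astype(np.float64)

def mover_distance(reference: str, hypothesis: str) -> float:
    ref, hyp = embed(reference), embed(hypothesis)
    # Cost of "moving" each reference token to each hypothesis token.
    cost = np.linalg.norm(ref[:, None, :] - hyp[None, :, :], axis=-1)
    # Uniform token weights here; the paper weights tokens by IDF.
    a = np.full(len(ref), 1.0 / len(ref))
    b = np.full(len(hyp), 1.0 / len(hyp))
    return ot.emd2(a, b, cost)  # optimal transport cost = Earth Mover's Distance

# Lower distance = closer to the reference.
print(mover_distance("The cat sat on the mat.", "A cat was sitting on the mat."))
print(mover_distance("The cat sat on the mat.", "Stock prices fell sharply today."))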
2. What is amazing compared to previous studies?
In related work,
one line of research uses contextualized embeddings (BERT or ELMo)
the other uses Word Mover's Distance
In this work, the authors combine both.
3. Where is the key to technologies and techniques?
The method is called MoverScore, and it has four design dimensions:
granularity of embedding: unigrams, bigrams, or all words in a sentence
embedding mechanism: word2vec, BERT, or ELMo
fine-tuning of BERT: NLI datasets (MultiNLI, QNLI) and paraphrase (QQP)
aggregation technique (if needed): power means or a routing mechanism
Layers used for aggregation:
BERT: the last five layers
ELMo: all three layers
Aggregation
Consolidating layer-wise information is a problem:
different layers capture information at disparate scales
task-specific layer-selection methods may be limited
A learned scalar mix of the output layers is known to be effective, but the authors used aggregation functions instead.
Consider a sequence passed through a BERT or ELMo encoder with L layers: each token gets L different vectors.
An aggregation function therefore maps these L vectors to a single vector.
When power means are used, the aggregation function φ can be written as φ(z_1, …, z_L) = ((z_1^p + ⋯ + z_L^p) / L)^(1/p), applied element-wise, where z_l is the token's vector at layer l and p is the power parameter; p = 1 gives the arithmetic mean and p = ±∞ give the element-wise max/min.
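As a toy illustration (my own sketch, not the paper's code), power-mean aggregation of one token's layer-wise vectors could look like the following; the choice of p values is illustrative, and the five layers stand in for the last five BERT layers mentioned above.

import numpy as np

def power_mean(z: np.ndarray, p: float) -> np.ndarray:
    """p-th power mean across the layer axis; p = 1 is the arithmetic mean,
    p = +inf / -inf give the element-wise max / min."""
    if np.isinf(p):
        return z.max(axis=0) if p > 0 else z.min(axis=0)
    return np.mean(z ** p, axis=0) ** (1.0 / p)

# One token's vectors from the last five BERT layers (toy values).
z = np.random.randn(5, 768)
# Aggregate into a single vector; concatenating several p values is one option.
phi = np.concatenate([power_mean(z, p) for p in (1.0, np.inf, -np.inf)])
print(phi.shape)  # (2304,) = 3 * 768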
4. How did they validate it?
They tried four tasks: Machine Translation, Summarization, Data-to-Text Generation, and Image Captioning.
On Image Captioning, MoverScore cannot outperform the strong baseline LEIC, but on the other three tasks it does.
5. Is there a discussion?
For MT, they plot the score distributions of SentBLEU and MoverScore against human judgments.
The plots show that MoverScore can clearly distinguish texts of two polar qualities (good or bad).
6. Which paper should I read next?