a1da4 / paper-survey

Summary of machine learning papers

Reading: MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance #25

Open a1da4 opened 4 years ago

a1da4 commented 4 years ago

0. Paper

@inproceedings{zhao-etal-2019-moverscore,
    title = "{M}over{S}core: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance",
    author = "Zhao, Wei and Peyrard, Maxime and Liu, Fei and Gao, Yang and Meyer, Christian M. and Eger, Steffen",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1053",
    doi = "10.18653/v1/D19-1053",
    pages = "563--578",
}

My literature review slide (in Japanese) [speakerdeck]

1. What is it?

The authors propose a text generation evaluation metric that combines contextualized embeddings with Word Mover's Distance.
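To make the distance part concrete, here is a toy sketch of an earth mover's distance between two sets of word vectors. It is not the paper's implementation: it assumes both texts have the same number of words and that every word carries the same weight, in which case an optimal flow is a one-to-one matching and can be found by brute force. The function names `euclidean` and `toy_mover_distance` and the 2-d vectors are made up for illustration, and real contextualized embeddings are replaced by hand-written lists.

```python
import itertools
import math


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def toy_mover_distance(hyp_vecs, ref_vecs):
    """Earth mover's distance under two simplifying assumptions:
    equal sequence lengths and uniform word weights (1/n each).

    Under these assumptions some optimal transport plan is a permutation,
    so we try every one-to-one matching and keep the cheapest.
    Only practical for very short sequences (n! matchings).
    """
    n = len(hyp_vecs)
    if n == 0 or n != len(ref_vecs):
        raise ValueError("this toy version needs non-empty, equal-length inputs")
    best_total = min(
        sum(euclidean(hyp_vecs[i], ref_vecs[j]) for i, j in enumerate(perm))
        for perm in itertools.permutations(range(n))
    )
    return best_total / n  # each word moves a mass of 1/n


# Hypothetical 2-d "embeddings" for two-word texts.
hyp = [[0.0, 0.0], [1.0, 0.0]]
same_words_reordered = [[1.0, 0.0], [0.0, 0.0]]
shifted = [[0.0, 1.0], [1.0, 1.0]]

d_same = toy_mover_distance(hyp, same_words_reordered)
d_shifted = toy_mover_distance(hyp, shifted)
```

Reordering the same vectors costs nothing, while shifting every vector by one unit costs exactly one unit on average. A general version with unequal lengths or non-uniform weights would need a linear-programming or optimal-transport solver instead of the permutation search.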

2. What is amazing compared to previous studies?

In related work,

In this work, the authors use both contextualized embeddings and Word Mover's Distance.

3. Where is the key to technologies and techniques?

The method is called MoverScore, and the authors study it along four dimensions.

Aggregation

Consolidating layer-wise information is a problem:

A scalar mix of the output layers is effective, but the authors use aggregation functions. Consider a sequence passed through a BERT or ELMo encoder with L layers: each word then has L different vectors, one per layer. An aggregation function maps these L vectors to a single vector.

[Screenshot 2019-10-06 23 06 47: equation defining the aggregation function φ]
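Since the equation itself only survives as a screenshot, the mapping described in the text can be written roughly as follows. This is a reconstruction from the prose, not a copy of the original equation: only φ and L appear in the text, while the symbols z_{i,l} (the layer-l vector of word x_i), E(x_i) (the final word representation), and d (the vector dimension) are my assumptions.

```latex
E(x_i) = \phi\left(z_{i,1}, \dots, z_{i,L}\right),
\qquad z_{i,l} \in \mathbb{R}^{d}
```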

When power means are used, the φ in the above equation can be written as:

[Screenshot 2019-10-06 23 06 59: φ expressed with power means]
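A minimal sketch of power-mean aggregation, assuming the usual element-wise definition `(mean(z ** p)) ** (1 / p)` over the L layer vectors and concatenation of the results for several exponents. The default exponents `(1, +inf, -inf)`, the function names `power_mean` and `aggregate`, and the toy layer vectors are assumptions for illustration, not values taken from the screenshot.

```python
import math


def power_mean(layers, p):
    """Element-wise power mean over L layer vectors of the same dimension.

    layers: list of L vectors (lists of floats) for one word.
    p: exponent; +inf gives the element-wise max, -inf the min,
       1 the arithmetic mean.
    """
    num_layers = len(layers)
    columns = list(zip(*layers))  # one tuple of L values per dimension
    if p == math.inf:
        return [max(col) for col in columns]
    if p == -math.inf:
        return [min(col) for col in columns]
    if p == 1:
        return [sum(col) / num_layers for col in columns]
    # Other finite exponents: this sketch assumes non-negative components.
    return [(sum(v ** p for v in col) / num_layers) ** (1.0 / p) for col in columns]


def aggregate(layers, exponents=(1, math.inf, -math.inf)):
    """Concatenate the power means for each exponent into one vector."""
    out = []
    for p in exponents:
        out.extend(power_mean(layers, p))
    return out


# Hypothetical word with L = 3 layers and d = 2 dimensions.
layers = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]
vec = aggregate(layers)  # length d * len(exponents) = 6
```

With three exponents, a d-dimensional word ends up with a 3d-dimensional representation: the mean, the max, and the min of each coordinate across layers.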

4. How did they validate it?

They evaluated on four tasks: Machine Translation, Summarization, Data-to-text Generation, and Image Captioning. On Image Captioning, MoverScore does not outperform the strong baseline LEIC, but on the other three tasks it does outperform the baselines.

5. Is there a discussion?

For MT, they plot the score distributions of SentBLEU and MoverScore against human judgments.

[Screenshot 2019-10-06 23 15 51: score distributions of SentBLEU and MoverScore against human judgments]

This result shows that MoverScore can clearly distinguish texts of two polar qualities (good or bad).

6. Which paper should I read next?

a1da4 commented 4 years ago

#29 BERTScore: Evaluating Text Generation with BERT

a1da4 commented 4 years ago

#30 Concatenated Power Means