0. Paper
@inproceedings{zhao-etal-2019-moverscore,
title = "{M}over{S}core: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance",
author = "Zhao, Wei and
Peyrard, Maxime and
Liu, Fei and
Gao, Yang and
Meyer, Christian M. and
Eger, Steffen",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-1053",
doi = "10.18653/v1/D19-1053",
pages = "563--578",
}
My literature review slide (in Japanese) [speakerdeck]
1. What is it?
The authors propose a text generation evaluation metric that combines contextualized embeddings (BERT/ELMo) with Word Mover's Distance.
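To make the idea concrete, here is a minimal sketch of my own (not the authors' implementation): embed both texts with a plain bert-base-uncased, take the last hidden layer, weight tokens uniformly, and compare the two bags of token vectors with Earth Mover's Distance via the POT library. The actual MoverScore additionally uses IDF token weights, n-gram embeddings, power-mean layer aggregation, and fine-tuned BERT.

import numpy as np
import torch
import ot  # POT: Python Optimal Transport (pip install pot)
from transformers import AutoModel, AutoTokenizer

# Hypothetical model choice for illustration; the paper fine-tunes BERT before use.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text: str) -> np.ndarray:
    """One contextualized vector per token (last hidden layer only, for brevity)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, hidden_dim)
    return hidden.numpy().astype(np.float64)

def mover_distance(reference: str, hypothesis: str) -> float:
    ref, hyp = embed(reference), embed(hypothesis)
    # Cost of "moving" each reference token to each hypothesis token.
    cost = np.linalg.norm(ref[:, None, :] - hyp[None, :, :], axis=-1)
    # Uniform token weights here; the paper weights tokens by IDF.
    a = np.full(len(ref), 1.0 / len(ref))
    b = np.full(len(hyp), 1.0 / len(hyp))
    return ot.emd2(a, b, cost)  # optimal transport cost = Earth Mover's Distance

# Lower distance = closer to the reference.
print(mover_distance("The cat sat on the mat.", "A cat was sitting on the mat."))
print(mover_distance("The cat sat on the mat.", "Stock prices fell sharply today."))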
2. What is amazing compared to previous studies?
In related work,
one line of research uses contextualized embeddings (BERT or ELMo)
the other uses Word Mover's Distance
In this work, the authors combine both.
3. Where is the key to technologies and techniques?
The method is called MoverScore, and it has four design dimensions:
granularity of embedding: unigrams, bigrams, or all words in a sentence
embedding mechanism: word2vec, BERT, or ELMo
fine-tuning of BERT: NLI datasets (MultiNLI, QNLI) and paraphrase (QQP)
aggregation technique (if needed): power means or a routing mechanism
Layers used for aggregation:
BERT: the last five layers
ELMo: all three layers
Aggregation
Consolidating layer-wise information is a problem:
different layers capture information at disparate scales
task-specific layer-selection methods may be limited
A learned scalar mix of the output layers is known to be effective, but the authors used aggregation functions instead.
Consider a sequence passed through a BERT or ELMo encoder with L layers: each token gets L different vectors.
An aggregation function therefore maps these L vectors to a single vector.
When power means are used, the aggregation function φ can be written as φ(z_1, …, z_L) = ((z_1^p + ⋯ + z_L^p) / L)^(1/p), applied element-wise, where z_l is the token's vector at layer l and p is the power parameter; p = 1 gives the arithmetic mean and p = ±∞ give the element-wise max/min.
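As a toy illustration (my own sketch, not the paper's code), power-mean aggregation of one token's layer-wise vectors could look like the following; the choice of p values is illustrative, and the five layers stand in for the last five BERT layers mentioned above.

import numpy as np

def power_mean(z: np.ndarray, p: float) -> np.ndarray:
    """p-th power mean across the layer axis; p = 1 is the arithmetic mean,
    p = +inf / -inf give the element-wise max / min."""
    if np.isinf(p):
        return z.max(axis=0) if p > 0 else z.min(axis=0)
    return np.mean(z ** p, axis=0) ** (1.0 / p)

# One token's vectors from the last five BERT layers (toy values).
z = np.random.randn(5, 768)
# Aggregate into a single vector; concatenating several p values is one option.
phi = np.concatenate([power_mean(z, p) for p in (1.0, np.inf, -np.inf)])
print(phi.shape)  # (2304,) = 3 * 768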
4. How did they validate it?
They tried four tasks: Machine Translation, Summarization, Data-to-Text Generation, and Image Captioning.
On Image Captioning, MoverScore cannot outperform the strong baseline LEIC, but on the other three tasks it does.
5. Is there a discussion?
For MT, they plot the score distributions of SentBLEU and MoverScore against human judgments.
The plots show that MoverScore can clearly distinguish texts of two polar qualities (good or bad).
6. Which paper should I read next?