AIPHES / emnlp19-moverscore

MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
MIT License

Additional Datasets? #16

Open PierreColombo opened 3 years ago

PierreColombo commented 3 years ago

Hello, thanks for your work! Do you have the code for the two other datasets, BAGEL and SFHOTEL? Kindest regards :)

andyweizhao commented 3 years ago

Hello Pierre,

Thanks a lot for your interest! I'd like to sort out the code for BAGEL and SFHOTEL in my free time. Meanwhile, you could check the datasets in https://github.com/jeknov/EMNLP_17_submission and run MoverScore on them easily :-)
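
For reference, a minimal sketch of scoring such a dataset with MoverScore, following the usage shown in this repo's README (the hypothesis/reference strings below are placeholders; in practice, load the actual BAGEL/SFHOTEL outputs from the repository linked above):

```python
# Minimal sketch (not the official BAGEL/SFHOTEL code): score a list of
# system outputs against references with this repo's MoverScore API.
from moverscore_v2 import get_idf_dict, word_mover_score

# Placeholder data -- replace with the outputs/references from the dataset files.
references = ["there is a cheap restaurant in the city centre"]
hypotheses = ["a cheap restaurant can be found in the centre"]

# IDF dictionaries built from the reference and hypothesis corpora.
idf_dict_ref = get_idf_dict(references)
idf_dict_hyp = get_idf_dict(hypotheses)

# One score per hypothesis-reference pair (instance level).
scores = word_mover_score(references, hypotheses, idf_dict_ref, idf_dict_hyp,
                          stop_words=[], n_gram=1, remove_subwords=True)
print(scores)
```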

PierreColombo commented 3 years ago

Many thanks for your reply :) For the correlations on these datasets, do you use the average of the three annotators' scores, or the median value? (I obtain very low correlations, around 0.04 for quality, for instance.)

PierreColombo commented 3 years ago

Hello, can you help me with COCO? Where did you get the data? The paper mentions 5k images, whereas the val set has 40k images :)

PierreColombo commented 3 years ago

It seems that the official split is no longer available; however, you report results on the official split, for instance for METEOR. How did you sample the 5k images?

andyweizhao commented 3 years ago

@PierreColombo

We follow the setup in the paper to report the results on MSCOCO. First, you need to download the system captions generated by the participating teams from the link. In the MSCOCO leaderboard, you can find system-level human scores, e.g., the M1-M5 scores. Next, you measure metric scores between system and reference captions and average these instance-level scores over all instances to obtain one system-level score per system. Lastly, you compute the system-level correlation between the system-level metric scores and the human scores.
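
For illustration, here is a rough sketch of that system-level computation. The system names, metric scores, and human scores below are hypothetical placeholders, not the actual MSCOCO leaderboard data:

```python
# Sketch of the system-level correlation described above.
import numpy as np
from scipy.stats import pearsonr, spearmanr

# metric_scores[system] = instance-level metric scores on the system's captions.
metric_scores = {
    "team_A": [0.61, 0.58, 0.64],   # hypothetical values
    "team_B": [0.52, 0.49, 0.55],
    "team_C": [0.70, 0.68, 0.71],
}
# human_scores[system] = one leaderboard human score per system (e.g., M1).
human_scores = {"team_A": 0.27, "team_B": 0.20, "team_C": 0.32}  # hypothetical

systems = sorted(metric_scores)
# Average instance-level scores to get one metric score per system.
metric_sys = [np.mean(metric_scores[s]) for s in systems]
human_sys = [human_scores[s] for s in systems]

# System-level correlation between metric and human scores.
print("Pearson:", pearsonr(metric_sys, human_sys)[0])
print("Spearman:", spearmanr(metric_sys, human_sys)[0])
```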

Note that the M1-M5 human scores are reported on the test set; however, we use metric scores computed on the validation set (the data needed to score the test set is not publicly released), which causes a mismatch between the metric and human scores (see the discussion in https://arxiv.org/abs/1806.06422).

Hope this helps.

PierreColombo commented 3 years ago

Hello, thanks for your reply :) But the val set contains 40k images, and your paper mentions 5k. Did you use 5k or the full 40k?

andyweizhao commented 3 years ago

The results are reported on the full validation set (40k images). Apologies for not making this setup clear in the paper.