Y-IAB / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Add metrics for translation and summarization (BLEU, ROUGE, BLEURT, BERTSCORE, COMETKIWI, Perplexity) #5

Closed myeongho-jeong-yanolja closed 6 months ago

myeongho-jeong-yanolja commented 6 months ago
  1. Add various metrics such as ROUGE, BERTSCORE, BLEURT, etc.
    Reference-based metrics
    • BLEU, ROUGE - N-gram based matching
    • BERTSCORE - Embedding similarity with BERT
    • BLEURT - Trained to predict human judgments for evaluation
    Reference-free metrics
    • COMETKIWI - Sentence-level / word-level comparison of the prediction against the source with an LM
    • Perplexity - how well a model predicts our domain text sequences
  2. Make metric calculation efficient (compute scores in the aggregation step - currently the metric is called for every sample!)
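The aggregation idea in item 2 can be sketched as follows: instead of invoking the metric once per sample, accumulate predictions and references and compute a single corpus-level score. The function names here (`ngrams`, `corpus_ngram_precision`) are illustrative, not part of the harness API, and the snippet uses a simplified BLEU-style clipped n-gram precision rather than full BLEU.

```python
# Hedged sketch: corpus-level n-gram precision computed once in an
# aggregation step. Names are hypothetical, not the lm-evaluation-harness API.
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_ngram_precision(predictions, references, n=2):
    # Accumulate clipped n-gram matches over the whole corpus,
    # then divide once, instead of scoring each sample separately.
    matched, total = 0, 0
    for pred, ref in zip(predictions, references):
        pred_counts = ngrams(pred.split(), n)
        ref_counts = ngrams(ref.split(), n)
        matched += sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        total += sum(pred_counts.values())
    return matched / total if total else 0.0

preds = ["the cat sat on the mat", "a dog barked"]
refs = ["the cat sat on a mat", "the dog barked"]
print(corpus_ngram_precision(preds, refs, n=2))
```

Corpus-level aggregation also matters for correctness, not just speed: BLEU in particular is defined over the whole corpus, and averaging per-sample scores gives a different number.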

I'll update README.md soon.
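For the perplexity metric listed above, a minimal sketch from per-token log-probabilities (the log-prob values below are made up for illustration; in practice they would come from the model):

```python
import math

# Hedged sketch: perplexity over a domain text sequence, i.e. the
# exponential of the negative mean per-token log-likelihood.
def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

token_logprobs = [-0.5, -1.2, -0.3, -2.0]  # hypothetical model outputs
print(perplexity(token_logprobs))  # lower is better
```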

seungduk-yanolja commented 6 months ago

Please update the PR title to briefly describe which metrics are added, e.g., "Add metrics for translation and summarization".