Y-IAB / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Add metrics for translation and summarization (BLEU, ROUGE, BLEURT, BERTSCORE, COMETKIWI, Perplexity) #5

Closed myeongho-jeong-yanolja closed 6 months ago

myeongho-jeong-yanolja commented 6 months ago
  1. Add various metrics such as ROUGE, BERTSCORE, BLEURT, etc.
    Reference-based metrics
    • BLEU, ROUGE - N-gram based matching
    • BERTSCORE - Embedding similarity with BERT
    • BLEURT - Trained to predict human judgments for evaluation
    Reference-free metrics
    • COMETKIWI - Sentence-level / word-level comparison of the prediction against the source with an LM
    • Perplexity - how well a model predicts our domain text sequences
  2. Make metric calculation efficient (compute scores in the aggregation step - currently the metric is called for every sample!)
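The aggregation idea in item 2 can be sketched as follows: instead of invoking the metric once per sample, accumulate predictions and references and compute a single corpus-level score. The function names here (`ngrams`, `corpus_ngram_precision`) are illustrative, not part of the harness API, and the snippet uses a simplified BLEU-style clipped n-gram precision rather than full BLEU.

```python
# Hedged sketch: corpus-level n-gram precision computed once in an
# aggregation step. Names are hypothetical, not the lm-evaluation-harness API.
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_ngram_precision(predictions, references, n=2):
    # Accumulate clipped n-gram matches over the whole corpus,
    # then divide once, instead of scoring each sample separately.
    matched, total = 0, 0
    for pred, ref in zip(predictions, references):
        pred_counts = ngrams(pred.split(), n)
        ref_counts = ngrams(ref.split(), n)
        matched += sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        total += sum(pred_counts.values())
    return matched / total if total else 0.0

preds = ["the cat sat on the mat", "a dog barked"]
refs = ["the cat sat on a mat", "the dog barked"]
print(corpus_ngram_precision(preds, refs, n=2))
```

Corpus-level aggregation also matters for correctness, not just speed: BLEU in particular is defined over the whole corpus, and averaging per-sample scores gives a different number.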

I'll update README.md soon.
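For the perplexity metric listed above, a minimal sketch from per-token log-probabilities (the log-prob values below are made up for illustration; in practice they would come from the model):

```python
import math

# Hedged sketch: perplexity over a domain text sequence, i.e. the
# exponential of the negative mean per-token log-likelihood.
def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

token_logprobs = [-0.5, -1.2, -0.3, -2.0]  # hypothetical model outputs
print(perplexity(token_logprobs))  # lower is better
```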

seungduk-yanolja commented 6 months ago

Please update the PR title to briefly describe which metrics are added, e.g., "Add metrics for translation and summarization".