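Detailed documentation can be found here. To install the API, simply run

```bash
pip install eaas
```

To use the API, go through the following two steps.

- **Step 1**: Load the default configuration and adjust it if needed.

```python
from eaas import Config
config = Config()

# List the supported metrics.
print(config.metrics())

# Inspect the default configuration of a metric, e.g. BLEU.
print(config.bleu.to_dict())

# Modify a setting, then print the configuration again to confirm the change.
config.bleu.set_property("smooth_method", "floor")
print(config.bleu.to_dict())
```
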
- **Step 2**: Initialize the client and send your inputs. Please send your inputs as a whole (a list of many dicts) instead of sending one sample at a time (which will be much slower).
```python
from eaas import Client
client = Client()
client.load_config(config) # The config you have created above
# To use this API for scoring, format your input as a list of dictionaries.
# Each dictionary has `source` (string, optional), `references` (list of strings, optional)
# and `hypothesis` (string, required). Whether `source` and `references` are needed
# depends on the metrics you want to use.
# Please do not preprocess `source`, `references` or `hypothesis`.
# We expect normal-cased, detokenized text. The metrics handle all preprocessing themselves.
# Below is a simple example.
inputs = [{"source": "This is the source.",
"references": ["This is the reference one.", "This is the reference two."],
"hypothesis": "This is the generated hypothesis."}]
metrics = ["bleu", "chrf"]  # Can be None if you want to use all metrics
score_dic = client.score(inputs, task="sum", metrics=metrics, lang="en", cal_attributes=True)
# inputs is a list of dicts, task is the task name (used for calculating attributes),
# metrics is the list of metrics, and lang is the two-letter language code.
# You can also set cal_attributes=False to save time, since some attribute calculations can be slow.
```

The output looks like this:

```python
# sample_level is a list of dicts (one per input sample); corpus_level is a single dict.
{
    'sample_level': [
        {
            'bleu': 32.46679154750991,
            'attr_compression': 0.8333333333333334,
            'attr_copy_len': 2.0,
            'attr_coverage': 0.6666666666666666,
            'attr_density': 1.6666666666666667,
            'attr_hypothesis_len': 6,
            'attr_novelty': 0.6,
            'attr_repetition': 0.0,
            'attr_source_len': 5,
            'chrf': 38.56890099861521
        }
    ],
    'corpus_level': {
        'corpus_bleu': 32.46679154750991,
        'corpus_attr_compression': 0.8333333333333334,
        'corpus_attr_copy_len': 2.0,
        'corpus_attr_coverage': 0.6666666666666666,
        'corpus_attr_density': 1.6666666666666667,
        'corpus_attr_hypothesis_len': 6.0,
        'corpus_attr_novelty': 0.6,
        'corpus_attr_repetition': 0.0,
        'corpus_attr_source_len': 5.0,
        'corpus_chrf': 38.56890099861521
    }
}
```

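The `attr_*` values in this example can be cross-checked by hand. The following is a minimal sketch, not the EaaS implementation: it assumes a simple regex tokenizer that splits off punctuation, greedy extractive fragments (the longest spans of the hypothesis copied from the source) for coverage, density and copy length, bigrams for novelty and trigrams for repetition. The function names `tokenize`, `extractive_fragments` and `summary_attributes`, and the exact repetition formula, are assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Split text into word and punctuation tokens (an assumed tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extractive_fragments(source: List[str], summary: List[str]) -> List[List[str]]:
    """Greedily collect the longest summary spans that also occur in the source."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0
        for j in range(len(source)):
            k = 0
            while (i + k < len(summary) and j + k < len(source)
                   and summary[i + k] == source[j + k]):
                k += 1
            best = max(best, k)
        if best:
            fragments.append(summary[i:i + best])
        i += max(best, 1)  # skip past the fragment, or one token if nothing matched
    return fragments


def summary_attributes(source: str, hypothesis: str) -> Dict[str, float]:
    """Compute reference-free attributes under the assumptions stated above."""
    src, hyp = tokenize(source), tokenize(hypothesis)
    if not hyp:
        raise ValueError("hypothesis must contain at least one token")
    lengths = [len(f) for f in extractive_fragments(src, hyp)]
    src_bigrams = set(ngrams(src, 2))
    hyp_bigrams = ngrams(hyp, 2)
    hyp_trigrams = ngrams(hyp, 3)
    novel = sum(b not in src_bigrams for b in hyp_bigrams)
    repeated = len(hyp_trigrams) - len(set(hyp_trigrams))
    return {
        "source_len": len(src),
        "hypothesis_len": len(hyp),
        "compression": len(src) / len(hyp),
        "coverage": sum(lengths) / len(hyp),
        "density": sum(n * n for n in lengths) / len(hyp),
        "copy_len": sum(lengths) / len(lengths) if lengths else 0.0,
        "novelty": novel / len(hyp_bigrams) if hyp_bigrams else 0.0,
        "repetition": repeated / len(hyp_trigrams) if hyp_trigrams else 0.0,
    }


attrs = summary_attributes("This is the source.", "This is the generated hypothesis.")
print(attrs)
```

For this single input the sketch agrees with the attribute values shown in the output above; on other inputs the service may tokenize or normalize differently, so treat it as an aid to interpretation only.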
Currently, EaaS supports the following metrics:

- `bart_score_cnn_hypo_ref`: BARTScore is a sequence-to-sequence framework based on the pre-trained language model BART. `bart_score_cnn_hypo_ref` uses the BART fine-tuned on CNNDM. It calculates the average generation score of `Score(hypothesis|reference)` and `Score(reference|hypothesis)`.
- `bart_score_summ`: BARTScore using the BART fine-tuned on CNNDM. It calculates `Score(hypothesis|source)`.
- `bart_score_mt`: BARTScore using the BART fine-tuned on Parabank2. It calculates the average generation score of `Score(hypothesis|reference)` and `Score(reference|hypothesis)`.
- `bert_score_p`: BERTScore is a metric designed for evaluating translated text using a BERT-based matching framework. `bert_score_p` calculates the BERTScore precision.
- `bert_score_r`: BERTScore recall.
- `bert_score_f`: BERTScore F-score.
- `bleu`: BLEU measures modified n-gram matches between each candidate translation and the reference translations.
- `chrf`: CHRF measures the character-level n-gram matches between hypothesis and reference.
- `comet`: COMET is a neural framework for training multilingual machine translation evaluation models. `comet` uses the `wmt20-comet-da` checkpoint, which utilizes source, hypothesis and reference.
- `comet_qe`: COMET for quality estimation. `comet_qe` uses the `wmt20-comet-qe-da` checkpoint, which utilizes only source and hypothesis.
- `mover_score`: MoverScore is a metric similar to BERTScore. Unlike BERTScore, which greedily matches tokens by cosine similarity, it aligns contextual embeddings using the Earth Mover’s Distance.
- `prism`: PRISM is a sequence-to-sequence framework trained from scratch. `prism` calculates the average generation score of `Score(hypothesis|reference)` and `Score(reference|hypothesis)`.
- `prism_qe`: PRISM for quality estimation. It calculates `Score(hypothesis|source)`.
- `rouge1`: ROUGE-1 refers to the overlap of unigrams (single words) between the system and reference summaries.
- `rouge2`: ROUGE-2 refers to the overlap of bigrams between the system and reference summaries.
- `rougeL`: ROUGE-L refers to the longest common subsequence between the system and reference summaries.

The `task` option in the `client.score()` function decides which attributes are calculated. Currently, attributes are supported only for the summarization task (`task="sum"`). The following attributes (reference: this paper) are calculated if `cal_attributes` is set to `True` in `client.score()`. They are all reference-free.

- `source_len`: measures the length of the source text.
- `hypothesis_len`: measures the length of the hypothesis text.
- `density` & `coverage`: measure to what extent a summary covers the content in the source text.
- `compression`: measures the compression ratio from the source text to the generated summary.
- `repetition`: measures the rate of repeated segments in summaries. The segments are instantiated as trigrams.
- `novelty`: measures the proportion of segments in the summaries that haven’t appeared in source documents. The segments are instantiated as bigrams.
- `copy_len`: measures the average length of segments in the summary copied from the source document.

We support quick calculation for BLEU and ROUGE (1, 2, L); see the following for usage.

```python
from eaas import Config, Client
config = Config()
client = Client()
client.load_config(config)

# Note that the input format is different from the `score` function.
references = [["This is the reference one for sample one.", "This is the reference two for sample one."],
              ["This is the reference one for sample two.", "This is the reference two for sample two."]]
hypothesis = ["This is the generated hypothesis for sample one.",
              "This is the generated hypothesis for sample two."]

# Calculate BLEU
client.bleu(references, hypothesis, task="sum", lang="en", cal_attributes=False)

# Calculate ROUGEs
client.rouge1(references, hypothesis, task="sum", lang="en", cal_attributes=False)
client.rouge2(references, hypothesis, task="sum", lang="en", cal_attributes=False)
client.rougeL(references, hypothesis, task="sum", lang="en", cal_attributes=False)
```

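Because the quick-calculation functions take parallel lists rather than the list of dicts used by `client.score()`, a small conversion step can be handy when the data is already in the `score` format. The helper below is a sketch only: `split_inputs` is a hypothetical name, not part of `eaas`, and it assumes every sample carries a non-empty `references` list.

```python
from typing import Dict, List, Tuple


def split_inputs(inputs: List[Dict]) -> Tuple[List[List[str]], List[str]]:
    """Turn `score`-style dicts into parallel (references, hypothesis) lists.

    Hypothetical helper, not part of eaas.
    """
    references: List[List[str]] = []
    hypothesis: List[str] = []
    for idx, sample in enumerate(inputs):
        if "hypothesis" not in sample:
            raise ValueError(f"sample {idx} has no 'hypothesis'")
        if not sample.get("references"):
            raise ValueError(f"sample {idx} has no 'references'")
        references.append(list(sample["references"]))
        hypothesis.append(sample["hypothesis"])
    return references, hypothesis


sample_inputs = [
    {"source": "This is the source.",
     "references": ["This is the reference one.", "This is the reference two."],
     "hypothesis": "This is the generated hypothesis."},
]
refs, hyps = split_inputs(sample_inputs)
print(refs)
print(hyps)
```

The resulting `refs` and `hyps` have the same shape as the `references` and `hypothesis` lists in the example above; the `source` field is dropped because BLEU and ROUGE compare the hypothesis only against references.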
Prompts can sometimes improve the performance of certain metrics (see this paper). In our `client.score()` function, we support adding prompts to the source, hypothesis and references, in both prefix and suffix positions. An example is shown below.

```python
from eaas import Config, Client
config = Config()
client = Client()
client.load_config(config)

inputs = [
    {
        "source": "This is the source.",
        "references": ["This is the reference one.", "This is two."],
        "hypothesis": "This is the generated hypothesis."
    }
]

prompt_info = {
    "source": {"prefix": "This is source prefix", "suffix": "This is source suffix"},
    "hypothesis": {"prefix": "This is hypothesis prefix", "suffix": "This is hypothesis suffix"},
    "reference": {"prefix": "This is reference prefix", "suffix": "This is reference suffix"}
}

# Adding this prompt info will automatically turn the inputs into
# [{'source': 'This is source prefix This is the source. This is source suffix',
#   'references': ['This is reference prefix This is the reference one. This is reference suffix', 'This is reference prefix This is two. This is reference suffix'],
#   'hypothesis': 'This is hypothesis prefix This is the generated hypothesis. This is hypothesis suffix'}]

# Here is a simpler example.
# prompt_info = {"source": {"prefix": "This is prefix"}}
score_dic = client.score(inputs, task="sum", metrics=["bart_score_summ"], lang="en", cal_attributes=False, **prompt_info)
```
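
To preview what a given `prompt_info` does to the inputs before sending them, the transformation described in the comments above can be mimicked locally. This is a minimal sketch, assuming that prefix, text and suffix are joined with single spaces and that the `reference` entry applies to every string in `references`; `apply_prompts` is a hypothetical helper, not an `eaas` function, and the service may handle edge cases differently.

```python
from typing import Dict, List


def apply_prompts(inputs: List[Dict], prompt_info: Dict[str, Dict[str, str]]) -> List[Dict]:
    """Return copies of the inputs with prefixes and suffixes attached.

    Hypothetical helper that mirrors the documented behaviour; not part of eaas.
    """
    def wrap(text: str, field: str) -> str:
        spec = prompt_info.get(field, {})
        parts = [spec.get("prefix"), text, spec.get("suffix")]
        return " ".join(p for p in parts if p)  # skip missing prefix/suffix

    prompted = []
    for sample in inputs:
        new_sample = dict(sample)
        if "source" in sample:
            new_sample["source"] = wrap(sample["source"], "source")
        if "references" in sample:
            new_sample["references"] = [wrap(r, "reference") for r in sample["references"]]
        new_sample["hypothesis"] = wrap(sample["hypothesis"], "hypothesis")
        prompted.append(new_sample)
    return prompted


example_inputs = [
    {
        "source": "This is the source.",
        "references": ["This is the reference one.", "This is two."],
        "hypothesis": "This is the generated hypothesis.",
    }
]
example_prompts = {
    "source": {"prefix": "This is source prefix", "suffix": "This is source suffix"},
    "hypothesis": {"prefix": "This is hypothesis prefix", "suffix": "This is hypothesis suffix"},
    "reference": {"prefix": "This is reference prefix", "suffix": "This is reference suffix"},
}
prompted_inputs = apply_prompts(example_inputs, example_prompts)
print(prompted_inputs)
```

The helper leaves the original `inputs` untouched, so the unprompted and prompted versions can be compared side by side.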