# eaas_client (ExpressAI)

Documentation: https://expressai.github.io/autoeval/
License: Apache License 2.0

Evaluation-as-a-Service for NLP




## Usage

Detailed documentation can be found at https://expressai.github.io/autoeval/. To install the client, simply run

```bash
pip install eaas
```

To use the API, you should go through the following two steps.

- **Step 1**: Create a `Config` object and adjust the metric settings you need, e.g. as in the sketch below.
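A minimal sketch of this step, mirroring the quick-calculation example later in this README (nothing beyond `Config` is assumed here):

```python
from eaas import Config

config = Config()  # default settings for the supported metrics (see `config.metrics()` below)
```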

To see the metrics we support, run

```python
print(config.metrics())
```

To see the default configuration of a metric, run

```python
print(config.bleu.to_dict())
```

Here is an example of modifying the default config:

```python
config.bleu.set_property("smooth_method", "floor")
print(config.bleu.to_dict())
```


- **Step 2**: Initialize the client and send your inputs. Please send your inputs as a whole (a list of dicts) rather than one sample at a time, which is much slower.
```python
from eaas import Client
client = Client()
client.load_config(config)  # The config you have created above

# To use this API for scoring, format your input as a list of dictionaries.
# Each dictionary contains `source` (string, optional), `references` (list of strings, optional)
# and `hypothesis` (string, required). Whether `source` and `references` are needed depends on the metrics you use.
# Please do not preprocess `source`, `references` or `hypothesis`:
# we expect normal-cased, detokenized text, and all preprocessing is handled by the metrics themselves.
# Below is a simple example.

inputs = [{"source": "This is the source.", 
           "references": ["This is the reference one.", "This is the reference two."],
           "hypothesis": "This is the generated hypothesis."}]
metrics = ["bleu", "chrf"]  # Can be None to use all supported metrics

score_dic = client.score(inputs, task="sum", metrics=metrics, lang="en", cal_attributes=True)
# `inputs` is a list of dicts, `task` is the task name (used for calculating attributes),
# `metrics` is the list of metric names, and `lang` is the two-letter language code.
# You can also set cal_attributes=False to save time, since some attribute calculations are slow.

```

The output looks like this:

```python
# sample_level is a list of dicts; corpus_level is a dict
{
    'sample_level': [
        {
            'bleu': 32.46679154750991, 
            'attr_compression': 0.8333333333333334, 
            'attr_copy_len': 2.0, 
            'attr_coverage': 0.6666666666666666, 
            'attr_density': 1.6666666666666667, 
            'attr_hypothesis_len': 6, 
            'attr_novelty': 0.6, 
            'attr_repetition': 0.0, 
            'attr_source_len': 5, 
            'chrf': 38.56890099861521
        }
    ], 
    'corpus_level': {
        'corpus_bleu': 32.46679154750991, 
        'corpus_attr_compression': 0.8333333333333334, 
        'corpus_attr_copy_len': 2.0, 
        'corpus_attr_coverage': 0.6666666666666666, 
        'corpus_attr_density': 1.6666666666666667, 
        'corpus_attr_hypothesis_len': 6.0, 
        'corpus_attr_novelty': 0.6, 
        'corpus_attr_repetition': 0.0, 
        'corpus_attr_source_len': 5.0, 
        'corpus_chrf': 38.56890099861521
    }
}
```

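Since `client.score()` returns a plain dict with this structure, individual scores can be read by ordinary indexing. A minimal sketch, reusing the `score_dic` from the example above:

```python
# Corpus-level BLEU and chrF for the whole input list.
print(score_dic["corpus_level"]["corpus_bleu"])
print(score_dic["corpus_level"]["corpus_chrf"])

# Per-sample scores are aligned with the order of `inputs`.
first_sample = score_dic["sample_level"][0]
print(first_sample["bleu"], first_sample["chrf"])
```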
## Supported Metrics

Currently, EaaS supports a range of metrics; run `print(config.metrics())` (see the Usage section above) for the full, up-to-date list.

## Support for Attributes

The `task` option in `client.score()` decides which attributes are calculated. Currently, attributes are only supported for the summarization task (`task="sum"`). If `cal_attributes` is set to `True` in `client.score()`, the attributes shown in the example output above (compression, copy length, coverage, density, hypothesis length, novelty, repetition, source length; reference: this paper) are calculated. They are all reference-free.
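As a small sketch of reading these attributes (reusing the `client` and `inputs` from the scoring example above; the key names are taken from the example output):

```python
# Attributes are reference-free, so they depend only on the source and hypothesis texts.
result = client.score(inputs, task="sum", metrics=["bleu"], lang="en", cal_attributes=True)

print(result["sample_level"][0]["attr_coverage"])      # per-sample attribute
print(result["corpus_level"]["corpus_attr_novelty"])   # corpus-level attribute
```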

## Support for Common Metrics

We support quick calculation of BLEU and ROUGE (1, 2, L); see the following for usage.

```python
from eaas import Config, Client
config = Config()
client = Client()
client.load_config(config) 

# Note that the input format is different from the `score` function. 
references = [["This is the reference one for sample one.", "This is the reference two for sample one."],
              ["This is the reference one for sample two.", "This is the reference two for sample two."]]
hypothesis = ["This is the generated hypothesis for sample one.", 
              "This is the generated hypothesis for sample two."]

# Calculate BLEU
client.bleu(references, hypothesis, task="sum", lang="en", cal_attributes=False)

# Calculate ROUGEs
client.rouge1(references, hypothesis, task="sum", lang="en", cal_attributes=False)
client.rouge2(references, hypothesis, task="sum", lang="en", cal_attributes=False)
client.rougeL(references, hypothesis, task="sum", lang="en", cal_attributes=False)
```

## Support for Prompts

Prompts can sometimes improve the performance of certain metrics (see this paper). The `client.score()` function supports adding prompts to the source/hypothesis/references, in both prefix and suffix positions. An example is shown below.

```python
from eaas import Config, Client
config = Config()
client = Client()
client.load_config(config)

inputs = [
    {
        "source": "This is the source.",
        "references": ["This is the reference one.", "This is two."],
        "hypothesis": "This is the generated hypothesis."
    }
]

prompt_info = {
    "source": {"prefix": "This is source prefix", "suffix": "This is source suffix"},
    "hypothesis": {"prefix": "This is hypothesis prefix", "suffix": "This is hypothesis suffix"},
    "reference": {"prefix": "This is reference prefix", "suffix": "This is reference suffix"}
}

# adding this prompt info will automatically turn the inputs into
# [{'source': 'This is source prefix This is the source. This is source suffix', 
#   'references': ['This is reference prefix This is the reference one. This is reference suffix', 'This is reference prefix This is two. This is reference suffix'], 
#   'hypothesis': 'This is hypothesis prefix This is the generated hypothesis. This is hypothesis suffix'}]

# Here is a simpler example.
# prompt_info = {"source": {"prefix": "This is prefix"}}

score_dic = client.score(inputs, task="sum", metrics=["bart_score_summ"], lang="en", cal_attributes=False, **prompt_info)
```