Major Changes

metrics.semantics.SemanticMetric type is added, with 2 implementations: BertScore and BartScore
Two new basic metrics are added:
- metrics.basics.ExactMatchMetric: Flattens all the references (in the multi-reference scenario) and performs an exact match
- metrics.basics.ConfusionMatrix: Generates confusion matrix for the provided set of labels (obtained from both references and predictions) using scikit-learn
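To make the two basic metrics concrete, here is a minimal pure-Python sketch of the behaviour described above. It is not evalem's implementation: the helper names (flatten_pairs, exact_match, confusion_counts) are hypothetical, the assumption that "flattening" means pairing a prediction with each of its references is one reading of the description, and the confusion matrix is counted by hand here rather than with scikit-learn.

```python
from typing import List, Sequence, Tuple, Union

# A reference is either a single string or several acceptable strings.
Reference = Union[str, Sequence[str]]


def flatten_pairs(
    predictions: Sequence[str],
    references: Sequence[Reference],
) -> List[Tuple[str, str]]:
    """Pair each prediction with every one of its references (assumed behaviour)."""
    pairs: List[Tuple[str, str]] = []
    for pred, refs in zip(predictions, references):
        # A bare string is treated as a single reference.
        if isinstance(refs, str):
            refs = [refs]
        pairs.extend((pred, ref) for ref in refs)
    return pairs


def exact_match(
    predictions: Sequence[str],
    references: Sequence[Reference],
) -> float:
    """Fraction of flattened (prediction, reference) pairs that are identical."""
    pairs = flatten_pairs(predictions, references)
    if not pairs:
        return 0.0
    return sum(pred == ref for pred, ref in pairs) / len(pairs)


def confusion_counts(
    predictions: Sequence[str],
    references: Sequence[Reference],
) -> Tuple[List[str], List[List[int]]]:
    """Count (reference, prediction) pairs over the union of observed labels.

    Rows are reference labels and columns are predicted labels, which is
    the same orientation scikit-learn uses.
    """
    pairs = flatten_pairs(predictions, references)
    labels = sorted({p for p, _ in pairs} | {r for _, r in pairs})
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for pred, ref in pairs:
        matrix[index[ref]][index[pred]] += 1
    return labels, matrix


predictions = ["paris", "london"]
references = [["paris", "Paris"], "berlin"]

score = exact_match(predictions, references)
labels, matrix = confusion_counts(predictions, references)
```

Under this reading, a prediction with several references contributes one pair per reference, so a multi-reference item can count as partly matched.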
Initial unit tests (based on pytest) are added at tests/. For now, they are very basic and only cover the format_to_jury function.

Minor Changes

models.defaults.DefaultQAModelWrapper now accepts the device param to run on CPU or GPU
models._base.HFPipelineWrapper has a pipeline property that returns the pipeline
evalem.misc.datasets.get_squad_v2(...) utility function is added to load the SQuAD v2 dataset

Usages


from evalem.structures import (
    PredictionDTO,
    ReferenceDTO,
    EvaluationDTO,
    PredictionInstance,
    ReferenceInstance
)

from evalem.metrics import (
    Metric,
    AccuracyMetric,
    PrecisionMetric,
    RecallMetric,
    F1Metric,
    ConfusionMatrix,
    ExactMatchMetric,
    BertScore,
    BartScore,
)

from typing import Iterable, Type, Mapping, Union, List

from evalem.models import (
    DefaultQAModelWrapper,
    HFPipelineWrapper,
    ModelWrapper
)

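from evalem.misc.datasets import get_squad_v2

# Evaluator is used below but was not imported in the original snippet;
# the import path here is an assumption.
from evalem.evaluators import Evaluator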

# wrapped_model = HFPipelineWrapper(
#     pipeline("question-answering"),
# )

wrapped_model = DefaultQAModelWrapper(device="cpu")

def run_pipeline(
    model: ModelWrapper,
    evaluators: Union[Evaluator, Iterable[Evaluator]],
    inputs,
    references
) -> List[Mapping[str, dict]]:
    # Run the model once and reuse the predictions for every evaluator.
    predictions = model(inputs)
    # Allow a single evaluator to be passed without wrapping it in a list.
    evaluators = [evaluators] if not isinstance(evaluators, Iterable) else evaluators
    return [e(predictions=predictions, references=references) for e in evaluators]

data = get_squad_v2("validation", nsamples=100)

evaluators = [
    Evaluator(metrics=[
        AccuracyMetric(),
        ConfusionMatrix(),
        ExactMatchMetric(),
    ]),
    Evaluator(metrics=[
        BertScore(device="mps", model_type="distilbert-base-uncased"),
        BartScore(device="mps")
    ])
]

results = run_pipeline(
    wrapped_model,
    evaluators,
    data["inputs"],
    data["references"]
)
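The control flow of run_pipeline can be tried without installing evalem or downloading a model. The sketch below swaps in plain callables for the model and evaluators. Everything in it is hypothetical (echo_model, accuracy_evaluator, length_evaluator, and the {"name": {"score": ...}} result shape), and it says nothing about what evalem's evaluators actually return.

```python
from collections.abc import Iterable
from typing import Dict, List


def run_pipeline(model, evaluators, inputs, references) -> List[Dict[str, dict]]:
    # Run the model once and reuse the predictions for every evaluator.
    predictions = model(inputs)
    # Accept either one evaluator or an iterable of evaluators.
    if not isinstance(evaluators, Iterable):
        evaluators = [evaluators]
    return [e(predictions=predictions, references=references) for e in evaluators]


def echo_model(inputs):
    """Hypothetical stand-in model: lower-cases each input."""
    return [str(x).lower() for x in inputs]


def accuracy_evaluator(predictions, references):
    """Hypothetical evaluator: fraction of positions where prediction == reference."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": {"score": correct / len(references)}}


def length_evaluator(predictions, references):
    """Hypothetical evaluator: mean character length of the predictions."""
    return {"mean_length": {"score": sum(map(len, predictions)) / len(predictions)}}


inputs = ["Paris", "ROME"]
references = ["paris", "milan"]

# Several evaluators share one set of predictions.
results = run_pipeline(echo_model, [accuracy_evaluator, length_evaluator], inputs, references)

# A single evaluator is wrapped in a list automatically.
single = run_pipeline(echo_model, accuracy_evaluator, inputs, references)
```

The point of the pattern is that the model's forward pass, usually the expensive step, runs once no matter how many evaluators consume its predictions.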

NASA-IMPACT / evalem

[alpha] Implementation of Semantic metrics and more basic metrics #6
