Improve evaluation - Githubissues

Status	Type	⚠️ Core Change	Issue
Ready	Feature	No

Summary

Added create_dataset: a function for creating a DatasetWithEntities from sentences and ground truth.
Added token-based evaluation.

Use span-based evaluation or token-based evaluation

Span-based (default): each token's BIO tag is considered, and an entity counts as correct only if every token in its span is recognized with the corresponding BIO tag.

Token-based: only the B- tag is considered for each token, and scores are measured at the token level.

In the following example, the SMXM linker extracts York as LOC, but not the full span New New York. The span-based F1 is therefore 0.0, because the span was not fully recognized. The token-based F1, on the other hand, is 0.5: precision is 1.0 and recall is 0.3333 (one token correctly extracted out of three).

Example

import spacy
from zshot import PipelineConfig
from zshot.linker import LinkerSMXM

from zshot.evaluation.metrics.seqeval.seqeval import Seqeval
from zshot.evaluation.dataset.dataset import create_dataset
from zshot.evaluation.zshot_evaluate import evaluate, prettify_evaluate_report
from zshot.utils.data_models import Entity

ENTITIES = [
    Entity(name="FAC", description="A facility"),
    Entity(name="LOC", description="A location"),
]
sentences = ["New New York is beautiful"]
gt = [["B-LOC", "I-LOC", "I-LOC", "O", "O"]]

dataset = create_dataset(gt, sentences, ENTITIES)

nlp = spacy.blank("en")
nlp_config = PipelineConfig(
    linker=LinkerSMXM(),
    entities=ENTITIES
)

nlp.add_pipe("zshot", config=nlp_config, last=True)

# Span-based evaluation (the default mode)
evaluation = evaluate(nlp, dataset, metric=Seqeval())
tables = prettify_evaluate_report(evaluation, show_full_report=False)
print("F1-Macro span-based:", evaluation['linker']['overall_f1_macro'])

# Token-based evaluation
evaluation = evaluate(nlp, dataset, metric=Seqeval(), mode='token')
tables = prettify_evaluate_report(evaluation, show_full_report=False)
print("F1-Macro token-based:", evaluation['linker']['overall_f1_macro'])

> Map:   0%|          | 0/1 [00:00<?, ? examples/s]
> F1-Macro span-based: 0.0
> Map:   0%|          | 0/1 [00:00<?, ? examples/s]
> F1-Macro token-based: 0.5
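
The two scores above can be checked by hand. The following is a minimal sketch in plain Python that does not use zshot or seqeval. It assumes the linker's prediction is York tagged as B-LOC and every other token tagged O, and it scores only the LOC class. The helper names (extract_spans, prf, span_scores, token_scores) are hypothetical and exist only for illustration; they are not part of the zshot API, and the actual implementation may differ.

```python
def extract_spans(tags):
    """Return the set of (start, end, label) spans in a BIO tag sequence.

    Simplification for this sketch: an I- tag that does not continue a
    span of the same label is ignored rather than opening a new span.
    """
    spans = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" closes an open span
        continues_span = tag.startswith("I-") and label == tag[2:]
        if continues_span:
            continue
        if label is not None:
            spans.add((start, i, label))
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def prf(n_correct, n_pred, n_gold):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def span_scores(gold, pred):
    """A predicted span is correct only if it matches a gold span exactly."""
    g, p = extract_spans(gold), extract_spans(pred)
    return prf(len(g & p), len(p), len(g))


def token_scores(gold, pred):
    """Each non-O token is scored on its own, ignoring the B-/I- prefix."""
    g = {(i, t[2:]) for i, t in enumerate(gold) if t != "O"}
    p = {(i, t[2:]) for i, t in enumerate(pred) if t != "O"}
    return prf(len(g & p), len(p), len(g))


# Tokens: New / New / York / is / beautiful
gold = ["B-LOC", "I-LOC", "I-LOC", "O", "O"]
pred = ["O", "O", "B-LOC", "O", "O"]  # assumed linker output: only "York"

print("span  P/R/F1: %.4f %.4f %.4f" % span_scores(gold, pred))
print("token P/R/F1: %.4f %.4f %.4f" % token_scores(gold, pred))
```

Under these assumptions the span-based scores are all 0.0, since the predicted span (York) does not match the gold span (New New York), while the token-based scores are precision 1.0, recall 0.3333 and F1 0.5, matching the report above.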

IBM / zshot

Improve evaluation #62
