chakki-works / seqeval

A Python framework for sequence labeling evaluation (named-entity recognition, POS tagging, etc.)
MIT License

Different Classification Reports for IOBES and BILOU #71

Closed: rsuwaileh closed this issue 3 years ago

rsuwaileh commented 3 years ago

I compared the results of the same model on the same test data with both the IOBES and BILOU schemes. I get exactly the same precision, recall, and F1 scores, which is what I expect:

Precision = 0.6762295081967213
Recall = 0.5045871559633027
F1 = 0.5779334500875658

However, I get different classification reports, as shown below. Any explanation for this?

BILOU:

              precision    recall  f1-score   support

         LOC      0.676     0.505     0.578       327

   micro avg      0.676     0.505     0.578       327
   macro avg      0.676     0.505     0.578       327
weighted avg      0.676     0.505     0.578       327

IOBES:

              precision    recall  f1-score   support

         LOC      0.667     0.503     0.574       314

   micro avg      0.667     0.503     0.574       314
   macro avg      0.667     0.503     0.574       314
weighted avg      0.667     0.503     0.574       314


Hironsan commented 3 years ago

Please show me the evaluation snippet and the data.

rsuwaileh commented 3 years ago

I generated a small example from my dataset:

from seqeval.scheme import IOBES, BILOU

z_true = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'E-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'E-LOC', 'O', 'S-LOC', 'O', 'S-LOC', 'O', 'O', 'O', 'O', 'O', 'O'], 
['O', 'O', 'O', 'O', 'O', 'O', 'S-LOC', 'S-LOC', 'O', 'O', 'O', 'O', 'O'], 
['O', 'O', 'B-LOC', 'I-LOC', 'E-LOC', 'O', 'S-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
['O', 'B-LOC', 'E-LOC', 'O', 'O', 'B-LOC', 'E-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]

z_pred = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'E-LOC', 'B-LOC', 'I-LOC', 'E-LOC', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'E-LOC', 'O', 'S-LOC', 'O', 'S-LOC', 'O', 'O', 'O', 'O', 'O', 'O'], 
['O', 'O', 'O', 'O', 'O', 'O', 'S-LOC', 'S-LOC', 'B-LOC', 'I-LOC', 'E-LOC', 'O', 'O'], 
['O', 'S-LOC', 'O', 'O', 'O', 'O', 'S-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], 
['O', 'O', 'O', 'B-LOC', 'E-LOC', 'B-LOC', 'E-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
scheme = IOBES
average = "micro"
evaluate(z_true, z_pred, scheme, average)

The results I get:

0.6666666666666666  0.8 0.7272727272727272
              precision    recall  f1-score   support

         LOC      0.667     0.800     0.727        10

   micro avg      0.667     0.800     0.727        10
   macro avg      0.667     0.800     0.727        10
weighted avg      0.667     0.800     0.727        10

When I change the scheme to BILOU using the same example and labels as above:

z_true = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'L-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'U-LOC', 'O', 'U-LOC', 'O', 'O', 'O', 'O', 'O', 'O'], 
['O', 'O', 'O', 'O', 'O', 'O', 'U-LOC', 'U-LOC', 'O', 'O', 'O', 'O', 'O'], 
['O', 'O', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'U-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
['O', 'B-LOC', 'L-LOC', 'O', 'O', 'B-LOC', 'L-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]

z_pred = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'L-LOC', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'U-LOC', 'O', 'U-LOC', 'O', 'O', 'O', 'O', 'O', 'O'], 
['O', 'O', 'O', 'O', 'O', 'O', 'U-LOC', 'U-LOC', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'O'], 
['O', 'U-LOC', 'O', 'O', 'O', 'O', 'U-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], 
['O', 'O', 'O', 'B-LOC', 'L-LOC', 'B-LOC', 'L-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
scheme = BILOU
average = "micro"
evaluate(z_true, z_pred, scheme, average)

I get the same precision, recall, and F1, but the classification report is different, even though I'm using micro average with both schemes:

0.6666666666666666  0.8 0.7272727272727272
              precision    recall  f1-score   support

         LOC      0.625     0.556     0.588         9

   micro avg      0.625     0.556     0.588         9
   macro avg      0.625     0.556     0.588         9
weighted avg      0.625     0.556     0.588         9

This is the evaluate function that uses seqeval:

from seqeval.metrics import classification_report, f1_score, precision_score, recall_score

def evaluate(y_true, y_pred, scheme, average):
    print(precision_score(y_true, y_pred, average = average, mode='strict', scheme=scheme), end='\t')
    print(recall_score(y_true, y_pred, average = average, mode='strict', scheme=scheme), end='\t')
    print(f1_score(y_true, y_pred, average = average, mode='strict', scheme=scheme))
    print(classification_report(y_true, y_pred, digits=3))

Hironsan commented 3 years ago

You just forgot to specify mode and scheme for classification_report. If they are specified, the result is the same:

def evaluate(y_true, y_pred, scheme, average):
    print(precision_score(y_true, y_pred, average=average, mode='strict', scheme=scheme), end='\t')
    print(recall_score(y_true, y_pred, average=average, mode='strict', scheme=scheme), end='\t')
    print(f1_score(y_true, y_pred, average=average, mode='strict', scheme=scheme))
    print(classification_report(y_true, y_pred, digits=3, mode='strict', scheme=scheme))  # mode and scheme are now passed here as well

# IOBES
0.6666666666666666      0.8     0.7272727272727272
              precision    recall  f1-score   support

         LOC      0.667     0.800     0.727        10

   micro avg      0.667     0.800     0.727        10
   macro avg      0.667     0.800     0.727        10
weighted avg      0.667     0.800     0.727        10

# BILOU
0.6666666666666666      0.8     0.7272727272727272
              precision    recall  f1-score   support

         LOC      0.667     0.800     0.727        10

   micro avg      0.667     0.800     0.727        10
   macro avg      0.667     0.800     0.727        10
weighted avg      0.667     0.800     0.727        10
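
In other words: when mode and scheme are omitted, classification_report falls back to seqeval's default, scheme-agnostic chunk extraction, which does not appear to decode BILOU's L-/U- prefixes the way strict BILOU decoding does, so entity boundaries and counts can differ (support 9 vs. 10 in the reports above). Below is a minimal sketch of that comparison, assuming the BILOU z_true / z_pred lists from the earlier comment are in scope:

# Minimal sketch: z_true / z_pred are assumed to be the BILOU-tagged lists
# from the comment above (reused here, not redefined).
from seqeval.metrics import classification_report
from seqeval.scheme import BILOU

# Default mode: scheme-agnostic chunking; BILOU-specific prefixes are not
# decoded strictly, so the entity counts (support) can come out lower.
print(classification_report(z_true, z_pred, digits=3))

# Strict mode with the matching scheme: entities are decoded exactly as BILOU
# defines them, and the report matches the IOBES one.
print(classification_report(z_true, z_pred, digits=3, mode='strict', scheme=BILOU))

In short, pass the same mode and scheme to classification_report that you pass to precision_score, recall_score, and f1_score.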