Please show me the evaluation snippet and the data.
I generated a small example from my dataset:
z_true = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'E-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'E-LOC', 'O', 'S-LOC', 'O', 'S-LOC', 'O', 'O', 'O', 'O', 'O', 'O'],
['O', 'O', 'O', 'O', 'O', 'O', 'S-LOC', 'S-LOC', 'O', 'O', 'O', 'O', 'O'],
['O', 'O', 'B-LOC', 'I-LOC', 'E-LOC', 'O', 'S-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
['O', 'B-LOC', 'E-LOC', 'O', 'O', 'B-LOC', 'E-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
z_pred = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'E-LOC', 'B-LOC', 'I-LOC', 'E-LOC', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'E-LOC', 'O', 'S-LOC', 'O', 'S-LOC', 'O', 'O', 'O', 'O', 'O', 'O'],
['O', 'O', 'O', 'O', 'O', 'O', 'S-LOC', 'S-LOC', 'B-LOC', 'I-LOC', 'E-LOC', 'O', 'O'],
['O', 'S-LOC', 'O', 'O', 'O', 'O', 'S-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
['O', 'O', 'O', 'B-LOC', 'E-LOC', 'B-LOC', 'E-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
scheme = IOBES
average = "micro"
evaluate(z_true, z_pred, scheme, average)
The results I get:
0.6666666666666666 0.8 0.7272727272727272
              precision    recall  f1-score   support
         LOC      0.667     0.800     0.727        10
   micro avg      0.667     0.800     0.727        10
   macro avg      0.667     0.800     0.727        10
weighted avg      0.667     0.800     0.727        10
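For reference, running this snippet assumes the scheme classes and metric functions are imported from seqeval; a minimal sketch of the imports I have at the top of the file:
from seqeval.metrics import classification_report, f1_score, precision_score, recall_score
from seqeval.scheme import BILOU, IOBES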
When I change the scheme to BILOU, using the same example with the labels converted to BILOU tags:
z_true = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'L-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'U-LOC', 'O', 'U-LOC', 'O', 'O', 'O', 'O', 'O', 'O'],
['O', 'O', 'O', 'O', 'O', 'O', 'U-LOC', 'U-LOC', 'O', 'O', 'O', 'O', 'O'],
['O', 'O', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'U-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
['O', 'B-LOC', 'L-LOC', 'O', 'O', 'B-LOC', 'L-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
z_pred = [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'L-LOC', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'U-LOC', 'O', 'U-LOC', 'O', 'O', 'O', 'O', 'O', 'O'],
['O', 'O', 'O', 'O', 'O', 'O', 'U-LOC', 'U-LOC', 'B-LOC', 'I-LOC', 'L-LOC', 'O', 'O'],
['O', 'U-LOC', 'O', 'O', 'O', 'O', 'U-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
['O', 'O', 'O', 'B-LOC', 'L-LOC', 'B-LOC', 'L-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
scheme = BILOU
average = "micro"
evaluate(z_true, z_pred, scheme, average)
I get the same P, R, and F1, but the classification report is different. I'm using micro average with both schemes:
0.6666666666666666 0.8 0.7272727272727272
              precision    recall  f1-score   support
         LOC      0.625     0.556     0.588         9
   micro avg      0.625     0.556     0.588         9
   macro avg      0.625     0.556     0.588         9
weighted avg      0.625     0.556     0.588         9
This is the evaluate function that uses seqeval:
def evaluate(y_true, y_pred, scheme, average):
    print(precision_score(y_true, y_pred, average=average, mode='strict', scheme=scheme), end='\t')
    print(recall_score(y_true, y_pred, average=average, mode='strict', scheme=scheme), end='\t')
    print(f1_score(y_true, y_pred, average=average, mode='strict', scheme=scheme))
    print(classification_report(y_true, y_pred, digits=3))
You just forgot to specify mode and scheme to classification_report. If they're specified correctly, the result is the same:
def evaluate(y_true, y_pred, scheme, average):
    print(precision_score(y_true, y_pred, average=average, mode='strict', scheme=scheme), end='\t')
    print(recall_score(y_true, y_pred, average=average, mode='strict', scheme=scheme), end='\t')
    print(f1_score(y_true, y_pred, average=average, mode='strict', scheme=scheme))
    print(classification_report(y_true, y_pred, digits=3, mode='strict', scheme=scheme))
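Calling it on the IOBES-tagged lists and then on the BILOU-tagged lists gives matching numbers for both schemes. A sketch of the two calls; z_true_bilou / z_pred_bilou are just placeholder names for your second pair of lists, since your snippet reassigned z_true / z_pred:
evaluate(z_true, z_pred, IOBES, "micro")              # IOBES-tagged example
evaluate(z_true_bilou, z_pred_bilou, BILOU, "micro")  # BILOU-tagged example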
# IOBES
0.6666666666666666 0.8 0.7272727272727272
              precision    recall  f1-score   support
         LOC      0.667     0.800     0.727        10
   micro avg      0.667     0.800     0.727        10
   macro avg      0.667     0.800     0.727        10
weighted avg      0.667     0.800     0.727        10
# BILOU
0.6666666666666666 0.8 0.7272727272727272
              precision    recall  f1-score   support
         LOC      0.667     0.800     0.727        10
   micro avg      0.667     0.800     0.727        10
   macro avg      0.667     0.800     0.727        10
weighted avg      0.667     0.800     0.727        10
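For background on why the report differed only in your original runs: when mode and scheme are not passed, classification_report falls back to seqeval's default, lenient (CoNLL-style) chunking. As far as I can tell, that default recognizes IOBES's E-/S- prefixes but has no special handling for BILOU's L-/U-, so, for example, two adjacent U-LOC tokens can be merged into a single entity, which is why the support dropped from 10 to 9 in your BILOU report. A rough way to see this with seqeval's get_entities helper (exact behavior may differ slightly between versions):
from seqeval.metrics.sequence_labeling import get_entities

# Adjacent single-token entities: the default chunker treats each S- tag as its
# own chunk, but the two U- tags may be merged into a single span.
print(get_entities(['O', 'S-LOC', 'S-LOC', 'O']))
print(get_entities(['O', 'U-LOC', 'U-LOC', 'O']))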
I compared the results of the same model on the same test data with both the IOBES and BILOU schemes. I get exactly the same precision, recall, and F1 scores, which is what I expect. However, I get different classification reports, as shown below. Any explanation for this?
[screenshots of the BILOU and IOBES classification reports]