crux82 / ganbert

Enhancing the BERT training with Semi-supervised Generative Adversarial Networks
Apache License 2.0

Are the evaluation metrics computed correctly? #14

Closed by gballardin 2 years ago

gballardin commented 3 years ago

The README shows output examples in which, for each model {BERT, GANBERT}, the values of {eval_accuracy, eval_f1_micro, eval_precision, eval_recall} are all identical. For example, in the case of BERT:

eval_accuracy = 0.136 
eval_f1_macro = 0.010410878 
eval_f1_micro = 0.136 
eval_loss = 3.7638452 
eval_precision = 0.136 
eval_recall = 0.136 

I ran your model with sh run_experiment.sh and got numerically different results, but the same equality across {eval_accuracy, eval_f1_micro, eval_precision, eval_recall} persists within each model. For example, for GANBERT I get:

eval_accuracy = 0.514 
eval_f1_macro = 0.15001474 
eval_f1_micro = 0.514 
eval_loss = 2.1689985 
eval_precision = 0.514 
eval_recall = 0.514 
global_step = 276 
loss = 5.6168394 

Is it because you're micro averaging and therefore "micro-F1 = micro-precision = micro-recall = accuracy"?

kamei86i commented 3 years ago

Yes, they are micro-averaged.
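
This equality is a general property of micro-averaging in single-label, multi-class classification: every misclassified example contributes exactly one false positive (for the predicted class) and one false negative (for the gold class), so the pooled precision, recall, and F1 all reduce to the fraction of correct predictions, i.e. accuracy. The sketch below (not part of this repository; it uses scikit-learn and made-up labels) illustrates the effect, and also shows that macro-averaged scores do not collapse in the same way.

```python
# Minimal illustration (not from the ganbert code base) of why micro-averaged
# precision/recall/F1 equal accuracy for single-label multi-class predictions.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up gold labels and predictions over four classes.
y_true = [0, 1, 2, 2, 3, 1, 0, 2]
y_pred = [0, 2, 2, 2, 3, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
p_micro, r_micro, f1_micro, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro")
p_macro, r_macro, f1_macro, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")

# Each wrong prediction is one FP (predicted class) and one FN (gold class),
# so pooled TP/(TP+FP) == TP/(TP+FN) == accuracy.
print(acc, p_micro, r_micro, f1_micro)  # all identical
print(p_macro, r_macro, f1_macro)       # macro scores generally differ
```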

utkarsh512 commented 3 years ago

I faced the same issue, so I updated the code to print the confusion matrix according to the labels provided in data_processors.py. This repository might help!
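
For anyone who wants a similar per-class breakdown without switching repositories, a minimal sketch along those lines could look like the following (assuming scikit-learn is available; the label names here are placeholders for the ones defined in data_processors.py):

```python
# Sketch of a per-class diagnostic (assumption: scikit-learn is installed).
# Replace the placeholder labels with the ones defined in data_processors.py.
from sklearn.metrics import confusion_matrix, classification_report

labels = ["label_a", "label_b", "label_c"]             # placeholder label set
y_true = ["label_a", "label_b", "label_b", "label_c"]  # gold labels
y_pred = ["label_a", "label_c", "label_b", "label_c"]  # model predictions

print(confusion_matrix(y_true, y_pred, labels=labels))
print(classification_report(y_true, y_pred, labels=labels, digits=3))
```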