There seems to be an error in the macro F1 calculation. Flair averages the per-class precisions and recalls and then computes F1 from those averages. As far as I know, F1 should instead be computed per class and then averaged across classes - that would be consistent with sklearn.metrics.f1_score(labels, predictions, average='macro').
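A minimal sketch illustrating the difference between the two definitions (the labels and predictions here are made-up toy data, not from Flair):

```python
from sklearn.metrics import precision_recall_fscore_support, f1_score

# Tiny 3-class example where the two macro-F1 definitions disagree.
labels = [0, 0, 0, 1, 1, 2]
predictions = [0, 0, 1, 1, 2, 2]

# Per-class precision, recall, and F1.
prec, rec, f1, _ = precision_recall_fscore_support(
    labels, predictions, average=None, zero_division=0
)

# Macro F1 as sklearn defines it: average the per-class F1 scores.
macro_f1 = f1.mean()

# The variant described above: average the per-class precisions and
# recalls first, then compute a single F1 from those averages.
avg_p, avg_r = prec.mean(), rec.mean()
f1_of_averages = 2 * avg_p * avg_r / (avg_p + avg_r)

print(macro_f1)        # matches f1_score(..., average='macro')
print(f1_of_averages)  # differs in general
```

On this toy example the two values come out different, which shows the two formulas are not interchangeable.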