First, we need to evaluate the overall performance of the system.
Second, we need to evaluate the performance of each classifier.
Third, we need to perform error analysis for each stage of the pipeline and adjust the model.
Finally, we iterate the three steps above until satisfactory performance is achieved.
Evaluation Targets
We need to perform evaluation on the following items.
Retrieved items | Unordered retrieval measures      | Ordered retrieval measures
concepts        | mean precision, recall, F-measure | MAP, GMAP
articles        | mean precision, recall, F-measure | MAP, GMAP
triples         | mean precision, recall, F-measure | MAP, GMAP
Flat Evaluation
We need to perform evaluations for each classifier. The following measures will be used.
Precision: P = TP / (TP + FP)
Recall: R = TP / (TP + FN)
F-measure: F = 2 * P * R / (P + R)
Average Precision: AP = (1 / |L_R|) * Σ_{r=1}^{|L|} P(r) * rel(r), where |L| is the number of retrieved items and |L_R| is the number of relevant items.
Mean Average Precision: MAP = (1/n) * Σ_{i=1}^{n} AP_i, for a list of queries q_1, q_2, …, q_n.
Geometric Mean Average Precision: GMAP = (Π_{i=1}^{n} (AP_i + ε))^{1/n}, where ε is a small value for smoothing.
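These flat measures can be computed directly from ranked result lists; a minimal Python sketch (the function names are illustrative, not part of the system):

```python
import math

def average_precision(ranked, relevant):
    """AP for one ranked list: mean of P(r) over the ranks r of relevant items."""
    hits = 0
    total = 0.0
    for r, item in enumerate(ranked, start=1):
        if item in relevant:          # rel(r) = 1
            hits += 1
            total += hits / r         # P(r) at this rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: arithmetic mean of AP over queries; runs = [(ranked, relevant), ...]."""
    aps = [average_precision(rk, rel) for rk, rel in runs]
    return sum(aps) / len(aps)

def geometric_map(runs, eps=0.01):
    """GMAP: geometric mean of (AP + eps); eps keeps zero APs from collapsing the product."""
    aps = [average_precision(rk, rel) for rk, rel in runs]
    return math.exp(sum(math.log(ap + eps) for ap in aps) / len(aps))
```

GMAP rewards consistency across queries: a single near-zero AP drags the geometric mean down far more than it does the arithmetic mean.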
Hierarchical Evaluation
The classification is hierarchical, so flat evaluation measures alone are not sufficient. With multiple levels of classification, an error at any single classifier makes the final result incorrect, and flat measures cannot tell us which classifier caused the error. We therefore need to design a hierarchy of measurements that takes the relations between classifiers and the performance of each individual classifier into consideration.
Kiritchenko et al. proposed a hierarchical precision as:
hP = |An(C_p) ∩ An(C_t)| / |An(C_p)|
where Cp is the set of predicted categories, An(Cp) is the set of ancestors of Cp, Ct is the set of true categories and An(Ct) is the set of ancestors of Ct.
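A minimal sketch of this measure, assuming the hierarchy is a tree stored as child→parent links and that An(C) contains the categories in C together with all of their ancestors (both assumptions are ours, not from the source):

```python
def ancestors(categories, parent):
    """An(C): the categories in C plus all of their ancestors in the tree."""
    closure = set()
    for c in categories:
        while c is not None:          # walk up to the root
            closure.add(c)
            c = parent.get(c)         # root has no entry -> None
    return closure

def hierarchical_precision(predicted, true, parent):
    """hP = |An(Cp) ∩ An(Ct)| / |An(Cp)| (Kiritchenko et al.)."""
    an_p = ancestors(predicted, parent)
    an_t = ancestors(true, parent)
    return len(an_p & an_t) / len(an_p) if an_p else 0.0
```

A prediction in the right subtree but the wrong leaf still earns partial credit, since it shares ancestors with the true category.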
For the general evaluation, flat micro-F1 measure will be used.
MiF1 = 2 * MiP * MiR / (MiP + MiR), where MiP and MiR are defined as follows.
MiP = Σ_i tp_{c_i} / Σ_i (tp_{c_i} + fp_{c_i})
MiR = Σ_i tp_{c_i} / Σ_i (tp_{c_i} + fn_{c_i})
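Micro-averaging pools the counts over all classes before computing the ratios; a small sketch, assuming per-class (tp, fp, fn) counts are already available:

```python
def micro_scores(counts):
    """counts: {class: (tp, fp, fn)}; returns (MiP, MiR, MiF1)."""
    tp = sum(c[0] for c in counts.values())   # pooled true positives
    fp = sum(c[1] for c in counts.values())   # pooled false positives
    fn = sum(c[2] for c in counts.values())   # pooled false negatives
    mip = tp / (tp + fp) if tp + fp else 0.0
    mir = tp / (tp + fn) if tp + fn else 0.0
    mif1 = 2 * mip * mir / (mip + mir) if mip + mir else 0.0
    return mip, mir, mif1
```

Because the counts are pooled, frequent classes dominate the micro scores, which suits an overall system-level summary.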
The precision (LCaP), recall (LCaR), and F-measure (LCaF) variants based on the lowest common ancestor (LCA) will also be applied on the hierarchy.
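The core operation behind these variants is finding the lowest common ancestor of a predicted and a true category; a minimal sketch for a tree stored as child→parent links, assuming both categories lie in the same tree (the LCa measures then restrict the ancestor sets to the paths up to the LCA):

```python
def lca(u, v, parent):
    """Lowest common ancestor of u and v, given child -> parent links."""
    path = set()
    while u is not None:       # collect u's path up to the root
        path.add(u)
        u = parent.get(u)
    while v not in path:       # climb from v until the paths meet
        v = parent.get(v)
    return v
```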