results of the best checkpoint are different between training and evaluation

At the end of the training log, the results for the best checkpoint are:

training completed...

--------------------------------------best--------------------------------------
[best] epoch: 25
[loss] loss: 44.52341
[loss] ref_loss: 16.75534
[loss] ref_mask_loss: 0.0
[loss] lang_cls_loss: 0.22115
[loss] objectness_loss: 0.33091
[loss] kps_loss: 0.0285
[loss] box_loss: 2.68898
[loss] sem_cls_loss: 5.56197
[loss] lang_cls_acc: 0.93388
[sco.] ref_acc: 0.14872
[sco.] obj_acc: 0.76845
[sco.] pos_ratio: 0.68719, neg_ratio: 0.31281
[sco.] iou_rate_0.25: 0.47397, iou_rate_0.5: 0.36692

saving checkpoint...

saving last models...

After training, I ran the evaluation command

CUDA_VISIBLE_DEVICES=0 python scripts/eval.py --config ./config/sps.yaml --folder 2023-05-07_00-36_SPS/ --reference --no_nms --force

and got:

unique:
unique | not_in_others | ref_acc: 0.14891243725599554
unique | not_in_others | acc@0.25iou: 0.8120468488566648
unique | not_in_others | acc@0.5iou: 0.6447295036252092
unique | in_others | ref_acc: 0.09615384615384616
unique | in_others | acc@0.25iou: 0.7692307692307693
unique | in_others | acc@0.5iou: 0.5961538461538461
unique | overall | ref_acc: 0.14742547425474256
unique | overall | acc@0.25iou: 0.810840108401084
unique | overall | acc@0.5iou: 0.643360433604336

multiple:
multiple | not_in_others | ref_acc: 0.07918758557736194
multiple | not_in_others | acc@0.25iou: 0.3247375627567321
multiple | not_in_others | acc@0.5iou: 0.26449109995435877
multiple | in_others | ref_acc: 0.2307223407497714
multiple | in_others | acc@0.25iou: 0.4687595245352027
multiple | in_others | acc@0.5iou: 0.32855836635172203
multiple | overall | ref_acc: 0.14406890251859586
multiple | overall | acc@0.25iou: 0.38640219235286444
multiple | overall | acc@0.5iou: 0.29192222367219106

overall:
overall | not_in_others | ref_acc: 0.0994331983805668
overall | not_in_others | acc@0.25iou: 0.4662348178137652
overall | not_in_others | acc@0.5iou: 0.3748987854251012
overall | in_others | ref_acc: 0.22862286228622863
overall | in_others | acc@0.25iou: 0.4734473447344735
overall | in_others | acc@0.5iou: 0.3327332733273327
overall | overall | ref_acc: 0.1447202355910812
overall | overall | acc@0.25iou: 0.4687631468237274
overall | overall | acc@0.5iou: 0.3601177955405974

language classification accuracy: 0.9309404022447408

The best overall accuracy reported during training is iou_rate_0.25: 0.47397, iou_rate_0.5: 0.36692, but evaluating the same run gives overall | overall | acc@0.25iou: 0.4687631468237274 and overall | overall | acc@0.5iou: 0.3601177955405974.
Why is there such a discrepancy? Did I make a mistake somewhere?
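
To quantify the gap, here is a minimal sketch that uses only the numbers copied from the two logs above. The variable names are my own and are not taken from the 3D-SPS code.

```python
# Values copied verbatim from the training log ("best" block) and from
# the "overall | overall" lines of the evaluation output.
# The dictionary names are illustrative, not part of the repository.
train_best = {"acc@0.25iou": 0.47397, "acc@0.5iou": 0.36692}
eval_overall = {"acc@0.25iou": 0.4687631468237274, "acc@0.5iou": 0.3601177955405974}

# Absolute difference (training minus evaluation) for each IoU threshold.
gaps = {name: train_best[name] - eval_overall[name] for name in train_best}

for name, gap in gaps.items():
    print(f"{name}: train - eval = {gap:.4f}")
```

Both gaps are below one percentage point: about 0.0052 at IoU 0.25 and about 0.0068 at IoU 0.5.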

fjhzhixi / 3D-SPS, issue #15