For multilabel text classification, the best F1 score from -autotune-validation does not equal the model evaluation result

I trained the model with:

./fasttext supervised -input /home/data/20201018_20201018_train.tsv -output /home/data/output_models/fasttext -autotune-validation /home/data/20201018_20201018_dev.tsv -autotune-duration 36000 -loss one-vs-all

The output is:

Warning : loss is manually set to a specific value. It will not be automatically optimized.
Progress: 100.0% Trials: 1422 Best score:  0.662303 ETA:   0h 0m 0s
Training again with best arguments
Read 0M words
Number of words:  3254
Number of labels: 7
Progress: 100.0% words/sec/thread:   96354 lr:  0.000000 avg.loss:  0.111543 ETA:   0h 0m 0s

From this output, the best F1 score is 0.662303. According to https://github.com/facebookresearch/fastText/issues/914, this is the F1
score for multilabel text classification, so it should equal 2 * P_micro * R_micro / (P_micro + R_micro). I then evaluated the model:

./fasttext test /home/data/output_models/fasttext.bin /home/data/20201018_20201018_dev.tsv -1 0.5

The result is:

N   6428
P@-1    0.717
R@-1    0.601

However, 2 * 0.717 * 0.601 / (0.717 + 0.601) = 0.6538952959028831, which is not equal to the best F1 score of 0.662303. Why? Best regards!
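
To double-check the arithmetic, here is a minimal Python sketch that recomputes F1 from the P@-1 and R@-1 values printed by `fasttext test`. It also checks whether the three-decimal rounding of those values could account for the gap. The function and variable names are hypothetical, and the half-step of 0.0005 assumes the printed values are rounded to three decimals. The sketch only verifies the calculation and does not explain the cause of the mismatch.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the outputs above.
reported_precision = 0.717   # P@-1 from `fasttext test`
reported_recall = 0.601      # R@-1 from `fasttext test`
autotune_best_score = 0.662303

f1 = f1_from_precision_recall(reported_precision, reported_recall)

# Assumption: the printed P and R are rounded to three decimals, so the
# true values lie within +/- 0.0005 of what is shown. F1 increases with
# both P and R, so the extremes occur at the corners of that box.
half_step = 0.0005
f1_low = f1_from_precision_recall(reported_precision - half_step,
                                  reported_recall - half_step)
f1_high = f1_from_precision_recall(reported_precision + half_step,
                                   reported_recall + half_step)

print(f"F1 from reported P/R: {f1:.6f}")
print(f"F1 range allowed by rounding: [{f1_low:.6f}, {f1_high:.6f}]")
print("Autotune best score inside that range:",
      f1_low <= autotune_best_score <= f1_high)
```

Under that rounding assumption, the upper end of the range stays below 0.662303, so rounding of the printed precision and recall does not appear to explain the difference.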

For multilabel text classification, the best F1 score of -autotune-validation not equal to model evaluation #1144