FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

The bge-m3 mldr score for lang zh #927

Open adol001 opened 3 months ago

adol001 commented 3 months ago

Regarding the bge-m3 MLDR score for lang zh: I am currently running

python FlagEmbedding/C_MTEB/MLDR/mteb_dense_eval/eval_MLDR.py \
--encoder /data/models/bge-m3 --languages zh \
--results_save_path /data/models/mldr_results \
--max_query_length 512 --max_passage_length 8192 \
--batch_size 256 --corpus_batch_size 1 \
--pooling_method cls --normalize_embeddings True \
--add_instruction False --overwrite True

The obtained ndcg_at_10 score is only 0.26017, which differs significantly from the value reported in the paper. Why might this be the case?


{
  "dataset_revision": null,
  "mteb_dataset_name": "MultiLongDocRetrieval",
  "mteb_version": "1.1.1",
  "test": {
    "evaluation_time": 11543.47,
    "zh": {
      "map_at_1": 0.175,
      "map_at_10": 0.22987,
      "map_at_100": 0.23761,
      "map_at_1000": 0.23846,
      "map_at_3": 0.21396,
      "map_at_5": 0.22208,
      "mrr_at_1": 0.175,
      "mrr_at_10": 0.22987,
      "mrr_at_100": 0.23761,
      "mrr_at_1000": 0.23846,
      "mrr_at_3": 0.21396,
      "mrr_at_5": 0.22208,
      "ndcg_at_1": 0.175,
      "ndcg_at_10": 0.26017,
      "ndcg_at_100": 0.3002,
      "ndcg_at_1000": 0.32686,
      "ndcg_at_3": 0.22644,
      "ndcg_at_5": 0.24123,
      "precision_at_1": 0.175,
      "precision_at_10": 0.03575,
      "precision_at_100": 0.00551,
      "precision_at_1000": 0.00077,
      "precision_at_3": 0.0875,
      "precision_at_5": 0.05975,
      "recall_at_1": 0.175,
      "recall_at_10": 0.3575,
      "recall_at_100": 0.55125,
      "recall_at_1000": 0.77,
      "recall_at_3": 0.2625,
      "recall_at_5": 0.29875
    }
  }
}
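For readers unfamiliar with the metrics above, ndcg_at_10 with binary relevance (a single labeled positive per query, as in MLDR) can be sketched as follows. This is a simplified illustration, not the exact implementation used by the mteb evaluation code:

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance: discounted gain of hits in the top-k,
    normalized by the ideal ranking (all relevant docs first)."""
    dcg = 0.0
    for i, doc_id in enumerate(ranked_ids[:k]):
        if doc_id in relevant_ids:
            dcg += 1.0 / math.log2(i + 2)  # rank positions are 1-based inside the log
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Toy query: the single relevant doc is retrieved at rank 3.
print(ndcg_at_k(["d7", "d2", "d9", "d1"], {"d9"}))  # 1/log2(4) = 0.5
```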
hanhainebula commented 3 months ago

Hello, the ndcg@10 of bge-m3-Dense on the MLDR zh dev split is 0.260, which matches the value reported in our paper: the values in the paper were all multiplied by 100, so 0.260 corresponds to the reported 26.0.

adol001 commented 3 months ago

@hanhainebula Based on the MLDR results, using an 8k window for Chinese embeddings with bge-m3 does not seem to have much practical value. Would it be better if I reduced the passage length to 1k? How should I trim the MLDR dataset so that only texts within a 1k window are evaluated?

If you have already tested MLDR with a 1k window for Chinese, I sincerely hope you can share those results here.
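As an aside, a crude offline approximation of such trimming would be to truncate each corpus entry to its first ~1024 tokens before indexing. This is a hypothetical sketch: a plain whitespace split stands in for the real XLM-RoBERTa tokenizer that bge-m3 uses, so token counts will not match exactly:

```python
def truncate_passage(text: str, max_tokens: int = 1024) -> str:
    """Keep only the first max_tokens whitespace-separated tokens of a passage.
    (Stand-in for model-tokenizer-based truncation.)"""
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

# A synthetic 3000-token document is cut down to 1024 tokens.
doc = " ".join(f"w{i}" for i in range(3000))
short = truncate_passage(doc)
print(len(short.split()))  # 1024
```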

hanhainebula commented 3 months ago

We didn't evaluate the results when max_passage_length=1024. You can set max_passage_length=1024 to perform evaluation to get the corresponding results:

python FlagEmbedding/C_MTEB/MLDR/mteb_dense_eval/eval_MLDR.py \
--encoder /data/models/bge-m3 --languages zh \
--results_save_path /data/models/mldr_results \
--max_query_length 512 --max_passage_length 1024 \
--batch_size 256 --corpus_batch_size 8 \
--pooling_method cls --normalize_embeddings True \
--add_instruction False --overwrite True
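As background on two of the flags above: with --pooling_method cls and --normalize_embeddings True, passage and query vectors are L2-normalized, so ranking the corpus by dot product is equivalent to ranking by cosine similarity. A minimal sketch of that search-time behavior (not the FlagEmbedding API; the vectors and doc names are made up):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length, as --normalize_embeddings True does."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query = l2_normalize([0.2, 0.9, 0.1])
corpus = {
    "doc_a": l2_normalize([0.1, 1.0, 0.0]),  # similar direction to the query
    "doc_b": l2_normalize([0.9, 0.1, 0.3]),  # mostly orthogonal content
}
# With unit vectors, dot product == cosine similarity.
ranked = sorted(corpus, key=lambda d: dot(query, corpus[d]), reverse=True)
print(ranked[0])  # doc_a
```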