McGill-NLP / llm2vec

Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
https://mcgill-nlp.github.io/llm2vec/
MIT License

Failed to reproduce MTEB results #122

Closed · ThonyPan closed this issue 1 month ago

ThonyPan commented 1 month ago

Hi @vaibhavad,

I've been trying to reproduce your work using the recently published MTEB evaluation script. However, on the subtasks I tested with the Mistral-7B-Instruct-v2-mntp-unsupervised model, the results show discrepancies compared to those reported in Table 11 of your paper. Specifically, the performance on most tasks is lower than what was reported.

I have observed that when using bf16 for model inference, parameters such as batch_size significantly affect the outcomes. Could you please provide a more detailed, step-by-step tutorial to help reproduce the results reported in the paper?

Thank you!

| Dataset | Reported | Reproduced | Difference |
| --- | --- | --- | --- |
| AmazonCounterfactualClassification | 76.94 | 74.84 | -2.10 |
| AmazonPolarityClassification | 85.29 | 80.48 | -4.81 |
| AmazonReviewsClassification | 47.09 | 42.77 | -4.32 |
| ArxivClusteringP2P | 47.56 | 47.67 | +0.11 |
| ArxivClusteringS2S | 39.92 | 39.97 | +0.05 |
| AskUbuntuDupQuestions | 58.60 | 57.84 | -0.76 |
| BIOSSES | 83.29 | 83.58 | +0.29 |
| Banking77Classification | 86.16 | 85.44 | -0.72 |
| STS12 | 67.65 | 64.65 | -3.00 |
| STS13 | 83.90 | 82.70 | -1.20 |
| STS14 | 76.97 | 75.54 | -1.43 |
| STS15 | 83.80 | 83.26 | -0.54 |
| STS16 | 81.91 | 81.54 | -0.37 |
| STS17 | 85.58 | 85.40 | -0.18 |
| STS22 | 65.93 | 66.14 | +0.21 |
| STSBenchmark | 80.42 | 79.60 | -0.82 |
| SummEval | 30.19 | 30.00 | -0.19 |
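
For reference, my encoding path follows the repo README; here is a minimal sketch (assuming the `LLM2Vec.from_pretrained` / `encode` interface from the README, with the same checkpoints as in the commands below):

```python
# Minimal sketch, assuming the LLM2Vec API shown in the repo README.
import torch
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse",
    device_map="cuda",
    torch_dtype=torch.bfloat16,  # bf16, as in the runs discussed above
)

# STS-style tasks embed raw sentences (no instruction prefix).
reps = l2v.encode(
    ["A man is playing a guitar.", "A woman is cutting vegetables."],
    batch_size=32,
)
print(reps.shape)
```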
vaibhavad commented 1 month ago

Hi @ThonyPan,

can you provide more details about the exact command that you ran?

I tried the following commands on two datasets:

AmazonCounterfactualClassification

```bash
python experiments/mteb_eval.py --model_name McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse --task_name AmazonCounterfactualClassification --task_to_instructions_fp test_configs/mteb/task_to_instructions.json --output_dir results
```

STS16

```bash
python experiments/mteb_eval.py --model_name McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse --task_name STS16 --task_to_instructions_fp test_configs/mteb/task_to_instructions.json --output_dir results
```

Here are the results:

| Dataset | Reported | Reproduced | Difference |
| --- | --- | --- | --- |
| AmazonCounterfactualClassification | 76.94 | 77.82 | +0.88 |
| STS16 | 81.91 | 81.89 | -0.02 |

These evals were run on a single A100 with the default batch size of 32.
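
If it helps with debugging, `experiments/mteb_eval.py` boils down to a standard MTEB run. A rough, self-contained sketch (not the actual script: it omits the LLM2Vec wrapper and the per-task instructions from `task_to_instructions.json`, and substitutes a small placeholder encoder so it runs standalone):

```python
# Rough sketch of the evaluation loop; illustrative only.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Placeholder encoder so this snippet is self-contained; the runs above
# use the LLM2Vec Mistral checkpoint instead.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

evaluation = MTEB(tasks=["STS16"])
evaluation.run(model, output_folder="results")
```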

We are aware that BF16 inference results vary with batch size, hardware, etc. Unfortunately, these issues originate in Hugging Face and PyTorch, so they cannot be fixed on our end. In our experience so far, however, the differences are very marginal.
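
One way to observe this in isolation is to encode identical inputs at two batch sizes and compare the embeddings. A sketch, again assuming the `LLM2Vec` API from the README; rerunning it with `torch_dtype=torch.float32` should shrink the gap to numerical noise:

```python
# Sketch: measure bf16 embedding drift caused only by batch size.
import torch
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

sentences = [f"Sentence number {i} about text embeddings." for i in range(64)]
a = l2v.encode(sentences, batch_size=8)
b = l2v.encode(sentences, batch_size=64)

# Differing batch sizes change padding and kernel dispatch, so bf16
# embeddings can diverge slightly; downstream scores then drift too.
print("max abs diff:", (a - b).abs().max().item())
```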

Let me know if you have any more questions.

ThonyPan commented 1 month ago

Thank you for your response!

I have completed the model evaluation using exactly the same script you mentioned, on both 40GB A100 and 3090 GPUs, but unfortunately I was unable to reproduce the results. I suspect the discrepancies might be due to differences in the versions of flash-attention, torch, mteb, etc. If possible, could you please provide a pip list or conda list? That would greatly help me replicate your environment in my experiments.

vaibhavad commented 1 month ago

@ThonyPan,

Here is the YAML of the conda environment that I used for evaluation.

llm2vec was built from source, but the latest version (0.2.0) should also give similar results.

```yaml
name: /home/llm2vec/.conda
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - bzip2=1.0.8=h5eee18b_6
  - ca-certificates=2024.7.2=h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_1
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libuuid=1.41.5=h5eee18b_0
  - ncurses=6.4=h6a678d5_0
  - openssl=3.0.14=h5eee18b_0
  - pip=24.0=py310h06a4308_0
  - python=3.10.14=h955ad1f_1
  - readline=8.2=h5eee18b_0
  - setuptools=69.5.1=py310h06a4308_0
  - sqlite=3.45.3=h5eee18b_0
  - tk=8.6.14=h39e8969_0
  - wheel=0.43.0=py310h06a4308_0
  - xz=5.4.6=h5eee18b_1
  - zlib=1.2.13=h5eee18b_1
  - pip:
      - accelerate==0.32.1
      - aiohttp==3.9.5
      - aiosignal==1.3.1
      - annotated-types==0.7.0
      - async-timeout==4.0.3
      - attrs==23.2.0
      - certifi==2024.7.4
      - charset-normalizer==3.3.2
      - datasets==2.20.0
      - dill==0.3.8
      - eval-type-backport==0.2.0
      - evaluate==0.4.2
      - filelock==3.15.4
      - frozenlist==1.4.1
      - fsspec==2024.5.0
      - huggingface-hub==0.23.4
      - idna==3.7
      - jinja2==3.1.4
      - joblib==1.4.2
      - markdown-it-py==3.0.0
      - markupsafe==2.1.5
      - mdurl==0.1.2
      - mpmath==1.3.0
      - mteb==1.12.75
      - multidict==6.0.5
      - multiprocess==0.70.16
      - networkx==3.3
      - numpy==1.26.4
      - nvidia-cublas-cu12==12.1.3.1
      - nvidia-cuda-cupti-cu12==12.1.105
      - nvidia-cuda-nvrtc-cu12==12.1.105
      - nvidia-cuda-runtime-cu12==12.1.105
      - nvidia-cudnn-cu12==8.9.2.26
      - nvidia-cufft-cu12==11.0.2.54
      - nvidia-curand-cu12==10.3.2.106
      - nvidia-cusolver-cu12==11.4.5.107
      - nvidia-cusparse-cu12==12.1.0.106
      - nvidia-nccl-cu12==2.20.5
      - nvidia-nvjitlink-cu12==12.5.82
      - nvidia-nvtx-cu12==12.1.105
      - packaging==24.1
      - pandas==2.2.2
      - peft==0.11.1
      - pillow==10.4.0
      - polars==1.1.0
      - psutil==6.0.0
      - pyarrow==16.1.0
      - pyarrow-hotfix==0.6
      - pydantic==2.8.2
      - pydantic-core==2.20.1
      - pygments==2.18.0
      - python-dateutil==2.9.0.post0
      - pytrec-eval-terrier==0.5.6
      - pytz==2024.1
      - pyyaml==6.0.1
      - regex==2024.5.15
      - requests==2.32.3
      - rich==13.7.1
      - safetensors==0.4.3
      - scikit-learn==1.5.1
      - scipy==1.14.0
      - sentence-transformers==3.0.1
      - six==1.16.0
      - sympy==1.13.0
      - threadpoolctl==3.5.0
      - tokenizers==0.19.1
      - torch==2.3.1
      - tqdm==4.66.4
      - transformers==4.40.2
      - triton==2.3.1
      - typing-extensions==4.12.2
      - tzdata==2024.1
      - urllib3==2.2.2
      - xxhash==3.4.1
      - yarl==1.9.4
```
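
To compare an existing environment against these pins, a small hypothetical helper can be used (the packages checked here are an illustrative subset chosen by hand):

```python
# Hypothetical helper: compare installed versions against the pins above.
from importlib.metadata import PackageNotFoundError, version

PINS = {
    "torch": "2.3.1",
    "transformers": "4.40.2",
    "tokenizers": "0.19.1",
    "mteb": "1.12.75",
    "sentence-transformers": "3.0.1",
    "peft": "0.11.1",
    "accelerate": "0.32.1",
}

for pkg, want in PINS.items():
    try:
        have = version(pkg)
    except PackageNotFoundError:
        have = "not installed"
    flag = "OK" if have == want else "MISMATCH"
    print(f"{pkg:25s} pinned {want:10s} installed {have:15s} {flag}")
```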
ThonyPan commented 1 month ago

Thanks again for your response. I will close the issue and try to reproduce the results using the provided environment.