McGill-NLP / llm2vec

Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
https://mcgill-nlp.github.io/llm2vec/
MIT License

Failed to reproduce MTEB results #122

Closed · ThonyPan closed this issue 1 month ago

ThonyPan commented 1 month ago

Hi @vaibhavad,

I've been trying to reproduce your work using the recently published MTEB evaluation script. However, on the subtasks I tested with the Mistral-7B-Instruct-v2-mntp-unsupervised model, the results show discrepancies compared to those reported in Table 11 of your paper. Specifically, the performance on most tasks is lower than what was reported.

I have observed that when using bf16 for model inference, parameters such as batch_size significantly affect the outcomes. Could you please provide a more detailed, step-by-step tutorial to help reproduce the results reported in the paper?

Thank you!

| Dataset | Reported | Reproduced | Difference |
| --- | --- | --- | --- |
| AmazonCounterfactualClassification | 76.94 | 74.84 | -2.10 |
| AmazonPolarityClassification | 85.29 | 80.48 | -4.81 |
| AmazonReviewsClassification | 47.09 | 42.77 | -4.32 |
| ArxivClusteringP2P | 47.56 | 47.67 | +0.11 |
| ArxivClusteringS2S | 39.92 | 39.97 | +0.05 |
| AskUbuntuDupQuestions | 58.60 | 57.84 | -0.76 |
| BIOSSES | 83.29 | 83.58 | +0.29 |
| Banking77Classification | 86.16 | 85.44 | -0.72 |
| STS12 | 67.65 | 64.65 | -3.00 |
| STS13 | 83.90 | 82.70 | -1.20 |
| STS14 | 76.97 | 75.54 | -1.43 |
| STS15 | 83.80 | 83.26 | -0.54 |
| STS16 | 81.91 | 81.54 | -0.37 |
| STS17 | 85.58 | 85.40 | -0.18 |
| STS22 | 65.93 | 66.14 | +0.21 |
| STSBenchmark | 80.42 | 79.60 | -0.82 |
| SummEval | 30.19 | 30.00 | -0.19 |
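
For reference, my encoding path follows the repo README; here is a minimal sketch (assuming the `LLM2Vec.from_pretrained` / `encode` interface from the README, with the same checkpoints as in the commands below):

```python
# Minimal sketch, assuming the LLM2Vec API shown in the repo README.
import torch
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse",
    device_map="cuda",
    torch_dtype=torch.bfloat16,  # bf16, as in the runs discussed above
)

# STS-style tasks embed raw sentences (no instruction prefix).
reps = l2v.encode(
    ["A man is playing a guitar.", "A woman is cutting vegetables."],
    batch_size=32,
)
print(reps.shape)
```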
vaibhavad commented 1 month ago

Hi @ThonyPan,

can you provide more details about the exact command that you ran?

I tried the following commands on two datasets:

AmazonCounterfactualClassification

```bash
python experiments/mteb_eval.py --model_name McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse --task_name AmazonCounterfactualClassification --task_to_instructions_fp test_configs/mteb/task_to_instructions.json --output_dir results
```

STS16

```bash
python experiments/mteb_eval.py --model_name McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse --task_name STS16 --task_to_instructions_fp test_configs/mteb/task_to_instructions.json --output_dir results
```

Here are the results:

| Dataset | Reported | Reproduced | Difference |
| --- | --- | --- | --- |
| AmazonCounterfactualClassification | 76.94 | 77.82 | +0.88 |
| STS16 | 81.91 | 81.89 | -0.02 |

These evals were run on a single A100 with the default batch size of 32.
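
If it helps with debugging, `experiments/mteb_eval.py` boils down to a standard MTEB run. A rough, self-contained sketch (not the actual script: it omits the LLM2Vec wrapper and the per-task instructions from `task_to_instructions.json`, and substitutes a small placeholder encoder so it runs standalone):

```python
# Rough sketch of the evaluation loop; illustrative only.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Placeholder encoder so this snippet is self-contained; the runs above
# use the LLM2Vec Mistral checkpoint instead.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

evaluation = MTEB(tasks=["STS16"])
evaluation.run(model, output_folder="results")
```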

We are aware that BF16 inference results vary with batch size, hardware, etc. Unfortunately, these issues originate in Hugging Face and PyTorch, so they cannot be fixed on our end. In our experience so far, however, the differences are very marginal.
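
One way to observe this in isolation is to encode identical inputs at two batch sizes and compare the embeddings. A sketch, again assuming the `LLM2Vec` API from the README; rerunning it with `torch_dtype=torch.float32` should shrink the gap to numerical noise:

```python
# Sketch: measure bf16 embedding drift caused only by batch size.
import torch
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

sentences = [f"Sentence number {i} about text embeddings." for i in range(64)]
a = l2v.encode(sentences, batch_size=8)
b = l2v.encode(sentences, batch_size=64)

# Differing batch sizes change padding and kernel dispatch, so bf16
# embeddings can diverge slightly; downstream scores then drift too.
print("max abs diff:", (a - b).abs().max().item())
```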

Let me know if you have any more questions.

ThonyPan commented 1 month ago

Thank you for your response!

I have completed the model evaluation using exactly the same script you mentioned, on both 40GB A100 and 3090 GPUs, but unfortunately I was unable to reproduce the results. I suspect the discrepancies might be due to differences in the versions of flash-attention, torch, mteb, etc. If possible, could you please provide a pip list or conda list? That would greatly help me replicate your environment in my experiments.

vaibhavad commented 1 month ago

@ThonyPan,

Here is the YAML of the conda environment that I used for evaluation.

llm2vec was built from source, but the latest version (0.2.0) should also give similar results.

```yaml
name: /home/llm2vec/.conda
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - bzip2=1.0.8=h5eee18b_6
  - ca-certificates=2024.7.2=h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_1
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libuuid=1.41.5=h5eee18b_0
  - ncurses=6.4=h6a678d5_0
  - openssl=3.0.14=h5eee18b_0
  - pip=24.0=py310h06a4308_0
  - python=3.10.14=h955ad1f_1
  - readline=8.2=h5eee18b_0
  - setuptools=69.5.1=py310h06a4308_0
  - sqlite=3.45.3=h5eee18b_0
  - tk=8.6.14=h39e8969_0
  - wheel=0.43.0=py310h06a4308_0
  - xz=5.4.6=h5eee18b_1
  - zlib=1.2.13=h5eee18b_1
  - pip:
      - accelerate==0.32.1
      - aiohttp==3.9.5
      - aiosignal==1.3.1
      - annotated-types==0.7.0
      - async-timeout==4.0.3
      - attrs==23.2.0
      - certifi==2024.7.4
      - charset-normalizer==3.3.2
      - datasets==2.20.0
      - dill==0.3.8
      - eval-type-backport==0.2.0
      - evaluate==0.4.2
      - filelock==3.15.4
      - frozenlist==1.4.1
      - fsspec==2024.5.0
      - huggingface-hub==0.23.4
      - idna==3.7
      - jinja2==3.1.4
      - joblib==1.4.2
      - markdown-it-py==3.0.0
      - markupsafe==2.1.5
      - mdurl==0.1.2
      - mpmath==1.3.0
      - mteb==1.12.75
      - multidict==6.0.5
      - multiprocess==0.70.16
      - networkx==3.3
      - numpy==1.26.4
      - nvidia-cublas-cu12==12.1.3.1
      - nvidia-cuda-cupti-cu12==12.1.105
      - nvidia-cuda-nvrtc-cu12==12.1.105
      - nvidia-cuda-runtime-cu12==12.1.105
      - nvidia-cudnn-cu12==8.9.2.26
      - nvidia-cufft-cu12==11.0.2.54
      - nvidia-curand-cu12==10.3.2.106
      - nvidia-cusolver-cu12==11.4.5.107
      - nvidia-cusparse-cu12==12.1.0.106
      - nvidia-nccl-cu12==2.20.5
      - nvidia-nvjitlink-cu12==12.5.82
      - nvidia-nvtx-cu12==12.1.105
      - packaging==24.1
      - pandas==2.2.2
      - peft==0.11.1
      - pillow==10.4.0
      - polars==1.1.0
      - psutil==6.0.0
      - pyarrow==16.1.0
      - pyarrow-hotfix==0.6
      - pydantic==2.8.2
      - pydantic-core==2.20.1
      - pygments==2.18.0
      - python-dateutil==2.9.0.post0
      - pytrec-eval-terrier==0.5.6
      - pytz==2024.1
      - pyyaml==6.0.1
      - regex==2024.5.15
      - requests==2.32.3
      - rich==13.7.1
      - safetensors==0.4.3
      - scikit-learn==1.5.1
      - scipy==1.14.0
      - sentence-transformers==3.0.1
      - six==1.16.0
      - sympy==1.13.0
      - threadpoolctl==3.5.0
      - tokenizers==0.19.1
      - torch==2.3.1
      - tqdm==4.66.4
      - transformers==4.40.2
      - triton==2.3.1
      - typing-extensions==4.12.2
      - tzdata==2024.1
      - urllib3==2.2.2
      - xxhash==3.4.1
      - yarl==1.9.4
```
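
To compare an existing environment against these pins, a small hypothetical helper can be used (the packages checked here are an illustrative subset chosen by hand):

```python
# Hypothetical helper: compare installed versions against the pins above.
from importlib.metadata import PackageNotFoundError, version

PINS = {
    "torch": "2.3.1",
    "transformers": "4.40.2",
    "tokenizers": "0.19.1",
    "mteb": "1.12.75",
    "sentence-transformers": "3.0.1",
    "peft": "0.11.1",
    "accelerate": "0.32.1",
}

for pkg, want in PINS.items():
    try:
        have = version(pkg)
    except PackageNotFoundError:
        have = "not installed"
    flag = "OK" if have == want else "MISMATCH"
    print(f"{pkg:25s} pinned {want:10s} installed {have:15s} {flag}")
```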
ThonyPan commented 1 month ago

Thanks again for your response. I will close the issue and try to reproduce the results using the provided environment.