Closed: ThonyPan closed this issue 1 month ago.

Hi @vaibhavad,

I've been trying to reproduce your work using the recently published MTEB evaluation script. However, on the subtasks I tested with the Mistral-7B-Instruct-v2-mntp unsupervised model, the results show discrepancies compared to those reported in Table 11 of your paper; specifically, performance on most tasks is lower than reported.

I have also observed that when using BF16 for model inference, parameters like batch_size significantly affect the outcomes. Could you please provide a more detailed tutorial to help reproduce the results as reported in the paper?

Thank you!
Hi @ThonyPan,
can you provide more details about the exact command that you ran?
I tried the following commands on two datasets:

AmazonCounterfactualClassification:

```bash
python experiments/mteb_eval.py --model_name McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse --task_name AmazonCounterfactualClassification --task_to_instructions_fp test_configs/mteb/task_to_instructions.json --output_dir results
```

STS16:

```bash
python experiments/mteb_eval.py --model_name McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse --task_name STS16 --task_to_instructions_fp test_configs/mteb/task_to_instructions.json --output_dir results
```
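As a reference point, the above is roughly equivalent to the following programmatic sketch using the llm2vec and mteb packages directly. This is an illustration assuming mteb 1.12.x, not the repository's script: the real `experiments/mteb_eval.py` additionally wraps the model to inject the per-task instructions from the JSON file, and depending on the mteb version you may need a thin wrapper so `encode` accepts the extra keyword arguments MTEB passes.

```python
import mteb
import torch
from llm2vec import LLM2Vec

# Load the MNTP base model and apply the unsupervised SimCSE adapter,
# following the usage pattern from the LLM2Vec README.
model = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)

# MTEB accepts any object exposing encode(sentences, **kwargs). Note that
# without the repo's wrapper, no task instructions are prepended here.
evaluation = mteb.MTEB(tasks=mteb.get_tasks(tasks=["STS16"]))
results = evaluation.run(model, output_folder="results")
```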
Here are the results from the two commands:

| Dataset | Reported | Reproduced | Difference |
|---|---|---|---|
| AmazonCounterfactualClassification | 76.94 | 77.82 | +0.88 |
| STS16 | 81.91 | 81.89 | -0.02 |
These evals were run on a single A100 with the default batch size of 32.
We are aware that BF16 inference causes discrepancies that depend on batch size, hardware, etc. Unfortunately, these issues stem from Hugging Face and PyTorch internals, so they cannot be fixed on our side. However, in our experience so far, the differences are very marginal.
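If you want to narrow down the remaining gap, one thing worth trying (a suggestion, not something the eval script does) is disabling TF32 and requesting deterministic kernels before encoding, at the cost of speed. You can also load the model with `torch_dtype=torch.float32` to take BF16 out of the equation entirely, at roughly double the memory.

```python
import torch

# Trade speed for reproducibility when comparing small metric differences.
# warn_only=True downgrades errors to warnings for ops that have no
# deterministic implementation.
torch.use_deterministic_algorithms(True, warn_only=True)
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```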
Let me know if you have any more questions.
Thank you for your response!
I have completed the model evaluation using exactly the same script you mentioned, on both a 40GB A100 and a 3090 GPU, but unfortunately, I was unable to reproduce the results. I suspect that the discrepancies might be due to differences in the versions of flash-attention, torch, mteb, etc. If possible, could you please provide a pip list or conda list? This would greatly help me replicate the environment in my experiments.
@ThonyPan,
Here is the YAML of the conda environment that I used for evaluation.
llm2vec was built from source, but the latest version (0.2.0) should also give similar results.
```yaml
name: /home/llm2vec/.conda
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - bzip2=1.0.8=h5eee18b_6
  - ca-certificates=2024.7.2=h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_1
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libuuid=1.41.5=h5eee18b_0
  - ncurses=6.4=h6a678d5_0
  - openssl=3.0.14=h5eee18b_0
  - pip=24.0=py310h06a4308_0
  - python=3.10.14=h955ad1f_1
  - readline=8.2=h5eee18b_0
  - setuptools=69.5.1=py310h06a4308_0
  - sqlite=3.45.3=h5eee18b_0
  - tk=8.6.14=h39e8969_0
  - wheel=0.43.0=py310h06a4308_0
  - xz=5.4.6=h5eee18b_1
  - zlib=1.2.13=h5eee18b_1
  - pip:
      - accelerate==0.32.1
      - aiohttp==3.9.5
      - aiosignal==1.3.1
      - annotated-types==0.7.0
      - async-timeout==4.0.3
      - attrs==23.2.0
      - certifi==2024.7.4
      - charset-normalizer==3.3.2
      - datasets==2.20.0
      - dill==0.3.8
      - eval-type-backport==0.2.0
      - evaluate==0.4.2
      - filelock==3.15.4
      - frozenlist==1.4.1
      - fsspec==2024.5.0
      - huggingface-hub==0.23.4
      - idna==3.7
      - jinja2==3.1.4
      - joblib==1.4.2
      - markdown-it-py==3.0.0
      - markupsafe==2.1.5
      - mdurl==0.1.2
      - mpmath==1.3.0
      - mteb==1.12.75
      - multidict==6.0.5
      - multiprocess==0.70.16
      - networkx==3.3
      - numpy==1.26.4
      - nvidia-cublas-cu12==12.1.3.1
      - nvidia-cuda-cupti-cu12==12.1.105
      - nvidia-cuda-nvrtc-cu12==12.1.105
      - nvidia-cuda-runtime-cu12==12.1.105
      - nvidia-cudnn-cu12==8.9.2.26
      - nvidia-cufft-cu12==11.0.2.54
      - nvidia-curand-cu12==10.3.2.106
      - nvidia-cusolver-cu12==11.4.5.107
      - nvidia-cusparse-cu12==12.1.0.106
      - nvidia-nccl-cu12==2.20.5
      - nvidia-nvjitlink-cu12==12.5.82
      - nvidia-nvtx-cu12==12.1.105
      - packaging==24.1
      - pandas==2.2.2
      - peft==0.11.1
      - pillow==10.4.0
      - polars==1.1.0
      - psutil==6.0.0
      - pyarrow==16.1.0
      - pyarrow-hotfix==0.6
      - pydantic==2.8.2
      - pydantic-core==2.20.1
      - pygments==2.18.0
      - python-dateutil==2.9.0.post0
      - pytrec-eval-terrier==0.5.6
      - pytz==2024.1
      - pyyaml==6.0.1
      - regex==2024.5.15
      - requests==2.32.3
      - rich==13.7.1
      - safetensors==0.4.3
      - scikit-learn==1.5.1
      - scipy==1.14.0
      - sentence-transformers==3.0.1
      - six==1.16.0
      - sympy==1.13.0
      - threadpoolctl==3.5.0
      - tokenizers==0.19.1
      - torch==2.3.1
      - tqdm==4.66.4
      - transformers==4.40.2
      - triton==2.3.1
      - typing-extensions==4.12.2
      - tzdata==2024.1
      - urllib3==2.2.2
      - xxhash==3.4.1
      - yarl==1.9.4
```
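To recreate it, save the block above as environment.yml and run `conda env create -f environment.yml -n <env-name>`; the `name:` field in this export points at a local path, so overriding it with `-n` is easiest. Note that flash-attn does not appear in the export, so you would install it separately (the repository README covers this).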
Thanks again for your response. I will close the issue and try to reproduce the results using the provided environment.