4AI / BeLLM

Code for BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings (NAACL 2024)
https://arxiv.org/abs/2311.05296
MIT License

Clarification on Settings for Llama-7B Model in Figure 4 of BeLLM Paper #4

Closed: fuyuchenIfyw closed this issue 2 months ago

fuyuchenIfyw commented 2 months ago

Hello,

I have a question regarding the specific settings used for the Llama-7B model in Figure 4 of the BeLLM paper.

Following the steps in the README, I was able to successfully reproduce the results of the last row in Table 1 of the BeLLM paper. During the installation of the necessary Python packages, I found that angle-emb==3.1.0 is no longer available. Therefore, I installed the library using pip install angle-emb instead of pip install angle-emb==3.1.0. As a result, I needed to make the following changes to the model.py script:

  1. Removed the is_llm parameter in line 233 because the new version of angle-emb does not support this option.
  2. Set self.pooling_strategy to 'last' in line 53.

After making these modifications, downloading the model, and running the command provided in step 4 (Evaluation) of the README, I was able to reproduce the results as shown in the paper.
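For reference, the two edits are roughly equivalent to constructing the embedding model like the following sketch. This is a hypothetical reconstruction, not the actual model.py code; '<backbone>' and '<lora-weights>' are placeholders for the checkpoints named in the README, and the keyword names follow the angle_emb README, so they may differ across versions:

import torch
from angle_emb import AnglE

# Hypothetical sketch only: '<backbone>' and '<lora-weights>' are
# placeholders, not real model paths.
angle = AnglE.from_pretrained(
    '<backbone>',
    pretrained_lora_path='<lora-weights>',
    pooling_strategy='last',   # edit 2: last-token pooling (model.py line 53)
    torch_dtype=torch.float16,
    # is_llm=...               # edit 1: removed; newer angle-emb drops this flag
).cuda()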

However, I encountered an issue when trying to reproduce the results of the Llama-7B model in Figure 4 of the BeLLM paper. I made the following changes:

  1. Set BiLLM_START_INDEX to -1 to disable the bidirectional attention mechanism.
  2. Set lora_name_or_path to None.
  3. Set the apply_lora parameter to False in line 84 of eval_sts.py where the model is defined.

I expected these changes to reproduce the vanilla Llama2-7B results on the STS benchmark. However, my results are approximately 10 points lower than the Llama-7B results at layer 32 shown in Figure 4.
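Roughly, this vanilla configuration amounts to the sketch below (again hypothetical; it assumes BiLLM_START_INDEX is read from the environment, as in the billm package, so it must be set before the model code is imported):

import os

# Disable the backward-dependency (bidirectional) attention layers;
# must be set before billm/angle_emb are imported.
os.environ['BiLLM_START_INDEX'] = '-1'

from angle_emb import AnglE

# No LoRA weights: evaluate the plain Llama-2-7B backbone.
angle = AnglE.from_pretrained(
    'NousResearch/Llama-2-7b-hf',
    pooling_strategy='last',
    # pretrained_lora_path intentionally left unset (apply_lora=False)
)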

Experimental Results: (screenshot attached)

Could you please clarify the specific settings used for the Llama-7B model in Figure 4? Any guidance on how to reproduce these results would be greatly appreciated.

Thank you!

SeanLee97 commented 2 months ago

Sorry for the delayed reply.

I replied to your email a few hours ago. For convenience, I am also posting the reply here:

I checked the degradation code and found that we used a prompt from previous work in the degradation experiment: 'Summarize sentence "{text}" in one word:"'. Sorry for the confusion; we forgot to specify the prompt in the degradation section of the paper.

But the conclusion is similar no matter which prompt is used: STS performance degrades at the last layer. Here is a quick experiment comparing two prompts:

  1. prompt=The representative word for sentence {text} is:"

     layer=31
     +-------+-------+-------+-------+-------+--------------+-----------------+-------+
     | STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
     +-------+-------+-------+-------+-------+--------------+-----------------+-------+
     | 49.39 | 71.19 | 57.81 | 64.50 | 63.22 |    57.51     |      58.83      | 60.35 |
     +-------+-------+-------+-------+-------+--------------+-----------------+-------+

     layer=32
     +-------+-------+-------+-------+-------+--------------+-----------------+-------+
     | STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
     +-------+-------+-------+-------+-------+--------------+-----------------+-------+
     | 44.79 | 65.73 | 50.39 | 58.70 | 58.10 |    51.42     |      47.92      | 53.86 |
     +-------+-------+-------+-------+-------+--------------+-----------------+-------+

  2. prompt=Summarize sentence "{text}" in one word:"

     layer=31
     +-------+-------+-------+-------+-------+--------------+-----------------+-------+
     | STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
     +-------+-------+-------+-------+-------+--------------+-----------------+-------+
     | 57.09 | 75.64 | 67.53 | 72.61 | 74.13 |    70.46     |      70.16      | 69.66 |
     +-------+-------+-------+-------+-------+--------------+-----------------+-------+

     layer=32
     +-------+-------+-------+-------+-------+--------------+-----------------+-------+
     | STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
     +-------+-------+-------+-------+-------+--------------+-----------------+-------+
     | 51.18 | 73.74 | 63.13 | 68.87 | 70.96 |    63.29     |      67.45      | 65.52 |
     +-------+-------+-------+-------+-------+--------------+-----------------+-------+
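In both settings, the {text} placeholder is filled with each input sentence before encoding. A trivial illustration (the example sentence is just a stand-in):

# Illustration only: how the {text} placeholder is substituted per sentence.
prompt = 'Summarize sentence "{text}" in one word:"'
print(prompt.format(text='A man is playing a guitar.'))
# -> Summarize sentence "A man is playing a guitar." in one word:"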

For your convenience, I have uploaded the latest code for the degradation experiment. It works with the latest angle_emb==0.4.5, so please upgrade angle_emb first.

The code is here: https://github.com/4AI/BeLLM/blob/main/eval_degradation.py

Here is an example:

CUDA_VISIBLE_DEVICES=1 python eval_degradation.py --model_name_or_path NousResearch/Llama-2-7b-hf \
--prompt 'Summarize sentence "{text}" in one word:"' \
--pooling_strategy last \
--is_llm 1 \
--layer_index 31
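For intuition, --layer_index selects which hidden layer the sentence embedding is pooled from. Below is a rough, hypothetical sketch of layer-wise last-token pooling using plain transformers, not the repo's actual implementation; it assumes hidden_states[0] is the embedding output and hidden_states[32] the final block of the 32-layer model, so index 31 picks the penultimate block:

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained('NousResearch/Llama-2-7b-hf')
model = AutoModel.from_pretrained('NousResearch/Llama-2-7b-hf', torch_dtype=torch.float16)

prompt = 'Summarize sentence "{text}" in one word:"'
inputs = tok(prompt.format(text='A man is playing a guitar.'), return_tensors='pt')

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states has 33 entries: embeddings plus one per transformer block.
emb = out.hidden_states[31][:, -1, :]  # last-token pooling at layer 31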