Closed fuyuchenIfyw closed 2 months ago
Sorry for the delayed reply.
I replied to your email a few hours ago. For convenience, I also reply here:
I checked the degradation code and found that we used a prompt from previous work in the degradation experiment: `Summarize sentence "{text}" in one word:"`. Sorry for the confusion; we forgot to specify this prompt in the degradation section.
But the conclusion is similar no matter which prompt is used: STS performance is poor at the last layer. Here is a quick experiment with different prompts:
```
prompt=The representative word for sentence {text} is:"

layer=31
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 49.39 | 71.19 | 57.81 | 64.50 | 63.22 |    57.51     |      58.83      | 60.35 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

layer=32
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 44.79 | 65.73 | 50.39 | 58.70 | 58.10 |    51.42     |      47.92      | 53.86 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
```
```
prompt=Summarize sentence "{text}" in one word:"

layer=31
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 57.09 | 75.64 | 67.53 | 72.61 | 74.13 |    70.46     |      70.16      | 69.66 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

layer=32
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 51.18 | 73.74 | 63.13 | 68.87 | 70.96 |    63.29     |      67.45      | 65.52 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
```
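For reference, the scores in these tables are the usual SentEval-style metric: Spearman correlation between the cosine similarities of the sentence-pair embeddings and the gold similarity labels, scaled by 100. A minimal sketch of that computation (my own illustration, not the repo's code):

```python
# Illustrative sketch of how an STS score like the ones above is computed:
# cosine similarity per sentence pair, then Spearman correlation with the
# gold similarity labels, scaled by 100. Not BeLLM's actual evaluation code.
import numpy as np
from scipy.stats import spearmanr

def sts_score(emb_a, emb_b, gold):
    """emb_a, emb_b: (n_pairs, dim) embeddings; gold: n_pairs gold scores."""
    sims = np.sum(emb_a * emb_b, axis=1) / (
        np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
    )
    return spearmanr(sims, gold)[0] * 100
```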
For your convenience, I have uploaded the latest code for the degradation experiment. It works with the latest `angle_emb==0.4.5`, so please upgrade `angle_emb` first.
The code is here: https://github.com/4AI/BeLLM/blob/main/eval_degradation.py
Here is an example:
```shell
CUDA_VISIBLE_DEVICES=1 python eval_degradation.py --model_name_or_path NousResearch/Llama-2-7b-hf \
    --prompt 'Summarize sentence "{text}" in one word:"' \
    --pooling_strategy last \
    --is_llm 1 \
    --layer_index 31
```
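As a sanity check on what `--pooling_strategy last` means here: the sentence embedding is the hidden state of the last non-padding token, taken from whichever layer `--layer_index` selects. A small illustrative helper (my own sketch, not the script's actual code):

```python
import torch

def last_token_pool(hidden_states, attention_mask):
    """Sketch of 'last' pooling: pick the hidden state of the last
    non-padding token of each sequence.
    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len)."""
    last_idx = attention_mask.sum(dim=1) - 1        # index of last real token
    batch_idx = torch.arange(hidden_states.size(0))
    return hidden_states[batch_idx, last_idx]       # (batch, dim)

# The prompt template is applied with plain str.format, e.g.:
prompt = 'Summarize sentence "{text}" in one word:"'
filled = prompt.format(text="A cat sits on the mat.")
```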
Hello,
I have a question regarding the specific settings used for the Llama-7B model in Figure 4 of the BeLLM paper.
Following the steps in the README, I was able to successfully reproduce the results of the last row in Table 1 of the BeLLM paper. During the installation of the necessary Python packages, I found that `angle-emb==3.1.0` is no longer available, so I installed the library using `pip install angle-emb` instead of `pip install angle-emb==3.1.0`. As a result, I needed to make the following changes to the `model.py` script:

- Removed the `is_llm` parameter in line 233, because the new version of `angle-emb` does not support this option.
- Set `self.pooling_strategy` to `'last'` in line 53.

After making these modifications, downloading the model, and running the command provided in step 4 (Evaluation) of the README, I was able to reproduce the results as shown in the paper.
However, I encountered an issue when trying to reproduce the results of the Llama-7B model in Figure 4 of the BeLLM paper. I made the following changes:

- Set `BiLLM_START_INDEX` to -1 to disable the bidirectional attention mechanism.
- Set `lora_name_or_path` to None.
- Set the `apply_lora` parameter to False in line 84 of `eval_sts.py`, where the model is defined.

I expected these changes to reproduce the vanilla Llama2-7B results on the STS benchmark. However, my experimental results are approximately 10 points lower than the Llama-7B results at 32 layers shown in Figure 4.
Experimental Results:
Could you please clarify the specific settings used for the Llama-7B model in Figure 4? Any guidance on how to reproduce these results would be greatly appreciated.
Thank you!