Alex357853 opened this issue 5 months ago
Hi @Alex357853, thanks for following our work. Since ESE is under review, we didn't provide many details.

1) For UAE, you can try increasing `ibn_w` to 20 and evaluating with `cls_avg` pooling (training with `cls`); see the UAE sketch near the end of this comment.
2) For Qwen, we use bi-directional LLMs, i.e., we remove the causal mask of the LLM. For more details, you can refer to this documentation: https://angle.readthedocs.io/en/latest/notes/training.html#angle-trainer-recommended (in 3. Examples / b. LLaMA-based). Specifically, we set `--apply_billm 1`, `--billm_model_class Qwen2ForCausalLM`, `--load_kbit 8`, and `--epochs 2`; a hedged sketch of how these flags might fit together is included after the evaluation command below. I've uploaded the evaluation script here and made the ese-qwen weights public. You can try evaluating the public model and check whether the evaluation works as expected.
The evaluation script is as follows:

```bash
BiLLM_START_INDEX=0 CUDA_VISIBLE_DEVICES=0 python eval_ese_nli.py --pooling_strategy avg --model_name_or_path Qwen/Qwen1.5-0.5B --lora_weight WhereIsAI/ese-qwen-0.5b-nli --billm_model_class Qwen2ForCausalLM
```
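For reference, a minimal sketch of a training command combining those flags might look like the following. Only `--apply_billm`, `--billm_model_class`, `--load_kbit`, and `--epochs` are taken from this thread; the other flag names, the data path, and the output directory are assumptions/placeholders and should be checked against the `angle-trainer` documentation linked above.

```bash
# Hypothetical sketch, not the exact command used for the paper.
# --apply_billm, --billm_model_class, --load_kbit, and --epochs are from this thread;
# --train_name_or_path and --save_dir are assumed flag names with placeholder values.
BiLLM_START_INDEX=0 CUDA_VISIBLE_DEVICES=0 angle-trainer \
  --model_name_or_path Qwen/Qwen1.5-0.5B \
  --train_name_or_path /path/to/nli_train.jsonl \
  --apply_billm 1 \
  --billm_model_class Qwen2ForCausalLM \
  --load_kbit 8 \
  --epochs 2 \
  --save_dir ckpts/ese-qwen-0.5b-nli
```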
BTW, you can try increasing `gradient_accumulation_steps` to x times the GPU count. It might help improve performance further.
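And for point 1, a similarly hedged sketch of the UAE setting. Only `--ibn_w`, the `cls`/`cls_avg` pooling choices, and `--gradient_accumulation_steps` come from this thread; the remaining flag names, values, and paths are assumptions/placeholders.

```bash
# Hypothetical sketch for the UAE suggestion: ibn_w=20, train with cls, evaluate with cls_avg.
# --gradient_accumulation_steps 4 stands in for the "x times gpu_count" tip above (x=4, 1 GPU).
# --train_name_or_path, --save_dir, and the epoch count are assumptions/placeholders.
CUDA_VISIBLE_DEVICES=0 angle-trainer \
  --model_name_or_path WhereIsAI/UAE-Large-V1 \
  --train_name_or_path /path/to/nli_train.jsonl \
  --pooling_strategy cls \
  --ibn_w 20.0 \
  --epochs 1 \
  --gradient_accumulation_steps 4 \
  --save_dir ckpts/uae-large-ibn20

# Evaluate the saved checkpoint with cls_avg pooling (whether eval_ese_nli.py accepts a
# full non-LoRA checkpoint via --model_name_or_path is also an assumption).
python eval_ese_nli.py --pooling_strategy cls_avg --model_name_or_path ckpts/uae-large-ibn20
```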
BTW, you can try using the newly released `Qwen/Qwen2-0.5B`; it might boost the performance further.
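For example, the hypothetical training sketch above would only need the base model swapped (Qwen2-0.5B uses the same `Qwen2ForCausalLM` architecture class, so `--billm_model_class` stays unchanged); the earlier caveats about assumed flag names and placeholder paths still apply.

```bash
# Same hypothetical sketch as above, with only the base model changed.
BiLLM_START_INDEX=0 CUDA_VISIBLE_DEVICES=0 angle-trainer \
  --model_name_or_path Qwen/Qwen2-0.5B \
  --train_name_or_path /path/to/nli_train.jsonl \
  --apply_billm 1 \
  --billm_model_class Qwen2ForCausalLM \
  --load_kbit 8 \
  --epochs 2 \
  --save_dir ckpts/ese-qwen2-0.5b-nli
```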
Hi @SeanLee97, thanks for your prompt reply! I am still struggling with the code. I noticed that your trainer cannot train with the "last" pooling strategy. The potential bug I found is in https://github.com/SeanLee97/AnglE/blob/191ca1beeb430082226ca2af23fdc451e7643807/angle_emb/angle.py#L667-L672

For example, after https://github.com/SeanLee97/AnglE/blob/191ca1beeb430082226ca2af23fdc451e7643807/angle_emb/angle.py#L661-L666 we already get `features['attention_mask'] = tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])`. However, after lines L667-L672 it becomes `tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643])`.

I think this may affect the model's performance from the beginning of training, and other pooling strategies as well. Could you please clarify whether this is an issue in your code? Thank you for your time and help!
@Alex357853 Thank you for reporting this issue! It is indeed a bug: the code uses the pad token id to pad the attention mask, but in Qwen the pad token id is 151643, not 0: https://huggingface.co/Qwen/Qwen2-0.5B-Instruct/blob/main/tokenizer_config.json#L36

I am fixing this issue in this PR: https://github.com/SeanLee97/AnglE/pull/89

Thank you again!
Hi, this is a really good and useful codebase. I tried to reproduce the results reported in the paper but failed. I used the code in `README_ESE.md`:

I also changed `--cosine_w 0.` to `--cosine_w 1.0` and `--ibn_w 10.0` to `--ibn_w 35.0`, but the results were even worse. For the original `WhereIsAI/UAE-Large-V1` model, the results are:

This means fine-tuning gave me worse performance. In addition, I noticed that the more epochs I train, the worse the performance gets.

Besides, I also tried the code in `examples/NLI/README.md` to train `Qwen1.5-0.5B`:

It gave me an average score of 70.23, whereas the paper reports 82.82.

I wonder whether these scripts are the ones you used to train your model, especially regarding the parameter values. It would be really helpful if you could assist me in reproducing the results so I can use this codebase. I really appreciate your time and help! Thank you!
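For comparison, a minimal sketch of evaluating a locally trained Qwen LoRA with the evaluation script shared earlier in this thread; the adapter path is a placeholder, and only flags already shown above are used.

```bash
# Hypothetical: evaluate a locally trained LoRA adapter the same way as the public
# WhereIsAI/ese-qwen-0.5b-nli weights; "ckpts/my-qwen-nli-lora" is a placeholder path.
BiLLM_START_INDEX=0 CUDA_VISIBLE_DEVICES=0 python eval_ese_nli.py \
  --pooling_strategy avg \
  --model_name_or_path Qwen/Qwen1.5-0.5B \
  --lora_weight ckpts/my-qwen-nli-lora \
  --billm_model_class Qwen2ForCausalLM
```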