Open rxqy opened 4 months ago
Hi, I'm confused about the pooling strategy you used here.
For training, you use the avg token https://github.com/4AI/BeLLM/blob/9da9269e51d462535964d9bf82aaa14fa3ff6d7c/README.md?plain=1#L52
While for evaluation, you are not specifying any pooling flag here, https://github.com/4AI/BeLLM/blob/9da9269e51d462535964d9bf82aaa14fa3ff6d7c/README.md?plain=1#L99-L105 so it should fall back to the default value, `cls`, right? https://github.com/4AI/BeLLM/blob/9da9269e51d462535964d9bf82aaa14fa3ff6d7c/eval_sts.py#L57
As for the paper, you mentioned using the representative word as the pivot, so that should be the last non-padding token, right? So I'm wondering: which token should I use, or does it make no difference in a decoder-based model like LLaMA?
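To make the question concrete, here is a minimal sketch of the three pooling strategies I'm asking about (this is illustrative only, not BeLLM's actual code; the `pool` function and its arguments are hypothetical names):

```python
import numpy as np

def pool(hidden_states, attention_mask, strategy):
    """Collapse per-token hidden states into one sentence embedding.

    hidden_states:  (seq_len, dim) array of token representations.
    attention_mask: (seq_len,) array of 1s (real tokens) and 0s (padding).
    """
    mask = attention_mask.astype(bool)
    if strategy == "avg":
        # mean over non-padding tokens (what the training command specifies)
        return hidden_states[mask].mean(axis=0)
    if strategy == "cls":
        # first token (what the eval script seems to default to)
        return hidden_states[0]
    if strategy == "last":
        # last non-padding token (my reading of the paper's "pivot" token)
        last_idx = mask.nonzero()[0][-1]
        return hidden_states[last_idx]
    raise ValueError(f"unknown strategy: {strategy}")
```

For a left-to-right decoder like LLaMA, only the last non-padding token has attended to the whole sequence, which is why I'd expect `last` and `avg`/`cls` to behave quite differently.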