Open rxqy opened 4 months ago
Hi, I'm confused about the pooling strategy you used here.
For training, you use the avg token https://github.com/4AI/BeLLM/blob/9da9269e51d462535964d9bf82aaa14fa3ff6d7c/README.md?plain=1#L52
While for evaluation, you are not specifying any pooling flag here, https://github.com/4AI/BeLLM/blob/9da9269e51d462535964d9bf82aaa14fa3ff6d7c/README.md?plain=1#L99-L105 so it should fall back to the default value, `cls`, right? https://github.com/4AI/BeLLM/blob/9da9269e51d462535964d9bf82aaa14fa3ff6d7c/eval_sts.py#L57
As for the paper, you mentioned using the representative word as the pivot, so that should be the last non-padding token, right? So I'm wondering: which token should I use, or does it make no difference in a decoder-based model like LLaMA?
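To make the question concrete, here is a minimal sketch of the three pooling strategies I'm asking about (this is illustrative only, not BeLLM's actual code; the `pool` function and its arguments are hypothetical names):

```python
import numpy as np

def pool(hidden_states, attention_mask, strategy):
    """Collapse per-token hidden states into one sentence embedding.

    hidden_states:  (seq_len, dim) array of token representations.
    attention_mask: (seq_len,) array of 1s (real tokens) and 0s (padding).
    """
    mask = attention_mask.astype(bool)
    if strategy == "avg":
        # mean over non-padding tokens (what the training command specifies)
        return hidden_states[mask].mean(axis=0)
    if strategy == "cls":
        # first token (what the eval script seems to default to)
        return hidden_states[0]
    if strategy == "last":
        # last non-padding token (my reading of the paper's "pivot" token)
        last_idx = mask.nonzero()[0][-1]
        return hidden_states[last_idx]
    raise ValueError(f"unknown strategy: {strategy}")
```

For a left-to-right decoder like LLaMA, only the last non-padding token has attended to the whole sequence, which is why I'd expect `last` and `avg`/`cls` to behave quite differently.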