McGill-NLP / llm2vec

Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
https://mcgill-nlp.github.io/llm2vec/
MIT License
816 stars, 59 forks

What is the purpose of splitting text with `!@#$%^&*()`? #72

Closed by fahadh4ilyas 1 month ago

fahadh4ilyas commented 1 month ago

https://github.com/McGill-NLP/llm2vec/blob/785fdb5971e96dcbd8d4e5e5ad5ce1e6bc1afeea/llm2vec/llm2vec.py#L157

This split only has an effect when that exact substring appears in the string. Otherwise, this is all that happens:

text = 'Here is a text! This text have exclamation mark'
print(text.split("!@#$%^&*()"))
# ['Here is a text! This text have exclamation mark']

I guess the intention is this?

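import re

def do_split(s):
    # Hypothetical helper: split on any single character from the set.
    return re.split(r"[!@#$%^&*()]", s)

text = 'Here is a text! This text have exclamation mark'
print(do_split(text))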
# ['Here is a text', ' This text have exclamation mark']

fahadh4ilyas commented 1 month ago

Never mind, it's used for the E5 datasets. I thought it was a random substring.

vaibhavad commented 1 month ago

It separates the instruction tokens from the sentence tokens, because mean pooling considers only the sentence tokens.
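
For readers following along, here is a minimal sketch of the idea, not the library's actual implementation. It assumes the marker sits between the instruction and the sentence, uses whitespace tokenization, and uses made-up 2-d vectors in place of real model outputs. The names `split_instruction` and `masked_mean_pool` are hypothetical.

```python
SEPARATOR = "!@#$%^&*()"  # the literal marker discussed in this issue


def split_instruction(text, separator=SEPARATOR):
    """Return (instruction, sentence); the instruction is empty if no marker is found."""
    instruction, found, sentence = text.partition(separator)
    if not found:
        return "", text
    return instruction, sentence


def masked_mean_pool(token_vectors, mask):
    """Average only the vectors whose mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: whitespace tokens and fake embeddings (both are assumptions).
text = "Retrieve relevant passages:" + SEPARATOR + "llm2vec encodes text"
instruction, sentence = split_instruction(text)

# Mask is 0 for instruction tokens and 1 for sentence tokens.
mask = [0] * len(instruction.split()) + [1] * len(sentence.split())
vectors = [[float(i), 1.0] for i in range(len(mask))]

pooled = masked_mean_pool(vectors, mask)
print(pooled)  # [4.0, 1.0]
```

The whole input still goes through the model, so the sentence tokens can be conditioned on the instruction; the mask only controls which token vectors contribute to the pooled embedding.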