Closed fahadh4ilyas closed 1 month ago
https://github.com/McGill-NLP/llm2vec/blob/785fdb5971e96dcbd8d4e5e5ad5ce1e6bc1afeea/llm2vec/llm2vec.py#L157
This split only takes effect when that exact substring appears in the string. Otherwise, nothing is split:
text = 'Here is a text! This text have exclamation mark'
print(text.split("!@#$%^&*()"))
# ['Here is a text! This text have exclamation mark']
I guess the intention is this?
text = 'Here is a text! This text have exclamation mark'
print(do_split(text))
# ['Here is a text', ' This text have exclamation mark']
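For reference, `do_split` is not a function in llm2vec; it is a hypothetical name standing for "split on any one of those characters". A minimal sketch of what I had in mind, assuming a regex character class is acceptable:

```python
import re

# Hypothetical helper, not part of llm2vec: split on ANY single character
# from the set !@#$%^&*() rather than on the whole 10-character substring.
_ANY_SPECIAL_CHAR = re.compile(r"[!@#$%^&*()]")


def do_split(text):
    """Split text at every occurrence of one of the special characters."""
    return _ANY_SPECIAL_CHAR.split(text)


text = 'Here is a text! This text have exclamation mark'
print(do_split(text))
# ['Here is a text', ' This text have exclamation mark']
```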
Never mind, it's used for the E5 datasets. I thought it was a random substring.
It is there to separate the instruction tokens from the sentence tokens, because only the sentence tokens are considered during mean pooling.
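To make the idea concrete, here is a toy sketch of how a separator can be used to exclude the instruction from mean pooling. It is not the llm2vec implementation: the helper names (`split_instruction`, `sentence_mask`, `masked_mean`) are made up for illustration, whitespace splitting stands in for a real tokenizer, and plain lists stand in for tensors.

```python
SEPARATOR = "!@#$%^&*()"  # the separator string discussed in this issue


def split_instruction(text, separator=SEPARATOR):
    """Return (instruction, sentence); instruction is '' if no separator."""
    instruction, found, sentence = text.partition(separator)
    if not found:
        return "", text
    return instruction, sentence


def sentence_mask(text):
    """Build a 0/1 mask: 0 for instruction tokens, 1 for sentence tokens.

    Assumption: whitespace tokens stand in for real tokenizer output.
    """
    instruction, sentence = split_instruction(text)
    return [0] * len(instruction.split()) + [1] * len(sentence.split())


def masked_mean(vectors, mask):
    """Average only the vectors whose mask entry is 1."""
    kept = [v for v, m in zip(vectors, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]


text = "Retrieve relevant passages:" + SEPARATOR + "llm2vec turns decoders into encoders"
mask = sentence_mask(text)
print(mask)  # [0, 0, 0, 1, 1, 1, 1, 1]

# Toy per-token vectors: the first component is just the token position.
vectors = [[float(i), 1.0] for i in range(len(mask))]
print(masked_mean(vectors, mask))  # [5.0, 1.0]
```

With the mask applied, the three instruction tokens do not contribute to the pooled vector, which is why the separator has to be a string that is unlikely to occur in ordinary text.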