fix `_convert_to_str` to avoid tokenization issue

https://github.com/McGill-NLP/llm2vec/blob/361f77852dbc87ca0a8cd94d0bdcc24be3abb9ea/llm2vec/llm2vec.py#L274

The line linked above should be changed to:

return f"{instruction.strip()} !@#$%^&*(){text}" if instruction else f"!@#$%^&*(){text}"

When `instruction` is empty, `original_texts` and `texts_2` differ by a leading space, even though they are intended to be the same. The tokenizer treats 'example' and ' example' differently, so the two strings are tokenized differently, which can even raise an error such as: RuntimeError: The expanded size of the tensor (22) must match the existing size (23) at non-singleton dimension 0.

https://github.com/McGill-NLP/llm2vec/blob/361f77852dbc87ca0a8cd94d0bdcc24be3abb9ea/llm2vec/llm2vec.py#L156-L197
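
A minimal sketch of the mismatch, using plain strings and no real tokenizer. It assumes the current line is the unconditional f-string (the fix above without the `if instruction else` branch) and that `tokenize` splits on the `!@#$%^&*()` separator roughly as shown. The helper names `convert_to_str_old`, `convert_to_str_fixed`, and `split_like_tokenize` are hypothetical and only for illustration.

```python
SEPARATOR = "!@#$%^&*()"


def convert_to_str_old(instruction: str, text: str) -> str:
    # Assumed current behaviour: a space is always inserted before the separator.
    return f"{instruction.strip()} {SEPARATOR}{text}"


def convert_to_str_fixed(instruction: str, text: str) -> str:
    # Proposed behaviour: no leading space when the instruction is empty.
    if instruction:
        return f"{instruction.strip()} {SEPARATOR}{text}"
    return f"{SEPARATOR}{text}"


def split_like_tokenize(combined: str) -> tuple[str, str]:
    # Assumed approximation of the split done in `tokenize`:
    # the full string without the separator, and the text-only part.
    parts = combined.split(SEPARATOR)
    text_only = parts[1] if len(parts) > 1 else ""
    original = "".join(parts)
    return original, text_only


old_original, old_text = split_like_tokenize(convert_to_str_old("", "example"))
new_original, new_text = split_like_tokenize(convert_to_str_fixed("", "example"))

print(repr(old_original), repr(old_text))  # ' example' 'example'
print(repr(new_original), repr(new_text))  # 'example' 'example'
```

With the old version the full string carries a leading space that the text-only string lacks, so a whitespace-sensitive tokenizer can produce different token counts for the two. With the fix, both strings are identical when the instruction is empty, and the non-empty case is unchanged.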


fix `_convert_to_str` to avoid tokenization issue #107