return f"{instruction.strip()} !@#$%^&*(){text}" if instruction else f"!@#$%^&*(){text}"
When instruction is empty, original_texts and texts_2 end up different even though they are intended to be the same. The cause is the extra whitespace: the tokenizer treats 'example' and ' example' differently, so the two are tokenized differently. This can even raise an error like:
RuntimeError: The expanded size of the tensor (22) must match the existing size (23) at non-singleton dimension 0.
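To make the whitespace problem concrete, here is a minimal sketch in plain Python (no tokenizer needed). It assumes the current code always puts a space before the separator, and that tokenize() splits on the separator roughly as the helper below does. convert_old, convert_fixed and split_for_tokenization are hypothetical names for illustration, not functions from llm2vec.

```python
SEPARATOR = "!@#$%^&*()"


def convert_old(instruction: str, text: str) -> str:
    # Assumed pre-fix behavior: a space is always placed before the separator.
    return f"{instruction.strip()} {SEPARATOR}{text}"


def convert_fixed(instruction: str, text: str) -> str:
    # The fix proposed in this issue: no leading space when instruction is empty.
    return f"{instruction.strip()} {SEPARATOR}{text}" if instruction else f"{SEPARATOR}{text}"


def split_for_tokenization(combined: str) -> tuple[str, str]:
    # Assumed to mirror how tokenize() derives the two strings:
    # the full text with the separator removed, and the part after the separator.
    parts = combined.split(SEPARATOR)
    text_2 = parts[1] if len(parts) > 1 else ""
    original_text = "".join(parts)
    return original_text, text_2


old_original, old_text_2 = split_for_tokenization(convert_old("", "example"))
new_original, new_text_2 = split_for_tokenization(convert_fixed("", "example"))

print(repr(old_original), repr(old_text_2))
print(repr(new_original), repr(new_text_2))
```

With the assumed old format the first string keeps a leading space that the second one lacks, so a tokenizer that is sensitive to leading whitespace can produce different token counts for the two. With the fix, both strings are identical when instruction is empty.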
https://github.com/McGill-NLP/llm2vec/blob/361f77852dbc87ca0a8cd94d0bdcc24be3abb9ea/llm2vec/llm2vec.py#L274
The code linked above needs to be changed to the return statement shown at the top of this issue.
https://github.com/McGill-NLP/llm2vec/blob/361f77852dbc87ca0a8cd94d0bdcc24be3abb9ea/llm2vec/llm2vec.py#L156-L197