McGill-NLP / llm2vec

Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
https://mcgill-nlp.github.io/llm2vec/
MIT License

fix `_convert_to_str` to avoid tokenization issue #107

Closed bzantium closed 4 months ago

bzantium commented 4 months ago

https://github.com/McGill-NLP/llm2vec/blob/361f77852dbc87ca0a8cd94d0bdcc24be3abb9ea/llm2vec/llm2vec.py#L274

The line linked above should be changed to:

return f"{instruction.strip()} !@#$%^&*(){text}" if instruction else f"!@#$%^&*(){text}"

When instruction is empty, original_texts and texts_2 become different because of a leading space: the tokenizer treats 'example' and ' example' differently, even though the two are intended to be the same. As a result they are tokenized differently, and this can even raise an error such as: RuntimeError: The expanded size of the tensor (22) must match the existing size (23) at non-singleton dimension 0.

https://github.com/McGill-NLP/llm2vec/blob/361f77852dbc87ca0a8cd94d0bdcc24be3abb9ea/llm2vec/llm2vec.py#L156-L197
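
The whitespace mismatch can be reproduced without loading a model. The sketch below is only an illustration under stated assumptions, not the library's code: `convert_to_str_old` assumes the original line always emitted the space before the separator (which is what the description above implies), and `split_for_tokenizer` is a hypothetical stand-in for the step that derives original_texts and texts_2 by splitting on the separator.

```python
from typing import Tuple

SEPARATOR = "!@#$%^&*()"  # marker string taken from the line proposed in this issue


def convert_to_str_old(instruction: str, text: str) -> str:
    # Assumed pre-fix behavior: the space is emitted even with no instruction.
    return f"{instruction.strip()} {SEPARATOR}{text}"


def convert_to_str_fixed(instruction: str, text: str) -> str:
    # Proposed fix from this issue: skip the space when instruction is empty.
    return f"{instruction.strip()} {SEPARATOR}{text}" if instruction else f"{SEPARATOR}{text}"


def split_for_tokenizer(combined: str) -> Tuple[str, str]:
    # Hypothetical stand-in for the split done before tokenization:
    # `full` plays the role of an original_texts entry,
    # `text_only` plays the role of a texts_2 entry.
    parts = combined.split(SEPARATOR)
    text_only = parts[1] if len(parts) > 1 else ""
    full = "".join(parts)
    return full, text_only


if __name__ == "__main__":
    # With an empty instruction the old format leaves a stray leading space.
    print(split_for_tokenizer(convert_to_str_old("", "example")))
    # With the fix both strings are identical, so they tokenize the same way.
    print(split_for_tokenizer(convert_to_str_fixed("", "example")))
```

With the old format, the full string is ' example' while the text-only string is 'example', which is exactly the pair the tokenizer handles differently. With the fix the two strings are equal when there is no instruction, and the behavior with a non-empty instruction is unchanged.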