YanchengWang opened 8 months ago
Hi,
I want to pre-train bge-large-en on my own data. Is there a requirement on the length of each {"text": str} in the pre-training process? And do you have suggestions?
Thanks a lot!
The number of tokens should not exceed 512. The script truncates each text to 512 tokens, so anything beyond that will not be used in training.
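A quick way to check your data against this limit is to count tokens per record before training. The sketch below is illustrative, not part of the official scripts: it uses a whitespace split as a stand-in tokenizer so it runs without downloads; for exact counts, substitute the model's own tokenizer (e.g. `len(tokenizer(text)["input_ids"])` with `transformers.AutoTokenizer.from_pretrained("BAAI/bge-large-en")`).

```python
MAX_TOKENS = 512  # bge-large-en's sequence-length limit

def count_tokens(text: str) -> int:
    """Stand-in tokenizer: whitespace split. Replace with the real
    tokenizer's token count for accurate results."""
    return len(text.split())

def split_by_length(examples):
    """Partition {"text": str} records into those that fit the limit
    and those the training script would truncate."""
    fits, truncated = [], []
    for ex in examples:
        if count_tokens(ex["text"]) <= MAX_TOKENS:
            fits.append(ex)
        else:
            truncated.append(ex)
    return fits, truncated

data = [{"text": "a short example"}, {"text": "word " * 600}]
ok, too_long = split_by_length(data)
print(len(ok), len(too_long))  # → 1 1
```

Records in `too_long` could be dropped, or split into multiple shorter records so the extra text still contributes to training.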