FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

Pre-training data preparation #260

Open YanchengWang opened 8 months ago

YanchengWang commented 8 months ago

Hi,

I want to pre-train bge-large-en on my own data. Is there a requirement on the length of each {"text": str} entry in the pre-training data? And do you have any suggestions?

Thanks a lot!
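For reference, a minimal sketch of writing data in the {"text": str} format mentioned above, assuming the pre-training data is a JSONL file with one JSON object per line; the file name and example documents are illustrative:

```python
# Minimal sketch: write pre-training data as JSONL, one {"text": str} object per line.
# The file name and example texts are hypothetical.
import json

documents = [
    "First training document ...",
    "Second training document ...",
]

with open("pretrain_data.jsonl", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")
```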

staoxiao commented 8 months ago

The number of tokens should not exceed 512. The training script truncates each text to 512 tokens, so anything beyond that length will not be used in training.
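As a rough sketch of how one might check this limit before training, the snippet below counts tokens per document with the model's tokenizer. It assumes a JSONL file like the hypothetical pretrain_data.jsonl above; the checkpoint name BAAI/bge-large-en and the use of Hugging Face transformers are assumptions, not part of the training script itself.

```python
# Hedged sketch: flag documents whose tokenized length exceeds 512, since the
# training script truncates at 512 tokens and the remainder is discarded.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en")  # assumed checkpoint
MAX_TOKENS = 512

too_long = 0
total = 0
with open("pretrain_data.jsonl", encoding="utf-8") as f:
    for line in f:
        total += 1
        text = json.loads(line)["text"]
        n_tokens = len(tokenizer(text)["input_ids"])
        if n_tokens > MAX_TOKENS:
            too_long += 1

print(f"{too_long}/{total} documents exceed {MAX_TOKENS} tokens and will be truncated")
```

If many documents overshoot the limit, one option is to split them into shorter passages before training rather than letting the script discard the truncated remainder.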