FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs

pre-training bge-m3 with large text corpus #645

Open sumitsoman opened 3 months ago

sumitsoman commented 3 months ago

I am using a JSONL file of 1.6M lines to pre-train BAAI/bge-large-en using

torchrun --nproc_per_node 1 \
  -m FlagEmbedding.baai_general_embedding.retromae_pretrain.run \
  --output_dir ./my_output_folder \
  --model_name_or_path BAAI/bge-large-en \
  --train_data ./data.jsonl \
  --learning_rate 2e-5 \
  --num_train_epochs 2 \
  --per_device_train_batch_size 4 \
  --dataloader_drop_last True \
  --max_seq_length 512 \
  --logging_steps 10 \
  --dataloader_num_workers 12

This creates many checkpoints in the output folder and the run eventually exhausts disk space. I tried changing logging_steps to 100000, but it still seems to save a checkpoint every 1k steps. How can this be resolved? Are there any other parameters I should change?

Secondly, this does not seem to work for other BGE models. Is there a specific list of models this script works with?

staoxiao commented 3 months ago

You can set --save_steps 10000 to change the save steps. For more hyper-parameters, you can refer to https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.save_steps
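For example, the command from the original post with --save_steps added as suggested above; --save_total_limit is a standard Hugging Face TrainingArguments option that should additionally cap how many checkpoints are kept on disk, assuming the script forwards standard Trainer arguments:

torchrun --nproc_per_node 1 \
  -m FlagEmbedding.baai_general_embedding.retromae_pretrain.run \
  --output_dir ./my_output_folder \
  --model_name_or_path BAAI/bge-large-en \
  --train_data ./data.jsonl \
  --learning_rate 2e-5 \
  --num_train_epochs 2 \
  --per_device_train_batch_size 4 \
  --dataloader_drop_last True \
  --max_seq_length 512 \
  --logging_steps 10 \
  --dataloader_num_workers 12 \
  --save_steps 10000 \
  --save_total_limit 2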

This script can be used for all bge embedding models.

sumitsoman commented 3 months ago

I tried it with the bge-m3 model and it gives "DataCollatorForWholeWordMask is only suitable for BertTokenizer-like tokenizers." Are there other changes to be made?

bairuifengmaggie commented 3 months ago

> I tried it with the bge-m3 model and it gives "DataCollatorForWholeWordMask is only suitable for BertTokenizer-like tokenizers." Are there other changes to be made?

Hello, I want to ask whether you have solved this problem. I am hitting the same error; if you have found a solution, could you share it?
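For anyone hitting the same warning, here is a minimal diagnostic sketch (not from this thread, and assuming the public tokenizer configs on the Hugging Face Hub). It shows why the message appears for bge-m3 but not bge-large-en: bge-m3 ships an XLM-RoBERTa-style tokenizer, while DataCollatorForWholeWordMask in transformers expects a BertTokenizer-like tokenizer.

# Hypothetical check, not part of the FlagEmbedding scripts: inspect which
# tokenizer class each model ships, since DataCollatorForWholeWordMask only
# supports BertTokenizer/BertTokenizerFast-style tokenizers.
from transformers import AutoTokenizer, BertTokenizer, BertTokenizerFast

for name in ("BAAI/bge-large-en", "BAAI/bge-m3"):
    tok = AutoTokenizer.from_pretrained(name)
    ok = isinstance(tok, (BertTokenizer, BertTokenizerFast))
    print(f"{name}: {type(tok).__name__} -> {'BERT-like' if ok else 'not BERT-like'}")

If that is what you see, whole-word masking as implemented there relies on a BERT-style WordPiece vocabulary, so pre-training bge-m3 with this script would likely need a different data collator suited to sentencepiece tokenizers; that is an assumption on my side, not an official fix from the maintainers.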