Open sumitsoman opened 3 months ago
You can set --save_steps 10000 to control how often checkpoints are written. For other hyper-parameters, see https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.save_steps
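A minimal sketch of the relevant options, assuming the pretraining script forwards standard HuggingFace TrainingArguments to the Trainer (save_total_limit is a standard Trainer option, not something specific to this repo):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./my_output_folder",
    save_steps=10000,      # write a checkpoint every 10,000 steps
    save_total_limit=2,    # keep only the 2 newest checkpoints; older ones are deleted
    logging_steps=10,      # controls log frequency only, not checkpointing
)

On the command line the equivalent would be --save_steps 10000 --save_total_limit 2; save_total_limit is the one to reach for if checkpoints are filling the disk.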
This script can be used for all bge embedding models.
I tried it with the bge-m3 model and it gives "DataCollatorForWholeWordMask is only suitable for BertTokenizer-like tokenizers." Are there other changes to be made?
Hello, I ran into the same error. Have you solved this problem? If so, could you share the solution?
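For context, the warning comes from transformers' whole-word-mask data collator, which groups subwords into words using the WordPiece "##" continuation prefix that BERT-style tokenizers emit. bge-m3 is built on XLM-RoBERTa, whose SentencePiece tokenizer marks tokens differently, so the collator's check fails. A rough sketch of the difference (exact token splits may vary):

from transformers import AutoTokenizer

bert_style = AutoTokenizer.from_pretrained("BAAI/bge-large-en")  # BERT-based bge model
m3 = AutoTokenizer.from_pretrained("BAAI/bge-m3")                # XLM-RoBERTa-based

# WordPiece marks word continuations with "##", which whole-word masking relies on:
print(bert_style.tokenize("pretraining"))  # e.g. ['pre', '##tra', '##ining']
# SentencePiece marks word starts with "▁" instead, so there is no "##" to group on:
print(m3.tokenize("pretraining"))          # e.g. ['▁pre', 'train', 'ing']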
I am using a jsonl file of 1.6M lines to pre-train bge-large-en using:
torchrun --nproc_per_node 1 \
    -m FlagEmbedding.baai_general_embedding.retromae_pretrain.run \
    --output_dir ./my_output_folder \
    --model_name_or_path BAAI/bge-large-en \
    --train_data ./data.jsonl \
    --learning_rate 2e-5 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 4 \
    --dataloader_drop_last True \
    --max_seq_length 512 \
    --logging_steps 10 \
    --dataloader_num_workers 12
This creates several checkpoints in the output folder and runs out of disk space. I tried changing logging_steps to 100000, but it still seems to create a checkpoint every 1k steps. How can this be resolved? Are there other parameters to change?
Secondly, this does not seem to work for other bge models; is there a specific list of models it works with?
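If it helps narrow this down, a hedged sketch for checking a given checkpoint, assuming the gating factor is the BERT-tokenizer check inside DataCollatorForWholeWordMask (the model list is illustrative, not an official compatibility list):

from transformers import AutoTokenizer, BertTokenizer, BertTokenizerFast

# Illustrative checkpoints; substitute any bge model name here.
for name in ["BAAI/bge-large-en", "BAAI/bge-base-en-v1.5", "BAAI/bge-m3"]:
    tok = AutoTokenizer.from_pretrained(name)
    ok = isinstance(tok, (BertTokenizer, BertTokenizerFast))
    print(f"{name}: {type(tok).__name__} -> {'should work' if ok else 'fails the collator check'}")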