FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs

A second question about reproducing bge-en-icl #1193

Open · greeneggsandyaml opened this issue 1 day ago

greeneggsandyaml commented 1 day ago

Hello authors, thanks for your quick responses on my previous issues!

I'm opening a new issue to ask whether these are the right hyperparameters for training bge-en-icl. I'm finding that I can achieve pretty good performance, but not as good as the official bge-en-icl checkpoint.

I'm using the following hyperparameters (shown here as a YAML file):

# Model arguments
model_name_or_path: /mnt/large_shared/models/Mistral-7B-Instruct-v0.3
cache_dir: /mnt/large_shared/cache/embed_icl/bge_full_model_cache
use_flash_attn: True
token: ...
use_lora: True
lora_alpha: 64
lora_rank: 32
save_merged_lora_model: True
target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - down_proj
  - up_proj

# Data arguments
train_data: /mnt/large_shared/data/bge-full-data
cache_path: /mnt/large_shared/cache/embed_icl/bge_full_data_cache
passage_max_len: 512
query_max_len: 512
example_query_max_len: 256
example_passage_max_len: 256
symmetric_batch_size: 256
total_max_len: 2048
max_class_neg: 7
train_group_size: 8
symmetric_train_group_size: 8
use_special_tokens: True

# Training arguments
run_name: 
output_dir: 
deepspeed: configs/deepspeed/stage1.json
dataloader_drop_last: True
ddp_find_unused_parameters: False
fp16: True
gradient_checkpointing: True
learning_rate: 1.0e-4
logging_steps: 1
negatives_cross_device: True
normlized: True
num_train_epochs: 1
per_device_train_batch_size: 16  # I'm using 4 nodes, so 16 * 8 * 4 = 512 total batch size
save_steps: 1000
save_total_limit: 20
temperature: 0.02
warmup_steps: 100
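
For reference, since `use_lora: True` and `save_merged_lora_model: True` are set, here is a minimal sketch of merging a saved LoRA adapter back into the Mistral base model by hand with PEFT, e.g. to double-check a checkpoint before evaluation. The adapter and output paths are hypothetical, and it assumes the adapter is stored in the standard PEFT layout:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "/mnt/large_shared/models/Mistral-7B-Instruct-v0.3"
adapter_path = "path/to/saved/lora_adapter"   # hypothetical: a checkpoint dir containing the adapter weights
out_dir = "path/to/merged_model"              # hypothetical output dir

# Load the base model and attach the trained LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_path)

# Fold the LoRA deltas into the base weights and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained(out_dir)
AutoTokenizer.from_pretrained(base_path).save_pretrained(out_dir)
```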

Do you know what might be the cause of the discrepancy between the performance of my trained model and your pretrained model? I'm happy to provide any more details if you feel that they would be helpful.

(I know I'm using Mistral v0.3 rather than v0.1; I don't think this makes much of a difference, but perhaps you know otherwise.)

Thanks so much! greeneggsandyaml

545999961 commented 1 day ago

The parameters seem fine. Is 'total_max_len' derived from the query? How is the final model performing?
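
(For illustration, a back-of-the-envelope check with the values from the config above, under the assumption that `total_max_len` bounds the concatenation of the in-context examples and the query prompt; the actual semantics are defined by the training code.)

```python
# Assumption: total_max_len caps the in-context examples plus the query prompt.
example_len = 256 + 256      # example_query_max_len + example_passage_max_len
query_len = 512              # query_max_len
total_max_len = 2048

budget_for_examples = total_max_len - query_len
print(budget_for_examples // example_len)  # -> 3: about three full-length examples fit alongside the query
```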

545999961 commented 1 day ago

I set the target modules to q_proj, k_proj, v_proj, o_proj, down_proj, up_proj, and gate_proj, but this may not make a big difference.
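
For concreteness, that module list maps onto a PEFT `LoraConfig` roughly as follows, reusing `lora_rank`/`lora_alpha` from the YAML above; this is an illustrative sketch, not necessarily how the training script constructs it internally:

```python
from peft import LoraConfig

# Illustrative LoRA setup with gate_proj included; r / lora_alpha taken from the YAML above.
lora_config = LoraConfig(
    r=32,            # lora_rank
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "down_proj", "up_proj", "gate_proj",
    ],
)
```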