FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

Cannot finetune m3 #657

[Open] PierreColombo opened this issue 7 months ago

PierreColombo commented 7 months ago

Hello, while finetuning m3 with the following settings:

```bash
EPOCHS=5 BS=64 TEMP=0.02 QUERY_ML=512 PASSAGE_ML=1024 TRAIN_GS=2 LOGGING=10 LR=1e-5 SUFFIX=test

accelerate launch \
  -m run \
  --output_dir $SUFFIX \
  --model_name_or_path bge-m3 \
  --train_data train_bge_hn \
  --learning_rate $LR \
  --gradient_checkpointing \
  --fp16 \
  --num_train_epochs $EPOCHS \
  --per_device_train_batch_size $BS \
  --dataloader_drop_last True \
  --normlized True \
  --temperature $TEMP \
  --query_max_len $QUERY_ML \
  --passage_max_len $PASSAGE_ML \
  --train_group_size $TRAIN_GS \
  --negatives_cross_device \
  --logging_steps $LOGGING \
  --same_task_within_batch True
```

I got

File "python3.11/site-packages/accelerate/utils/operations.py", line 813, in call
return convert_to_fp32(self.model_forward(*args, *kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(
args, *kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "bge_m3/modeling.py", line 255, in forward
targets = idxs
(p_sparse_vecs.size(0) // q_sparse_vecs.size(0))

Any clues what I'm doing wrong? Thanks

PierreColombo commented 7 months ago

If I add:

--unified_finetuning True

I got the error: `Parameter at index 387 with name model.encoder.layer.23.output.LayerNorm.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.`

staoxiao commented 7 months ago

When using `gradient_checkpointing`, you need to enable DeepSpeed: `--deepspeed ./df_config.json` (df_config.json can refer to ds_config.json).
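For reference, a minimal sketch of what such a config could contain, assuming the standard HF Trainer/DeepSpeed integration keys (the actual ds_config.json shipped in this repo may have different or additional entries; `"auto"` lets the trainer fill values in from its own command-line arguments such as `--fp16` and the batch size):

```bash
# Hypothetical minimal df_config.json (ZeRO stage 1); the repo's ds_config.json may differ.
cat > df_config.json <<'EOF'
{
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 1
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
EOF

# Then add the flag to the same launch command:
#   accelerate launch -m run ... --deepspeed ./df_config.json
```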

PierreColombo commented 7 months ago

Hey, this does not work. With DeepSpeed, the grad norm is 0 for the first steps.

And without DeepSpeed:

```bash
accelerate launch \
  -m run \
  --output_dir $SUFFIX \
  --model_name_or_path bge-m3 \
  --train_data toto \
  --learning_rate $LR \
  --fp16 \
  --save_steps 0.01 \
  --num_train_epochs $EPOCHS \
  --per_device_train_batch_size $BS \
  --dataloader_drop_last True \
  --normlized True \
  --temperature $TEMP \
  --query_max_len $QUERY_ML \
  --passage_max_len $PASSAGE_ML \
  --train_group_size $TRAIN_GS \
  --negatives_cross_device \
  --logging_steps $LOGGING \
  --same_task_within_batch True \
  --enable_sub_batch False
```

This returns infinity/NaN gradients. Can you help us by providing some default parameters? Cheers, Pierre

PierreColombo commented 7 months ago

```
{'loss': 0.7315, 'grad_norm': nan, 'learning_rate': 1e-06, 'epoch': 0.0}
{'loss': 0.9958, 'grad_norm': inf, 'learning_rate': 1e-06, 'epoch': 0.01}
{'loss': 0.6115, 'grad_norm': 17.611980438232422, 'learning_rate': 9.990566037735847e-07, 'epoch': 0.01}
```