PierreColombo opened this issue 7 months ago
If I add:
--unified_finetuning True
I get the error: Parameter at index 387 with name model.encoder.layer.23.output.LayerNorm.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
When using gradient_checkpointing, you need to enable DeepSpeed: --deepspeed ./df_config.json
(for df_config.json, you can refer to ds_config.json).
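In case it helps with reproducing, here is a minimal sketch of what could be passed as --deepspeed ./df_config.json. The exact contents are an assumption modeled on a typical HF Trainer + DeepSpeed ZeRO stage 1 setup, not a verified copy of the repo's ds_config.json:

```python
# Sketch only: write a minimal DeepSpeed config to pass via --deepspeed ./df_config.json.
# The field choices here are an assumption (ZeRO stage 1 with "auto" values deferred to
# the HF Trainer), not copied from the repo's ds_config.json.
import json

ds_config = {
    "zero_optimization": {"stage": 1},       # shard optimizer states across GPUs
    "fp16": {"enabled": "auto"},             # follow the Trainer's --fp16 flag
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("df_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

With that file in place, the accelerate launch command only needs the extra --deepspeed ./df_config.json argument alongside --gradient_checkpointing.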
Hey, this does not work. With DeepSpeed, the grad norm is 0 for the first steps.
And without DeepSpeed:
accelerate launch \
  -m run \
  --output_dir $SUFFIX \
  --model_name_or_path bge-m3 \
  --train_data toto \
  --learning_rate $LR \
  --fp16 \
  --save_steps 0.01 \
  --num_train_epochs $EPOCHS \
  --per_device_train_batch_size $BS \
  --dataloader_drop_last True \
  --normlized True \
  --temperature $TEMP \
  --query_max_len $QUERY_ML \
  --passage_max_len $PASSAGE_ML \
  --train_group_size $TRAIN_GS \
  --negatives_cross_device \
  --logging_steps $LOGGING \
  --same_task_within_batch True \
  --enable_sub_batch False
This returns infinity / nan as the gradient. Can you help us by providing some default parameters? Cheers, Pierre
{'loss': 0.7315, 'grad_norm': nan, 'learning_rate': 1e-06, 'epoch': 0.0}
{'loss': 0.9958, 'grad_norm': inf, 'learning_rate': 1e-06, 'epoch': 0.01}
{'loss': 0.6115, 'grad_norm': 17.611980438232422, 'learning_rate': 9.990566037735847e-07, 'epoch': 0.01}
Hello, while finetuning m3 with:
EPOCHS=5
BS=64
TEMP=0.02
QUERY_ML=512
PASSAGE_ML=1024
TRAIN_GS=2
LOGGING=10
LR=1e-5
SUFFIX=test
accelerate launch \
  -m run \
  --output_dir $SUFFIX \
  --model_name_or_path bge-m3 \
  --train_data train_bge_hn \
  --learning_rate $LR \
  --gradient_checkpointing \
  --fp16 \
  --num_train_epochs $EPOCHS \
  --per_device_train_batch_size $BS \
  --dataloader_drop_last True \
  --normlized True \
  --temperature $TEMP \
  --query_max_len $QUERY_ML \
  --passage_max_len $PASSAGE_ML \
  --train_group_size $TRAIN_GS \
  --negatives_cross_device \
  --logging_steps $LOGGING \
  --same_task_within_batch True
I got:
File "python3.11/site-packages/accelerate/utils/operations.py", line 813, in call
return convert_to_fp32(self.model_forward(*args, *kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(args, *kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "bge_m3/modeling.py", line 255, in forward
targets = idxs (p_sparse_vecs.size(0) // q_sparse_vecs.size(0))
Any clues what I'm doing wrong? Thanks
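For context on the last frame: line 255 builds the labels for the contrastive loss. Below is a minimal sketch of what that expression computes, assuming the usual in-batch-negatives layout where each query is followed by train_group_size passages (one positive plus negatives); it is an illustration of the indexing, not the repo's actual code:

```python
# Illustration only (not repo code): with B queries and B * train_group_size passages
# stacked in order, each query's positive passage sits at index q_idx * train_group_size
# in the flattened passage list. The quoted line builds exactly those target indices
# for a cross-entropy loss over query-passage scores.
import torch

B, train_group_size = 4, 2                               # assumed toy sizes
q_sparse_vecs = torch.randn(B, 8)                        # B query vectors
p_sparse_vecs = torch.randn(B * train_group_size, 8)     # train_group_size passages per query

idxs = torch.arange(q_sparse_vecs.size(0), dtype=torch.long)
targets = idxs * (p_sparse_vecs.size(0) // q_sparse_vecs.size(0))
print(targets)  # tensor([0, 2, 4, 6]) -> positive passage index for each query

scores = q_sparse_vecs @ p_sparse_vecs.T                 # (B, B * train_group_size) similarities
loss = torch.nn.functional.cross_entropy(scores, targets)
```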