Ucas-HaoranWei / Vary

[ECCV2024] Official code implementation of Vary: Scaling Up the Vision Vocabulary of Large Vision Language Models.

Problems reproducing Vary-tiny #92

Closed luohao123 closed 3 months ago

luohao123 commented 3 months ago

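Hello, I have two questions for the author:

  1. OPT-125M has a max length of 2048, but you trained with 4096. Loading the model directly at 4096 raises an error, and changing OPT's max length to 4096 makes the weights fail to load. Training only runs when I set the length to 2048. How did you train it? Please explain what needs to be changed.
  2. After changing the length to 2048, training still fails with this error: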
/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
    raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
    raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.

Why does training fail under fp16? After a few thousand steps it raises this error and the loss overflows.
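For readers hitting the same exception: it is raised by DeepSpeed's dynamic loss scaler, which lowers the loss scale after overflowing steps and gives up once the scale cannot go any lower. The sketch below shows the `fp16` and `bf16` sections of a DeepSpeed config as Python dicts. The values are illustrative assumptions, not the settings used in this repository, and switching to bf16 assumes hardware that supports it.

```python
import json

# Illustrative DeepSpeed config fragments (assumed values, not the Vary settings).
# Use one of the two sections in your own config, not both.
fp16_section = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 selects dynamic loss scaling
        "initial_scale_power": 16,  # initial scale is 2**16
        "loss_scale_window": 1000,  # clean steps before the scale is raised again
        "hysteresis": 2,
        "min_loss_scale": 1,        # the "minimum" mentioned in the exception
    }
}

bf16_section = {
    "bf16": {"enabled": True}       # bf16 training does not use loss scaling
}

print(json.dumps(bf16_section, indent=2))
```

If the scale still collapses to the minimum with fp16, that usually points to non-finite gradients, so inspecting the data and the loss values before the failing step is a reasonable next check.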

Ucas-HaoranWei commented 3 months ago

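1. OPT needs to be interpolated to 4096, i.e. interpolate the position embeddings. 2. I trained with bf16 and have never used fp16. In principle fp16 has higher precision, so the loss should not overflow. I think you need to check your data.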
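As background on the fp16 versus bf16 point: fp16 has more mantissa bits than bf16 but a much narrower exponent range (its largest finite value is 65504), so a value that bf16 can still represent may overflow in fp16. The following is a minimal stdlib-only sketch of that difference. The helper names are hypothetical, and bf16 is emulated by simple truncation of a float32, which is a simplification of real rounding behaviour.

```python
import struct


def fits_fp16(x: float) -> bool:
    """Return True if x can be packed as an IEEE half-precision float."""
    try:
        struct.pack("<e", x)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of a float32 (truncation)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


for value in (65504.0, 70000.0):
    print(value, "fits fp16:", fits_fp16(value), "| as bf16:", to_bf16(value))
```

The bf16 result is coarser than the input but stays finite, which is why bf16 training generally does not need a loss scaler.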

luohao123 commented 3 months ago

@Ucas-HaoranWei I could not find any interpolation code in the repository. How exactly should the interpolation be done?

Ucas-HaoranWei commented 3 months ago

Interpolate the OPT model's position embeddings.
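The thread does not say which interpolation was used, so the following is only a sketch of one plausible reading: linearly resampling the learned position table from 2048 to 4096 rows. It assumes the Hugging Face OPT layout, where the position embedding table carries 2 extra offset rows in front of the learned positions (2050 rows for a 2048 max length). The function names are hypothetical, and plain Python lists are used to keep the example self-contained; in practice the same logic would be applied to the checkpoint tensor.

```python
def interpolate_rows(rows, new_len):
    """Linearly resample a list of equal-length vectors to new_len rows.

    The first and last rows are aligned with the first and last input rows.
    """
    old_len = len(rows)
    if old_len < 2 or new_len < 2:
        raise ValueError("need at least two rows on both sides")
    scale = (old_len - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        pos = i * scale                      # fractional index into the old table
        lo = min(int(pos), old_len - 2)      # left neighbour, clamped at the end
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(rows[lo], rows[lo + 1])])
    return out


def extend_position_table(table, new_max_len, offset=2):
    """Keep the first `offset` rows unchanged and stretch the remaining rows.

    offset=2 is an assumption taken from the Hugging Face OPT implementation.
    """
    special, learned = table[:offset], table[offset:]
    return [list(r) for r in special] + interpolate_rows(learned, new_max_len)


# Tiny demonstration: 2 offset rows + 4 learned positions stretched to 7.
demo = [[9.0, 9.0], [8.0, 8.0],
        [0.0, 0.0], [1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
extended = extend_position_table(demo, new_max_len=7)
print(len(extended), extended[3])
```

After resizing the table, the model config's maximum position value would also need to be raised to 4096 so that the enlarged weights load without a shape mismatch.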