ZeyuTeng96 opened this issue 1 year ago
What if you change logging_steps to something above 10?
Above 10 there will definitely be non-zero values, but the problem is that the bloom config sets "gradient_accumulation_steps": 32, which means each logged step has already gone through 32 batches. If the first few steps still show no learning rate in that case, something seems a bit off.
I've seen something similar in the transformers issues. One explanation seems to be that it's caused by how lr and the optimizer are set in the deepspeed config; another is that the model was pretrained in bf16 but is now being fine-tuned with fp16?
The issue is here: https://github.com/huggingface/transformers/issues/14531
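If the fp16-vs-bf16 mismatch is the cause, one thing that could be tried (just my guess, not something verified in this thread) is replacing the "fp16" block in the deepspeed config with a bf16 one:

{
  "bf16": { "enabled": true }
}

bf16 needs Ampere-class GPUs (e.g. A100), and the corresponding bf16 flag would also need to be enabled on the transformers Trainer side.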
May I ask: if you run with the official bloom config and deepspeed config, does the lr = 0 problem show up?
Are you running on an A100? We didn't run into the lr=0 problem in our experiments.
It's an 80G A100. If you set logging_steps to 1, does this happen on your side?
We'll find some time to try it and see whether we can reproduce the problem.
Great, looking forward to your feedback.
Hi, I ran the experiments below. The bloom config is:
{
  "model_type": "bloom",
  "model_name_or_path": "bigscience/bloom-1b1",
  "data_path": "data/trans_1.json",
  "output_dir": "trained_models/bloom",
  "per_device_train_batch_size": 1,
  "num_epochs": 2,
  "learning_rate": 1e-5,
  "cutoff_len": 1024,
  "val_set_size": 1000,
  "val_set_rate": 0.1,
  "save_steps": 1000,
  "eval_steps": 1000,
  "logging_steps": 1,
  "gradient_accumulation_steps": 32
}
Deepspeed config #1 is:
{
  "train_batch_size": "auto",
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "overwrite": true,
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
Deepspeed config #2 (close to the officially provided config) is:
{
  "train_batch_size": "auto",
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": "auto",
      "betas": [0.9, 0.999],
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "overwrite": true,
  "gradient_accumulation_steps": "auto",
  "fp16": {
    "enabled": true,
    "min_loss_scale": 1,
    "opt_level": "O2"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  }
}
When fine-tuning the plain 1b1 model without deepspeed, the learning rate changes as follows:
{'loss': 2.6999, 'learning_rate': 5.263157894736843e-07, 'epoch': 0.01}
{'loss': 2.7946, 'learning_rate': 1.0526315789473685e-06, 'epoch': 0.02}
{'loss': 3.1472, 'learning_rate': 1.5789473684210526e-06, 'epoch': 0.03}
{'loss': 2.7722, 'learning_rate': 2.105263157894737e-06, 'epoch': 0.04}
{'loss': 2.9574, 'learning_rate': 2.631578947368421e-06, 'epoch': 0.05}
{'loss': 2.7037, 'learning_rate': 2.631578947368421e-06, 'epoch': 0.07}
{'loss': 2.9451, 'learning_rate': 2.631578947368421e-06, 'epoch': 0.08}
{'loss': 2.8337, 'learning_rate': 3.157894736842105e-06, 'epoch': 0.09}
{'loss': 2.9723, 'learning_rate': 3.6842105263157896e-06, 'epoch': 0.1}
{'loss': 3.008, 'learning_rate': 4.210526315789474e-06, 'epoch': 0.11}
{'loss': 3.0198, 'learning_rate': 4.736842105263158e-06, 'epoch': 0.12}
{'loss': 2.9892, 'learning_rate': 5.263157894736842e-06, 'epoch': 0.13}
{'loss': 2.4021, 'learning_rate': 5.789473684210527e-06, 'epoch': 0.14}
{'loss': 2.344, 'learning_rate': 5.789473684210527e-06, 'epoch': 0.15}
{'loss': 2.4769, 'learning_rate': 6.31578947368421e-06, 'epoch': 0.16}
{'loss': 2.2217, 'learning_rate': 6.842105263157896e-06, 'epoch': 0.18}
{'loss': 2.4098, 'learning_rate': 6.842105263157896e-06, 'epoch': 0.19}
{'loss': 1.9803, 'learning_rate': 7.368421052631579e-06, 'epoch': 0.2}
{'loss': 2.1771, 'learning_rate': 7.894736842105265e-06, 'epoch': 0.21}
{'loss': 2.4345, 'learning_rate': 8.421052631578948e-06, 'epoch': 0.22}
{'loss': 2.4525, 'learning_rate': 8.947368421052632e-06, 'epoch': 0.23}
{'loss': 2.585, 'learning_rate': 9.473684210526315e-06, 'epoch': 0.24}
{'loss': 2.7307, 'learning_rate': 1e-05, 'epoch': 0.25}
When using deepspeed config #1, the learning rate changes as follows:
{'loss': 2.8091, 'learning_rate': 5.263157894736843e-07, 'epoch': 0.01}
{'loss': 2.8488, 'learning_rate': 1.0526315789473685e-06, 'epoch': 0.02}
{'loss': 2.9292, 'learning_rate': 1.5789473684210526e-06, 'epoch': 0.03}
{'loss': 2.8395, 'learning_rate': 2.105263157894737e-06, 'epoch': 0.04}
{'loss': 3.1188, 'learning_rate': 2.631578947368421e-06, 'epoch': 0.05}
{'loss': 2.9179, 'learning_rate': 3.157894736842105e-06, 'epoch': 0.07}
{'loss': 2.8102, 'learning_rate': 3.6842105263157896e-06, 'epoch': 0.08}
{'loss': 2.8484, 'learning_rate': 4.210526315789474e-06, 'epoch': 0.09}
{'loss': 2.9805, 'learning_rate': 4.736842105263158e-06, 'epoch': 0.1}
{'loss': 2.7548, 'learning_rate': 5.263157894736842e-06, 'epoch': 0.11}
{'loss': 2.6809, 'learning_rate': 5.789473684210527e-06, 'epoch': 0.12}
{'loss': 2.5852, 'learning_rate': 6.31578947368421e-06, 'epoch': 0.13}
{'loss': 2.6456, 'learning_rate': 6.842105263157896e-06, 'epoch': 0.14}
{'loss': 2.6222, 'learning_rate': 7.368421052631579e-06, 'epoch': 0.15}
{'loss': 2.2331, 'learning_rate': 7.894736842105265e-06, 'epoch': 0.16}
{'loss': 2.2346, 'learning_rate': 8.421052631578948e-06, 'epoch': 0.18}
{'loss': 1.9481, 'learning_rate': 8.947368421052632e-06, 'epoch': 0.19}
{'loss': 1.98, 'learning_rate': 9.473684210526315e-06, 'epoch': 0.2}
{'loss': 2.2987, 'learning_rate': 1e-05, 'epoch': 0.21}
When using deepspeed config #2, the learning rate changes as follows:
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 2.7112, 'learning_rate': 0, 'epoch': 0.01}
1%|▉ | 2/182 [01:05<1:37:41, 32.56s/it]tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 2.9341, 'learning_rate': 0, 'epoch': 0.02}
2%|█▎ | 3/182 [01:37<1:36:59, 32.51s/it]tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 3.093, 'learning_rate': 0, 'epoch': 0.03}
{'loss': 2.9688, 'learning_rate': 0.0, 'epoch': 0.04}
{'loss': 2.9455, 'learning_rate': 2.3540891336663827e-06, 'epoch': 0.05}
{'loss': 3.0102, 'learning_rate': 2.3540891336663827e-06, 'epoch': 0.07}
{'loss': 3.1245, 'learning_rate': 3.73114300021637e-06, 'epoch': 0.08}
{'loss': 2.8258, 'learning_rate': 4.7081782673327655e-06, 'epoch': 0.09}
{'loss': 2.9814, 'learning_rate': 5.466025697329025e-06, 'epoch': 0.1}
{'loss': 2.5915, 'learning_rate': 5.466025697329025e-06, 'epoch': 0.11}
{'loss': 2.8165, 'learning_rate': 6.0852321338827525e-06, 'epoch': 0.12}
{'loss': 2.6727, 'learning_rate': 6.60876371636064e-06, 'epoch': 0.13}
{'loss': 2.7603, 'learning_rate': 7.062267400999148e-06, 'epoch': 0.14}
{'loss': 2.0928, 'learning_rate': 7.46228600043274e-06, 'epoch': 0.15}
{'loss': 2.4763, 'learning_rate': 7.820114830995408e-06, 'epoch': 0.16}
{'loss': 2.2755, 'learning_rate': 8.143810382095967e-06, 'epoch': 0.18}
{'loss': 2.07, 'learning_rate': 8.439321267549136e-06, 'epoch': 0.19}
{'loss': 1.9242, 'learning_rate': 8.711164930263437e-06, 'epoch': 0.2}
{'loss': 2.0989, 'learning_rate': 8.962852850027021e-06, 'epoch': 0.21}
{'loss': 1.9225, 'learning_rate': 9.197168697545394e-06, 'epoch': 0.22}
{'loss': 1.766, 'learning_rate': 9.416356534665531e-06, 'epoch': 0.23}
{'loss': 2.6338, 'learning_rate': 9.416356534665531e-06, 'epoch': 0.24}
{'loss': 2.7871, 'learning_rate': 9.622251858852542e-06, 'epoch': 0.25}
{'loss': 3.1649, 'learning_rate': 9.816375134099122e-06, 'epoch': 0.26}
{'loss': 2.9512, 'learning_rate': 1e-05, 'epoch': 0.27}
I took a look and the Trainer's default optimizer seems to be AdamW, but the optimizer type in the officially provided deepspeed config file is Adam. Also, if fp16 and an lr scheduler are added to the deepspeed config file, the learning rate is 0 for the first few steps. @xianghuisun
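For anyone else landing here, a minimal sketch of the change being suggested (my own illustration, not a config shipped with the repo): keep the optimizer block but set its type to AdamW, and drop the "fp16" and "scheduler" blocks so the HF Trainer supplies its own warmup schedule:

{
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto",
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" }
  }
}

This is essentially what deepspeed config #1 above does, which is why its learning-rate curve looks normal.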
One more question: if we use BLOOM for instruction fine-tuning, do we need to expand the vocabulary of the bloom-7b1 model? @xianghuisun
Is the learning rate supposed to keep increasing as training goes on?
That's the warmup_lr.
For me the lr stays at 0 the whole time. What could be causing that? I keep getting this warning.
Try removing the fp16 and lr scheduler settings from the deepspeed config and changing the optimizer to AdamW; give my config above a try.
I've tried those settings and get the same problem. I'm not even using warmup, I'm on bf16, multi-node multi-GPU. The issue now is that I can't tell how many steps it takes for the lr to leave 0: sometimes it leaves 0 very quickly, sometimes it takes a few hundred steps, and sometimes it never does. I never hit this problem fine-tuning other models before... Could it be a hardware or environment issue?
It doesn't feel like a hardware or environment issue. I linked a transformers issue earlier in this thread. The problem might be caused by the parameters BLOOM was pretrained with. That's one possibility, but I'm not sure either; hopefully the maintainers can verify it when they have time and track down the cause.
I'm using LLaMA and have the same problem. Below is the explanation at the spot where huggingface raises this warning, but I'm on bf16 with ZeRO-2 and still get the warning:
# not run for the first few dozen steps while loss scale is too large, and thus during
# that time `get_last_lr` will fail if called during that warm up stage, so work around it:
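The skipping described in that comment comes from deepspeed's dynamic loss scaler: it starts at a large scale (2^16 by default) and halves it on every overflow, and the optimizer/scheduler only start stepping once the scale has settled. Under fp16 you can shorten that settling phase with something like the following (illustrative values of mine, not taken from this repo's configs):

{
  "fp16": {
    "enabled": true,
    "initial_scale_power": 12,
    "loss_scale_window": 200,
    "min_loss_scale": 1
  }
}

Under bf16 there is no loss scaling at all, so if the warning still shows up with bf16 the cause is presumably something else.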
Any solution to this, bro? @HalcyonLiang
I didn't dig into the root cause; I just compared different configurations and avoided the problem by switching to another one:
7B on 8x A100 trains without ZeRO at all: no problem.
7B on 16x A100 with ZeRO-2 and no optimizer offload: no problem.
13B on 16x A100 with ZeRO-3 and no optimizer or parameter offload: no problem.
13B on 24x A100 with ZeRO-2 and no optimizer offload: the problem appears (it looks like GPU memory usage is quite uneven while the gradients are being partitioned across cards, and the lr only starts to warm up once the allocation evens out).
If you have time, you can test a few more combinations; this is just for reference.
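For reference, a rough sketch of the kind of ZeRO-2-without-optimizer-offload config described above (my own illustration, not the exact config that was used):

{
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}

The key point is simply that there is no "offload_optimizer" block.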
You really do have a lot of GPUs. Impressive.
I recently ran into this problem while fine-tuning llama-7b-hf with peft LoRA. It turned out to be a library version issue: downgrading transformers to 4.28.0 and deepspeed to 0.8.3 fixed it.
Thanks! Your method worked for me!
Perhaps the batch size is set so large that it leads to "CUDA out of memory", but the program does not report the error. Try making the "train_micro_batch_size_per_gpu" parameter smaller. Here's what I tried:
train_micro_batch_size_per_gpu = 4, gradient_accumulation_steps = 1: it failed, returning lr=0
train_micro_batch_size_per_gpu = 1, gradient_accumulation_steps = 4: it worked
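In deepspeed config terms, the working combination above would look roughly like this (a sketch of mine; only these two keys change and the rest of the config stays as it was):

{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 4
}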
Thanks. I downgraded transformers to 4.28.0 while keeping deepspeed at 0.12.6, and that also solved the problem.
I hit this problem while using deepspeed to fine-tune the flan-t5 family of models; the lr stayed at 0 the whole time. Of the methods above, only downgrading the Transformers version worked for me, and deepspeed did not need to be downgraded: transformers==4.40 --> 4.28.1, deepspeed=0.9.3.
Not sure whether this counts as a feature or a bug [/doge]
https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_pt_utils.py#L912
def _get_learning_rate(self):
    if self.is_deepspeed_enabled:
        # with deepspeed's fp16 and dynamic loss scale enabled the optimizer/scheduler steps may
        # not run for the first few dozen steps while loss scale is too large, and thus during
        # that time `get_last_lr` will fail if called during that warm up stage, so work around it:
        try:
            last_lr = self.lr_scheduler.get_last_lr()[0]
        except AssertionError as e:
            if "need to call step" in str(e):
                logger.warning("tried to get lr value before scheduler/optimizer started stepping, returning lr=0")
                last_lr = 0
            else:
                raise
    else:
        if isinstance(self.lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
            last_lr = self.optimizer.param_groups[0]["lr"]
        else:
            last_lr = self.lr_scheduler.get_last_lr()[0]
        if torch.is_tensor(last_lr):
            last_lr = last_lr.item()
    return last_lr
Hi, when fine-tuning the bloom-7b model on an instruction-tuning dataset with the finetune script, the first few steps print:
tried to get lr value before scheduler/optimizer started stepping, returning lr=0
What is causing this warning?
The bloom config is:
{
  "model_type": "bloom",
  "model_name_or_path": "bigscience/bloomz-7b1-mt",
  "data_path": "data/res/merge_data.json",
  "output_dir": "trained_models/bloom",
  "per_device_train_batch_size": 1,
  "num_epochs": 2,
  "learning_rate": 1e-5,
  "cutoff_len": 1024,
  "val_set_size": 1000,
  "val_set_rate": 0.1,
  "save_steps": 1000,
  "eval_steps": 1000,
  "logging_steps": 1,
  "gradient_accumulation_steps": 32
}
The deepspeed config is:
{
  "train_batch_size": "auto",
  "overwrite": true,
  "gradient_accumulation_steps": "auto",
  "fp16": {
    "enabled": true,
    "min_loss_scale": 1,
    "opt_level": "O2"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  }
}