Open CarolXh opened 1 year ago
It seems you are training in eval mode. Check /home/uos/.cache/huggingface/modules/transformers_modules/Baichuan2-13B-Chat/modeling_baichuan.py, line 354. Maybe you should call model.train() before training.
I used the official fine-tuning script to tune my model. The script calls trainer.train(), and I hit the problem during training. It appears when I set the training parameter eval-strategy=steps: the training steps run fine, but the run is interrupted at the evaluation steps.
The same error also occurs during inference.
I think this is a package version problem. The developers should pin the package versions in requirements.
Possibly. I specifically uninstalled torch and reinstalled version 2.0.0 as required by requirements, and it still fails.
Same question: fine-tuning baichuan2 with deepspeed also fails at the eval step with AttributeError: 'Parameter' object has no attribute 'ds_status'
Traceback (most recent call last):
  File "main.py", line 525, in <module>
    main(run_args)
  File "main.py", line 419, in main
    perplexity = evaluation(model, eval_dataloader)
  File "main.py", line 350, in evaluation
    outputs = model(batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Base/modeling_baichuan.py", line 697, in forward
    logits = self.lm_head(hidden_states)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Base/modeling_baichuan.py", line 508, in forward
    norm_weight = self.weight
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1605, in __getattr__
    return _parameters[name]
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 132, in __getitem__
    if param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
AttributeError: 'Parameter' object has no attribute 'ds_status'
Update: this looks like an incompatibility with DeepSpeed's ZeRO-3 implementation; see microsoft/DeepSpeed#1757. After switching to ZeRO-2 the error no longer occurs.
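For reference, the ZeRO-2 workaround described above could be expressed roughly as follows. This is only a minimal sketch of a DeepSpeed config built as a Python dict: `zero_optimization.stage` is the documented DeepSpeed key, while the helper name `make_ds_config` and the batch-size values are hypothetical placeholders to adapt to your own setup.

```python
import json


def make_ds_config(zero_stage: int = 2) -> dict:
    """Build a minimal DeepSpeed config dict (hypothetical helper, not part of DeepSpeed)."""
    return {
        # Placeholder values; tune these for your hardware.
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": 1,
        # Stage 2 partitions optimizer states and gradients but not the
        # parameters, so the ZeRO-3 parameter bookkeeping that raised the
        # ds_status error is not involved.
        "zero_optimization": {"stage": zero_stage},
    }


ds_config = make_ds_config(zero_stage=2)
print(json.dumps(ds_config, indent=2))
```

Because ZeRO-2 keeps a full copy of the parameters on every GPU, this trades the error for higher memory use during full-parameter training.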
I hit the same problem: DeepSpeed ZeRO3 raises this error during inference. It did not occur with Baichuan1.
Is there any other fix besides switching to zero2? Full-parameter training with zero2 needs a lot of resources.
Has this problem been solved since? (Updating packages doesn't seem to help?)
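The cause is that, in the code below, self.weight = nn.Parameter(nn.functional.normalize(self.weight)) wipes out the attributes that DeepSpeed stage 3 created on the parameter.
The first version, which does not normalize the head, does not have this problem.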
class NormHead(nn.Module):
    def __init__(self, hidden_size, vocab_size, bias=False):
        super().__init__()
        self.weight = nn.Parameter(torch.empty((vocab_size, hidden_size)))
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        self.first_flag = True

    def forward(self, hidden_states):
        if self.training:
            norm_weight = nn.functional.normalize(self.weight)
        elif self.first_flag:
            self.first_flag = False
            self.weight = nn.Parameter(nn.functional.normalize(self.weight))
            norm_weight = self.weight
        else:
            norm_weight = self.weight
        return nn.functional.linear(hidden_states, norm_weight)
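The effect described above can be reproduced without DeepSpeed. The sketch below uses plain PyTorch and attaches a stand-in attribute named `ds_status` by hand; the value assigned to it is a hypothetical placeholder, not what DeepSpeed actually stores. It shows that wrapping a tensor in a fresh `nn.Parameter` yields a new object that carries none of the attributes set on the old one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

weight = nn.Parameter(torch.randn(4, 3))
# Stand-in for the bookkeeping attributes that ZeRO stage 3 attaches to
# each parameter (placeholder value, for illustration only).
weight.ds_status = "AVAILABLE"

# Same pattern as the eval branch of NormHead.forward: build a brand-new
# Parameter from the normalized weight.
with torch.no_grad():
    rewrapped = nn.Parameter(F.normalize(weight))

original_has_attr = hasattr(weight, "ds_status")
rewrapped_has_attr = hasattr(rewrapped, "ds_status")
is_same_object = rewrapped is weight

print("original has ds_status: ", original_has_attr)
print("rewrapped has ds_status:", rewrapped_has_attr)
print("same Parameter object:  ", is_same_object)
```

Any code that later looks up `ds_status` on the module's weight, as the ZeRO-3 parameter hook in the traceback does, would then see the new object and fail.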
How can this problem be solved?
I ran into the same problem. According to this issue, the nn.Parameter() is only re-created here as a speed-up. After changing the code as follows, validation passes:
class NormHead(nn.Module):
    def __init__(self, hidden_size, vocab_size, bias=False):
        super().__init__()
        self.weight = nn.Parameter(torch.empty((vocab_size, hidden_size)))
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        # self.first_flag = True

    def forward(self, hidden_states):
        # if self.training:
        #     norm_weight = nn.functional.normalize(self.weight)
        # elif self.first_flag:
        #     self.first_flag = False
        #     self.weight = nn.Parameter(nn.functional.normalize(self.weight))
        #     norm_weight = self.weight
        # else:
        #     norm_weight = self.weight
        norm_weight = nn.functional.normalize(self.weight)
        return nn.functional.linear(hidden_states, norm_weight)
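A quick way to sanity-check the change above, as a self-contained sketch on CPU with small made-up sizes (`hidden_size=8`, `vocab_size=16` are arbitrary test values): it confirms that the patched head never replaces its `weight` parameter and produces the same logits in train and eval mode. It does not exercise DeepSpeed itself.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class NormHead(nn.Module):
    """Patched head from the comment above: always normalize on the fly."""

    def __init__(self, hidden_size, vocab_size, bias=False):
        super().__init__()
        self.weight = nn.Parameter(torch.empty((vocab_size, hidden_size)))
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))

    def forward(self, hidden_states):
        # No re-assignment of self.weight, so attributes attached to the
        # parameter by an external framework are left untouched.
        norm_weight = F.normalize(self.weight)
        return F.linear(hidden_states, norm_weight)


head = NormHead(hidden_size=8, vocab_size=16)
param_before = head.weight
x = torch.randn(2, 8)

head.train()
train_logits = head(x)

head.eval()
with torch.no_grad():
    eval_logits = head(x)

same_param = head.weight is param_before
same_logits = torch.allclose(train_logits, eval_logits)
print("parameter object preserved:", same_param)
print("train/eval logits match:   ", same_logits)
```

The cost of this approach is one extra normalization of the full vocab_size x hidden_size weight on every eval forward pass, which is the speed-up the original code was caching.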
I am fine-tuning with the transformers Trainer class, and every time it reaches the eval step it fails with the following error:
AttributeError: Caught AttributeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/uos/miniconda3/envs/llm/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/home/uos/miniconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/uos/miniconda3/envs/llm/lib/python3.10/site-packages/peft/peft_model.py", line 931, in forward
    return self.base_model(
  File "/home/uos/miniconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/uos/miniconda3/envs/llm/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 94, in forward
    return self.model.forward(*args, **kwargs)
  File "/home/uos/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/uos/.cache/huggingface/modules/transformers_modules/Baichuan2-13B-Chat/modeling_baichuan.py", line 692, in forward
    outputs = self.model(
  File "/home/uos/miniconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/uos/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/uos/.cache/huggingface/modules/transformers_modules/Baichuan2-13B-Chat/modeling_baichuan.py", line 404, in forward
    alibi_mask = self.get_alibi_mask(inputs_embeds, seq_length_with_past)
  File "/home/uos/.cache/huggingface/modules/transformers_modules/Baichuan2-13B-Chat/modeling_baichuan.py", line 354, in get_alibi_mask
    mask = self.future_mask[
  File "/home/uos/miniconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'BaichuanModel' object has no attribute 'future_mask'
After that I switched to LLaMA-Efficient-Tuning and tuned baichuan2 the same way I had tuned baichuan1, this time with deepspeed. It also fails at the eval step, with: AttributeError: 'Parameter' object has no attribute 'ds_status'. What is causing this?