[Open] lizhao-8202 opened 6 months ago
We want to do Chinese grammar checking, but some special scenarios require training. Our GPU currently has 24 GB of VRAM.
At the time we did full-parameter fine-tuning on four 80 GB A100 GPUs. We never tested LoRA; this looks like the model was not placed on the GPU. I'd suggest confirming the device placement of the model and the other modules before the trainer runs, then adjusting from there. Pull requests are also welcome if you solve the problem.
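A minimal sketch of that device check, assuming a standard transformers setup (`check_device_placement` is an illustrative helper, not part of this repo):

```python
import torch

def check_device_placement(model):
    """Report and fix parameter device placement before building the Trainer."""
    devices = {p.device for p in model.parameters()}
    print("parameter devices:", devices)
    # Plain fp16/fp32 models can simply be moved. Note: a model loaded with
    # load_in_8bit=True cannot be .to()'d after quantization; it has to be
    # placed on the GPU at load time, e.g. device_map={"": 0}.
    if torch.cuda.is_available() and any(d.type == "cpu" for d in devices):
        model.to(torch.device("cuda"))
    return model
```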
I spent a long time on the LoRA side without finding the cause, mainly because I'm not familiar with the transformers framework either. At the time of the error the relevant values were:

```
weight.device: device(type='cpu')
weight_format: col_turing
weight: tensor([[ -3,  -6, -30, ...,   7,  18,  25],
        [-32,   0,   0, ..., -18, -42, -37],
        [ 76,  56, -68, ...,  30,  59,   9],
        ...,
        [ 18,  10,  -3, ...,   9, -15, -12],
        [ 12, -31,  24, ...,   0,  24,   3],
        [ -4,  37,  25, ...,  10,  -3, -20]], dtype=torch.int8)
```

The code around the error is below; the failing call is get_tile_inds(weight_format, weight.device) on the second-to-last line:
```python
weight = state_dict.get(f"{prefix}weight")
if weight is None:
    # if the state dict has no weights for this layer (e.g., LoRA finetuning), do nothing
    return
weight_format = state_dict.pop(f"{prefix}weight_format", "row")
if weight_format != "row":
    tile_indices = get_tile_inds(weight_format, weight.device)
    state_dict[f"{prefix}weight"] = undo_layout(weight, tile_indices)
```
prefix: `base_model.model.transformer.h.0.self_attention.query_key_value.base_layer.`
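As the full traceback further down shows, get_tile_inds eventually reaches torch.cuda.set_device(weight.device), which is why an int8 weight still sitting on the CPU raises "Expected a cuda device". One hedged workaround sketch (the checkpoint path is illustrative): map the saved tensors onto the GPU before load_state_dict triggers this hook.

```python
import torch

def load_checkpoint_on_gpu(model, path="output/pytorch_model.bin"):
    # Map the saved tensors straight onto the GPU so that the int8 weight
    # reaching maybe_rearrange_weight is a CUDA tensor rather than a CPU one.
    state_dict = torch.load(path, map_location="cuda:0")
    return model.load_state_dict(state_dict, strict=False)
```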
Alternatively, is there another model we could fine-tune successfully with this code? Ideally something smaller that can still rewrite Chinese text at the level of meaning. Or could you provide one of the intermediate checkpoints from the phoenix-inst-chat-7b training run?
Phoenix was fine-tuned from BLOOMZ, so the smaller BLOOMZ checkpoints (560m, 1b, 3b) can all be used. The code itself is not the hard part; the main work is preparing the task-specific data. Once the data is ready, you can also fine-tune with another framework such as LLaMA-Factory.
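For reference, a sketch of what such task-specific data might look like in the Alpaca-style instruction format that LLaMA-Factory accepts; the file name and the sentences are illustrative:

```python
import json

# Illustrative Alpaca-style records for Chinese grammar correction.
samples = [
    {
        "instruction": "纠正下面句子中的语法错误。",  # "Fix the grammar errors in this sentence."
        "input": "他昨天在商店买了三个苹果们。",
        "output": "他昨天在商店买了三个苹果。",
    },
]

with open("grammar_sft.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```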
BLOOMZ by itself doesn't have Chinese grammar-correction ability, though, does it? To implement Chinese grammar correction, would fine-tuning BLOOMZ on a task-specific dataset with LLaMA-Factory be enough? My AI knowledge is fairly limited, sorry if the question is basic.
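For what it's worth, a hedged sketch of loading the smallest BLOOMZ checkpoint mentioned above (bigscience/bloomz-560m is its public Hugging Face id):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The 560m BLOOMZ checkpoint fits comfortably in 24 GB, leaving plenty
# of headroom for LoRA or even full fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-560m")
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloomz-560m",
    device_map={"": 0},  # place the whole model on GPU 0
)
```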
I tried setting use_lora to True in finetune.py. When running trainer.train() it throws "Expected a cuda device, but got: cpu". Full log:

```
Traceback (most recent call last):
  File "/data/AI/GrammarGPT/GrammarGPT-main/finetune.py", line 319, in <module>
    fire.Fire(train)
  File "/opt/python3.10/python3/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/python3.10/python3/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/python3.10/python3/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/data/AI/GrammarGPT/GrammarGPT-main/finetune.py", line 308, in train
    trainer.train()
  File "/opt/python3.10/python3/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/opt/python3.10/python3/lib/python3.10/site-packages/transformers/trainer.py", line 2049, in _inner_training_loop
    self._load_best_model()
  File "/opt/python3.10/python3/lib/python3.10/site-packages/transformers/trainer.py", line 2225, in _load_best_model
    load_result = model.load_state_dict(state_dict, False)
  File "/opt/python3.10/python3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2175, in load_state_dict
    load(self, state_dict)
  File "/opt/python3.10/python3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2163, in load
    load(child, child_state_dict, child_prefix)  # noqa: F821
  File "/opt/python3.10/python3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2163, in load
    load(child, child_state_dict, child_prefix)  # noqa: F821
  [Previous line repeated 5 more times]
  File "/opt/python3.10/python3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2157, in load
    module._load_from_state_dict(
  File "/opt/python3.10/python3/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 416, in _load_from_state_dict
    super()._load_from_state_dict(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys,
  File "/opt/python3.10/python3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2012, in _load_from_state_dict
    hook(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
  File "/opt/python3.10/python3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 73, in __call__
    return self.hook(*args, **kwargs)
  File "/opt/python3.10/python3/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 366, in maybe_rearrange_weight
    tile_indices = get_tile_inds(weight_format, weight.device)
  File "/opt/python3.10/python3/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 247, in get_tile_inds
    return get_inverse_transform_indices(transform, _get_tile_size(format)).to(device)
  File "/opt/python3.10/python3/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 79, in get_inverse_transform_indices
    permuted_tile_i = transform_tile(sample_tile_i)
  File "/opt/python3.10/python3/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 245, in <lambda>
    transform = lambda x: F.transform(x.to(device), from_order="row", to_order=format)[0].to(x.device)
  File "/opt/python3.10/python3/lib/python3.10/site-packages/bitsandbytes/functional.py", line 2196, in transform
    prev_device = pre_call(A.device)
  File "/opt/python3.10/python3/lib/python3.10/site-packages/bitsandbytes/functional.py", line 417, in pre_call
    torch.cuda.set_device(device)
  File "/opt/python3.10/python3/lib/python3.10/site-packages/torch/cuda/__init__.py", line 397, in set_device
    device = _get_device_index(device)
  File "/opt/python3.10/python3/lib/python3.10/site-packages/torch/cuda/_utils.py", line 34, in _get_device_index
    raise ValueError(f"Expected a cuda device, but got: {device}")
ValueError: Expected a cuda device, but got: cpu
```
If I'm going down the LoRA path here, are there other parameters that need adjusting, such as resume_from_checkpoint? Or can the error be avoided by passing some other argument at train time? Also, what was the hardware environment of your original experiments: GPU memory, CPU, system RAM, and so on?
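One hedged workaround the traceback suggests (a sketch, not verified against this setup): the crash happens inside Trainer._load_best_model, which reloads the best checkpoint from disk as CPU tensors into the quantized model after training. Turning off load_best_model_at_end sidesteps that reload, at the cost of keeping the final rather than the best weights:

```python
from transformers import TrainingArguments

# Skipping the post-training reload avoids the get_tile_inds CPU-device path.
# output_dir and the remaining arguments follow whatever finetune.py uses.
args = TrainingArguments(
    output_dir="outputs",
    load_best_model_at_end=False,
)
```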