InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

RuntimeError: expected mat1 and mat2 to have the same dtype #665

Closed vincent507cpu closed 6 months ago

vincent507cpu commented 6 months ago

Error message:

Traceback (most recent call last):
  File "/home/zwj/GitHub/xtuner-main/xtuner/tools/train.py", line 360, in <module>
    main()
  File "/home/zwj/GitHub/xtuner-main/xtuner/tools/train.py", line 356, in main
    runner.train()
  File "/home/zwj/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/runner.py", line 1777, in train
    model = self.train_loop.run()  # type: ignore
  File "/home/zwj/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/loops.py", line 271, in run
    self.runner.call_hook('before_train')
  File "/home/zwj/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/runner.py", line 1839, in call_hook
    getattr(hook, fn_name)(self, **kwargs)
  File "/home/zwj/GitHub/xtuner-main/xtuner/engine/hooks/evaluate_chat_hook.py", line 230, in before_train
    self._generate_samples(runner, max_new_tokens=50)
  File "/home/zwj/GitHub/xtuner-main/xtuner/engine/hooks/evaluate_chat_hook.py", line 216, in _generate_samples
    self._eval_images(runner, model, device, max_new_tokens,
  File "/home/zwj/GitHub/xtuner-main/xtuner/engine/hooks/evaluate_chat_hook.py", line 148, in _eval_images
    generation_output = model.generate(
  File "/home/zwj/miniconda3/envs/xtuner/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/zwj/miniconda3/envs/xtuner/lib/python3.10/site-packages/transformers/generation/utils.py", line 1622, in generate
    result = self._sample(
  File "/home/zwj/miniconda3/envs/xtuner/lib/python3.10/site-packages/transformers/generation/utils.py", line 2791, in _sample
    outputs = self(
  File "/home/zwj/miniconda3/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zwj/miniconda3/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zwj/miniconda3/envs/xtuner/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/zwj/miniconda3/envs/xtuner/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1230, in forward
    logits = self.lm_head(hidden_states)
  File "/home/zwj/miniconda3/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zwj/miniconda3/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zwj/miniconda3/envs/xtuner/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/zwj/miniconda3/envs/xtuner/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::Half
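
The failing call is easy to reproduce outside xtuner: F.linear refuses to multiply matrices with different dtypes. A minimal sketch of the mismatch and the generic fix (this sketch uses bfloat16 so it also runs on CPU; the trace above shows the same mismatch with float16 / c10::Half):

```python
import torch
import torch.nn as nn

# A reduced-precision head fed full-precision activations reproduces
# the "expected mat1 and mat2 to have the same dtype" error.
lm_head = nn.Linear(8, 4).to(torch.bfloat16)  # weights in bfloat16
hidden_states = torch.randn(2, 8)             # activations in float32

try:
    lm_head(hidden_states)
except RuntimeError as e:
    print(e)  # expected mat1 and mat2 to have the same dtype, ...

# Casting the activations to the weight dtype removes the mismatch.
out = lm_head(hidden_states.to(lm_head.weight.dtype))
print(out.dtype)  # torch.bfloat16
```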

Thank you very much!

LZHgrla commented 6 months ago

@vincent507cpu What training command did you use?

vincent507cpu commented 6 months ago

@LZHgrla After switching to a server with different GPUs, the problem no longer occurs. The command was: xtuner train ./xtuner/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/pretrain/llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretrain.py --deepspeed deepspeed_zero2 > output.log

lxtGH commented 3 months ago

This issue occurs when the command is run without the deepspeed option.

tcxia commented 2 months ago

@lxtGH Training works fine, but this error appears during evaluation. How can it be solved?
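
Editor's note: a generic workaround, not an xtuner-specific fix. When training runs under a framework (e.g. DeepSpeed) that manages precision but evaluation calls the model directly, some submodules can be left in a different dtype than others. Casting the whole module, and the inputs, to one dtype before generation avoids the mismatch. A sketch with a hypothetical stand-in model:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the evaluated model; in the real setup this
# would be the fine-tuned model just before model.generate(...).
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 4))
model[1].to(torch.bfloat16)  # simulate one layer left in a different dtype

print({p.dtype for p in model.parameters()})  # mixed float32 / bfloat16

# Unify all parameters and buffers to a single dtype before inference.
model.to(torch.bfloat16)
print({p.dtype for p in model.parameters()})  # {torch.bfloat16}

x = torch.randn(2, 8).to(torch.bfloat16)  # inputs must match as well
print(model(x).dtype)  # torch.bfloat16
```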