X-PLUG / mPLUG-Owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
https://www.modelscope.cn/studios/damo/mPLUG-Owl
MIT License

RuntimeError: "conv_depthwise3d" not implemented for 'BFloat16' #129

Closed: zhouwei5113 closed this issue 1 year ago

zhouwei5113 commented 1 year ago

I tried adding training code on top of the video pretrained model, but got the following error. How can I fix this?

Traceback (most recent call last):
  File "/workspace/nas-data/2023/xPLUG/mPLUG-Owl/./pipeline/train_video.py", line 227, in <module>
    main()
  File "/workspace/nas-data/2023/xPLUG/mPLUG-Owl/./pipeline/train_video.py", line 222, in main
    trainer.train()
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2699, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2731, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/peft/peft_model.py", line 416, in forward
    return self.get_base_model()(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/nas-data/2023/xPLUG/mPLUG-Owl/mplug_owl_video/modeling_mplug_owl.py", line 1547, in forward
    video_embeds = self.vision_model(video_pixel_values, return_dict=True).last_hidden_state
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/nas-data/2023/xPLUG/mPLUG-Owl/mplug_owl_video/modeling_mplug_owl.py", line 696, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/nas-data/2023/xPLUG/mPLUG-Owl/mplug_owl_video/modeling_mplug_owl.py", line 630, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/usr/local/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/usr/local/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/workspace/nas-data/2023/xPLUG/mPLUG-Owl/mplug_owl_video/modeling_mplug_owl.py", line 626, in custom_forward
    return module(*inputs, output_attentions)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/nas-data/2023/xPLUG/mPLUG-Owl/mplug_owl_video/modeling_mplug_owl.py", line 374, in forward
    hidden_states = hidden_states + self.temporal(hidden_states)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/nas-data/2023/xPLUG/mPLUG-Owl/mplug_owl_video/modeling_mplug_owl.py", line 215, in forward
    x = torch.nn.functional.conv3d(
RuntimeError: "conv_depthwise3d" not implemented for 'BFloat16'

MAGAer13 commented 1 year ago

We ran into the same problem. Our workaround is to manually convert bf16 to fp16 and convert back during training.
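For anyone landing here, the cast-and-cast-back idea might look roughly like the sketch below. This is not the maintainers' actual patch: the helper name `conv3d_via_dtype` and its `compute_dtype` parameter are hypothetical, and it assumes the failing call is the `torch.nn.functional.conv3d` at line 215 of `modeling_mplug_owl.py` shown in the traceback.

```python
import torch
import torch.nn.functional as F


def conv3d_via_dtype(x, weight, bias=None, compute_dtype=torch.float16, **conv_kwargs):
    """Hypothetical helper: run conv3d in `compute_dtype`, then cast the
    result back to the dtype of the input tensor (e.g. bfloat16)."""
    orig_dtype = x.dtype
    out = F.conv3d(
        x.to(compute_dtype),           # cast activations, e.g. bf16 -> fp16
        weight.to(compute_dtype),      # cast the kernel for this call only
        None if bias is None else bias.to(compute_dtype),
        **conv_kwargs,                 # stride, padding, dilation, groups
    )
    return out.to(orig_dtype)          # cast back so later layers still see bf16
```

Inside the temporal module's `forward`, the existing `torch.nn.functional.conv3d(...)` call would be replaced by `conv3d_via_dtype(...)` with the same arguments. The `.to()` casts are differentiable, so gradients still reach the bf16 parameters. Note that fp16 has a narrower dynamic range than bf16, so large activations can overflow; passing `compute_dtype=torch.float32` is a safer, more memory-hungry alternative.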