PKU-YuanGroup / Video-LLaVA

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Apache License 2.0

ImportError: cannot import name '_expand_mask' from 'transformers.models.clip.modeling_clip' #184

Open qiuchen001 opened 1 month ago

qiuchen001 commented 1 month ago

scenario: CLI inference

command: CUDA_VISIBLE_DEVICES=0 python3 -m videollava.serve.cli --model-path "/root/Video-LLaVA-7B" --file "/root/videos/8132-207209040_small.mp4" --load-4bit

issues:
[2024-07-21 04:02:21,967] [INFO] [] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/root/.conda/envs/video-llava/lib/python3.10/", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/root/.conda/envs/video-llava/lib/python3.10/", line 110, in _get_module_details
    __import__(pkg_name)
  File "/root/Video-LLaVA/videollava/", line 1, in <module>
    from .model import LlavaLlamaForCausalLM
  File "/root/Video-LLaVA/videollava/model/", line 1, in <module>
    from .language_model.llava_llama import LlavaLlamaForCausalLM, LlavaConfig
  File "/root/Video-LLaVA/videollava/model/language_model/", line 26, in <module>
    from ..llava_arch import LlavaMetaModel, LlavaMetaForCausalLM
  File "/root/Video-LLaVA/videollava/model/", line 21, in <module>
    from .multimodal_encoder.builder import build_image_tower, build_video_tower
  File "/root/Video-LLaVA/videollava/model/multimodal_encoder/", line 3, in <module>
    from .languagebind import LanguageBindImageTower, LanguageBindVideoTower
  File "/root/Video-LLaVA/videollava/model/multimodal_encoder/languagebind/", line 6, in <module>
    from .image.modeling_image import LanguageBindImage
  File "/root/Video-LLaVA/videollava/model/multimodal_encoder/languagebind/image/", line 11, in <module>
    from transformers.models.clip.modeling_clip import CLIPMLP, CLIPAttention, CLIPTextEmbeddings, CLIPVisionEmbeddings, \
ImportError: cannot import name '_expand_mask' from 'transformers.models.clip.modeling_clip' (/root/.conda/envs/video-llava/lib/python3.10/site-packages/transformers/models/clip/

I've already installed the required packages:

git clone
cd Video-LLaVA
conda create -n videollava python=3.10 -y
conda activate videollava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install decord opencv-python git+

And after that I also ran: pip install -U transformers

Stevetich commented 1 month ago

I have encountered the same problem. It seems to be caused by the installed transformers version.
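To confirm whether the installed version is the cause, a quick check along these lines might help. This is only a sketch: `has_symbol` and `installed_version` are hypothetical helper names, not part of Video-LLaVA or transformers, and the only assumption is that the failing import is the one shown in the traceback.

```python
import importlib
import importlib.metadata
from typing import Optional


def has_symbol(module_name: str, symbol: str) -> bool:
    """Return True if module_name can be imported and defines symbol."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module (or one of its own dependencies) is not importable.
        return False
    return hasattr(module, symbol)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of package, or None if absent."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None
```

Inside the conda environment, `installed_version("transformers")` reports which release `pip install -U transformers` left behind, and `has_symbol("transformers.models.clip.modeling_clip", "_expand_mask")` reports whether that release still defines the helper the LanguageBind code imports.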

sunlight146 commented 1 month ago

Did you solve the error? I encountered the same error while debugging the Video-LLaVA code.

Wuyingwen commented 1 month ago

You can copy the following code into the corresponding transformers library file (the module named in the error, transformers.models.clip.modeling_clip) to solve the problem:

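# These imports are needed only if the target file does not already have them.
from typing import Optional

import torch


def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
    # Expand a [bsz, src_len] attention mask to [bsz, 1, tgt_len, src_len].
    bsz, src_len = mask.size()
    tgt_len = tgt_len if tgt_len is not None else src_len
    expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
    # After inversion, masked positions are 1.0 and visible positions are 0.0.
    inverted_mask = 1.0 - expanded_mask
    # The first argument of masked_fill was lost in the post as rendered here;
    # inverted_mask.to(torch.bool) is what the original transformers helper used.
    return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)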