DAMO-NLP-SG / Video-LLaMA

[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
BSD 3-Clause "New" or "Revised" License

Problem running demo: Loading checkpoint shards never finishes #165

Open jpssoares opened 2 months ago

jpssoares commented 2 months ago

My specs are:

GPU 0: NVIDIA A100-PCIE-40GB
MEM: 60 GB

My config file looks like:

model:
  arch: video_llama
  model_type: pretrain_vicuna
  freeze_vit: True
  freeze_qformer: True
  max_txt_len: 512
  end_sym: "###"
  low_resource: False

  frozen_llama_proj: False

  # If you want use LLaMA-2-chat,
  # some ckpts could be download from our provided huggingface repo
  # i.e.  https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-13B-Finetuned
  llama_model: "/user/home/j.soares/data/models/Video-LLaMA-2-7B-Pretrained/llama-2-7b-chat-hf"
  imagebind_ckpt_path: "/user/home/j.soares/data/models/Video-LLaMA-2-7B-Pretrained/"
  ckpt: '/user/home/j.soares/data/models/Video-LLaMA-2-7B-Pretrained/VL_LLaMA_2_7B_Pretrained.pth'   # you can use our pretrained ckpt from https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-13B-Pretrained/
  ckpt_2:  '/user/home/j.soares/data/models/Video-LLaMA-2-7B-Pretrained/AL_LLaMA_2_7B_Pretrained.pth'

  equip_audio_branch: True  # whether equips the audio branch
  fusion_head_layers: 2
  max_frame_pos: 32
  fusion_header_type: "seqTransf"

datasets:
  webvid:
    vis_processor:
      train:
        name: "alpro_video_eval"
        n_frms: 8
        image_size: 224
    text_processor:
      train:
        name: "blip_caption"

run:
  task: video_text_pretrain
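
A quick sanity check for a config like the one above is to confirm that every path it names actually exists and is the kind of object (file or directory) that is expected. The sketch below is only an illustration: `check_paths` is a hypothetical helper, not part of Video-LLaMA, and the labels simply mirror the config keys. It does not know which keys the loader expects to be files and which directories, so it only reports what it finds.

```python
from pathlib import Path


def check_paths(paths):
    """Classify each configured path as 'dir', 'file', or 'missing'.

    `paths` maps a label (here: the config key) to a path string.
    Hypothetical helper for illustration; not part of the Video-LLaMA repo.
    """
    report = {}
    for label, raw in paths.items():
        p = Path(raw)
        if p.is_dir():
            report[label] = "dir"
        elif p.is_file():
            report[label] = "file"
        else:
            report[label] = "missing"
    return report


if __name__ == "__main__":
    # Paths copied from the config above.
    base = "/user/home/j.soares/data/models/Video-LLaMA-2-7B-Pretrained"
    configured = {
        "llama_model": f"{base}/llama-2-7b-chat-hf",
        "imagebind_ckpt_path": f"{base}/",
        "ckpt": f"{base}/VL_LLaMA_2_7B_Pretrained.pth",
        "ckpt_2": f"{base}/AL_LLaMA_2_7B_Pretrained.pth",
    }
    for key, kind in check_paths(configured).items():
        print(f"{key}: {kind}")
```

Any entry reported as `missing` would be worth fixing before looking further into the hang.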

When I run the demo, I get stuck here:

/user/home/j.soares/.conda/envs/videollama/lib/python3.9/site-packages/torchvision/transforms/_functional_video.py:6: UserWarning: The 'torchvision.transforms._functional_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms.functional' module instead.
  warnings.warn(
/user/home/j.soares/.conda/envs/videollama/lib/python3.9/site-packages/torchvision/transforms/_transforms_video.py:22: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms' module instead.
  warnings.warn(
    Using pad_token, but it is not set yet.
Initializing Chat
Loading VIT
Loading VIT Done
Loading Q-Former
Loading checkpoint shards: 100%|██████████| 2/2 [00:14<00:00,  7.42s/it]

What should I try?

jpssoares commented 2 months ago

I have tried both the pretrained and the fine-tuned models, but the same problem occurs.
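
Since the log stops right after the shards reach 100%, one generic way to see where the process is actually waiting is Python's standard `faulthandler` module, which can dump the stack of every thread after a timeout. This is a minimal sketch, assuming the two hypothetical helpers below are pasted near the top of the demo script; the 120-second default is an arbitrary choice, not a value from the repo.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=120, repeat=True, file=None):
    """Dump all thread tracebacks if the process is still running after timeout_s.

    Hypothetical helper for illustration. With repeat=True the dump is
    written every timeout_s seconds until the watchdog is cancelled.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=repeat, file=target)


def disarm_hang_watchdog():
    """Cancel the pending dump once the slow step has completed."""
    faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    arm_hang_watchdog()
    # ... model initialisation from the demo script would go here ...
    disarm_hang_watchdog()
```

The traceback that gets printed shows the call that is blocking (for example a checkpoint load or a device transfer), which is usually more informative than the last log line.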