OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0

Running inference with distilled models? #185

Closed: qingy1337 closed this issue 1 day ago

qingy1337 commented 1 day ago

Basically, I would like to run video retrieval using this distilled model: https://huggingface.co/OpenGVLab/InternVideo2_distillation_models/blob/main/stage1/L14/L14_dist_1B_stage2/pytorch_model.bin

I am using this code to load the distilled CLIP L14 model:

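import os
import sys

# Make the multi_modality package importable. The path is absolute here,
# matching the /kaggle/working location shown in the traceback below.
sys.path.append('/kaggle/working/InternVideo/InternVideo2/multi_modality')

import io
import cv2
import numpy as np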

import torch

from demo.config import Config, eval_dict_leaf
from demo.utils import retrieve_text, _frame_from_video, setup_internvideo2

# Load the CLIP L14 config (path relative to multi_modality),
# then post-process it with eval_dict_leaf.
config = Config.from_file('scripts/pretraining/clip/L14/config.py')
config = eval_dict_leaf(config)

I have placed mobile_clip_blt.pt and 1B_clip.pth inside your_model_path, and the L14 model itself (pytorch_model.bin) is in the current folder.

However, when I run this code:

intern_model, tokenizer = setup_internvideo2(config)

I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[24], line 1
----> 1 intern_model, tokenizer = setup_internvideo2(config)

File /kaggle/working/InternVideo/InternVideo2/multi_modality/demo/utils.py:84, in setup_internvideo2(config)
     82     model = InternVideo2_Stage2(config=config, tokenizer=tokenizer, is_pretrain=True)
     83 else:
---> 84     model = InternVideo2_Stage2(config=config, is_pretrain=True)
     85     tokenizer = model.tokenizer
     87 if config.get('compile_model', False):

TypeError: InternVideo2_Stage2.__init__() missing 1 required positional argument: 'tokenizer'

Am I missing something here?

qingy1337 commented 1 day ago

Update: I got it working by using this config file and this model class. I passed the config into the class and used that to do text retrieval.

BTW, for anyone with the same problem, you need these files:

  1. 1B_clip.pth
  2. InternVideo2 checkpoint
  3. mobileclip_blt.pth
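
A quick way to confirm the files are in place before building the model is sketched below. It is a minimal check, assuming all three files sit in one directory. `missing_files` is a hypothetical helper, and `internvideo2_checkpoint.pth` is a placeholder name, since the thread does not give the checkpoint's filename. The thread also spells the MobileCLIP file two ways (mobile_clip_blt.pt and mobileclip_blt.pth), so use whichever name your config points to.

```python
import tempfile
from pathlib import Path

# File names taken from the list above; the last entry is a placeholder
# because the thread does not name the InternVideo2 checkpoint file.
REQUIRED = ["1B_clip.pth", "internvideo2_checkpoint.pth", "mobileclip_blt.pth"]


def missing_files(model_dir, required=REQUIRED):
    """Return the required file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


# Small demonstration on a temporary directory holding only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "1B_clip.pth").touch()
    demo_missing = missing_files(tmp)

print(demo_missing)
```

In practice you would call `missing_files("your_model_path")` and stop early if the returned list is not empty.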