Running inference with distilled models?

I would like to run video retrieval using this distilled model: https://huggingface.co/OpenGVLab/InternVideo2_distillation_models/blob/main/stage1/L14/L14_dist_1B_stage2/pytorch_model.bin

I am using this code to load the config for the distilled CLIP L14 model:

import io
import os
import sys

# Make the InternVideo2 multi_modality package importable
# (absolute path, matching the one shown in the traceback).
sys.path.append('/kaggle/working/InternVideo/InternVideo2/multi_modality')

import cv2
import numpy as np
import torch

from demo.config import Config, eval_dict_leaf
from demo.utils import retrieve_text, _frame_from_video, setup_internvideo2

# Load the CLIP L14 config.
config = Config.from_file('scripts/pretraining/clip/L14/config.py')
config = eval_dict_leaf(config)

I have mobile_clip_blt.pt and 1B_clip.pth inside your_model_path, and the actual L14 checkpoint (pytorch_model.bin) in the current folder.
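To rule out a path problem, a quick check that the checkpoint files are where I expect them could look like the sketch below. The helper name missing_files is my own, and your_model_path is the placeholder directory mentioned above, not a real path.

```python
from pathlib import Path


def missing_files(model_dir, names):
    """Return the entries of `names` that are not regular files in `model_dir`."""
    root = Path(model_dir)
    return [name for name in names if not (root / name).is_file()]


# Placeholder directory and the two file names mentioned in the text above.
expected = ["mobile_clip_blt.pt", "1B_clip.pth"]
print(missing_files("your_model_path", expected))
```

An empty list means both files were found; any names printed are the ones that could not be located.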

However, when I run this code:

intern_model, tokenizer = setup_internvideo2(config)

I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[24], line 1
----> 1 intern_model, tokenizer = setup_internvideo2(config)

File /kaggle/working/InternVideo/InternVideo2/multi_modality/demo/utils.py:84, in setup_internvideo2(config)
     82     model = InternVideo2_Stage2(config=config, tokenizer=tokenizer, is_pretrain=True)
     83 else:
---> 84     model = InternVideo2_Stage2(config=config, is_pretrain=True)
     85     tokenizer = model.tokenizer
     87 if config.get('compile_model', False):

TypeError: InternVideo2_Stage2.__init__() missing 1 required positional argument: 'tokenizer'
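
As far as I can tell from the traceback, the branch on line 84 calls the constructor without a tokenizer, while the constructor requires one. Here is a minimal sketch that reproduces the same kind of failure. StandInModel and try_build are hypothetical names of mine; this is not the real InternVideo2_Stage2 class, only a class with the same three constructor arguments.

```python
class StandInModel:
    """Hypothetical stand-in; NOT the real InternVideo2_Stage2."""

    def __init__(self, config, tokenizer, is_pretrain=True):
        self.config = config
        self.tokenizer = tokenizer
        self.is_pretrain = is_pretrain


def try_build(config, tokenizer=None):
    """Mirror the two call shapes from the traceback; return (model, error)."""
    try:
        if tokenizer is not None:
            # Same shape as line 82: the tokenizer is passed explicitly.
            return StandInModel(config=config, tokenizer=tokenizer, is_pretrain=True), None
        # Same shape as line 84: the tokenizer is omitted.
        return StandInModel(config=config, is_pretrain=True), None
    except TypeError as exc:
        return None, str(exc)


model, err = try_build({"name": "demo"})
print(err)
```

With the tokenizer omitted, the stand-in fails with the same "missing 1 required positional argument: 'tokenizer'" message, so the problem seems to be which branch of setup_internvideo2 this config ends up in.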

Am I missing something here?
