OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0

Cannot load ViT-L/14 pretrained model. #49

Closed by trThanhnguyen 10 months ago

trThanhnguyen commented 1 year ago

Hi authors, thank you for sharing your work; I appreciate it. I'm trying to use the pretrained ViT-L/14 model for my video-text retrieval application. I followed the download link for ViT-L/14 that you put in './Downstream/Video-Text-Retrieval/README.md' (under Pre-trained Weights), namely https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt, and then tried to load the checkpoint weights. However, the file cannot be loaded with torch.load(). Note: I followed your environment guidance strictly.

Log of errors:

UserWarning: 'torch.load' received a zip file that looks like a TorchScript archive, dispatching to 'torch.jit.load' (call 'torch.jit.load' directly to silence this warning)

Traceback (most recent call last):
  File "/home/hoangtv/anaconda3/envs/internvideo/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/hoangtv/anaconda3/envs/internvideo/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hoangtv/Desktop/NguyenTN/InternVideo/Downstream/Video-Text-Retrieval/inference.py", line 638, in <module>
    main()
  File "/home/hoangtv/Desktop/NguyenTN/InternVideo/Downstream/Video-Text-Retrieval/inference.py", line 599, in main
    model = init_model(args, device, n_gpu, args.rank)
  File "/home/hoangtv/Desktop/NguyenTN/InternVideo/Downstream/Video-Text-Retrieval/inference.py", line 316, in init_model
    model = CLIP4Clip.from_pretrained(args.cross_model, cache_dir=cache_dir, state_dict=model_state_dict, task_config=args)
  File "/home/hoangtv/Desktop/NguyenTN/InternVideo/Downstream/Video-Text-Retrieval/modules/modeling.py", line 79, in from_pretrained
    model = cls(cross_config, clip_state_dict, *inputs, **kwargs)
  File "/home/hoangtv/Desktop/NguyenTN/InternVideo/Downstream/Video-Text-Retrieval/modules/modeling.py", line 266, in __init__
    self.clip, _ = clip_evl.load(task_config.pretrained_path, t_size=task_config.max_frames, mergeclip=task_config.mergeclip, mergeweight=task_config.mergeweight, clip_state_dict=clip_state_dict)
  File "/home/hoangtv/Desktop/NguyenTN/InternVideo/Downstream/Video-Text-Retrieval/modules/clip_evl/clip.py", line 142, in load
    init_state_dict = torch.load(model_path, map_location='cpu')['state_dict']
  File "/home/hoangtv/anaconda3/envs/internvideo/lib/python3.6/site-packages/torch/jit/_script.py", line 621, in __getitem__
    return self.forward_magic_method("__getitem__", idx)
  File "/home/hoangtv/anaconda3/envs/internvideo/lib/python3.6/site-packages/torch/jit/_script.py", line 614, in forward_magic_method
    raise NotImplementedError()
NotImplementedError

Can anyone provide me with some insights and workarounds? Your help would be much appreciated.
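For reference, the traceback indicates that the downloaded ViT-L-14.pt is a TorchScript (JIT) archive: torch.load() dispatches it to torch.jit.load and returns a ScriptModule, on which the ['state_dict'] indexing in clip.py raises NotImplementedError. A minimal sketch of a loader that handles both a JIT archive and an ordinary pickled checkpoint (the function name is illustrative, not part of the repo):

```python
import torch

def load_clip_state_dict(path):
    """Return a state dict from either a TorchScript archive (as the
    OpenAI CLIP releases are) or a plain pickled checkpoint."""
    try:
        # TorchScript archive: load the scripted module and read its weights
        model = torch.jit.load(path, map_location="cpu")
        return model.state_dict()
    except RuntimeError:
        # Not a TorchScript archive: fall back to an ordinary checkpoint,
        # which may wrap the weights under a 'state_dict' key
        ckpt = torch.load(path, map_location="cpu")
        if isinstance(ckpt, dict) and "state_dict" in ckpt:
            return ckpt["state_dict"]
        return ckpt
```

This mirrors the fallback pattern used in OpenAI's own CLIP loading code, just applied before the `['state_dict']` lookup instead of after it.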

yinanhe commented 10 months ago

What is your Python version? It would be helpful if you could provide more detailed environment information.

trThanhnguyen commented 10 months ago

I was able to load the model successfully using the demo code in Pretrain/Multi-Modalities-Pretraining/InternVideo, and now I can no longer reproduce the issue. Thank you.

1240446371 commented 6 months ago

How did you resolve this issue? I have the same problem. Could you help me? Thanks!

trThanhnguyen commented 6 months ago

Hi @1240446371, my workaround was to use the InternVideo class in "./Pretrain/Multi-Modalities-Pretraining/" and load the "InternVideo-MM-B-16.ckpt" checkpoint instead. To be specific:

import InternVideo
...
model = InternVideo.load_model('my/path/to/ckpt/InternVideo-MM-B-16.ckpt').to(device)

It suited my application at the time, which simply encodes short videos and stores the embeddings for later searching. Good luck.
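For the "store the embeddings for later searching" part, here is a minimal retrieval sketch in plain NumPy, independent of InternVideo (the function names are illustrative): L2-normalize the stored video embeddings once, then rank candidates by cosine similarity against a normalized query embedding.

```python
import numpy as np

def build_index(embeddings):
    """Stack embeddings into a matrix and L2-normalize each row,
    so that a dot product equals cosine similarity."""
    embs = np.asarray(embeddings, dtype=np.float32)
    return embs / np.linalg.norm(embs, axis=1, keepdims=True)

def search(index, query, k=5):
    """Return indices and similarity scores of the k closest entries."""
    q = np.asarray(query, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = index @ q                # cosine similarity per stored embedding
    top = np.argsort(-scores)[:k]    # highest-scoring entries first
    return top, scores[top]
```

The same normalized-dot-product ranking works whether the query embedding comes from encode_text or encode_video, which is what makes this setup usable for text-to-video search.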

mustafahalimeh commented 1 week ago

Could you provide an example of how to load a pretrained model from the action classification category?