PKU-YuanGroup / Video-LLaVA

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0

Issues with Converting the video-llava Model to ONNX #190

Closed Ark1a closed 2 months ago

Ark1a commented 3 months ago

Hello,

I am facing some challenges while trying to convert the Video-LLaVA model (LanguageBind/Video-LLaVA-7B-hf) to ONNX, and I would appreciate some assistance. I'm not sure whether I'm approaching it the right way.

# What I've Tried:

  1. Conversion using torch.jit.script — this fails with an error related to the ClipConfig object.

  2. Conversion using torch.jit.trace — this produced two traced models, but their output is a VideoLlavaCausalLMOutputWithPast object.
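To illustrate the trace-based attempt, here is a minimal sketch of the kind of export flow I mean. The module and input here are illustrative stand-ins, not the real Video-LLaVA sub-modules or inputs:

```python
# Minimal sketch of a trace-based export attempt; ToyEncoder and the dummy
# input are hypothetical stand-ins for the real model components.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for a sub-module of the real model."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 8)

    def forward(self, pixel_values):
        return self.linear(pixel_values)

model = ToyEncoder().eval()
dummy = torch.randn(2, 16)  # placeholder for the real preprocessed inputs

# Tracing records the graph from one concrete run; scripting, by contrast,
# is where the ClipConfig-related failure shows up for me.
traced = torch.jit.trace(model, dummy)
print(tuple(traced(dummy).shape))  # (2, 8)

# The traced module could then be passed to torch.onnx.export, e.g.:
# torch.onnx.export(traced, dummy, "toy.onnx",
#                   input_names=["pixel_values"], output_names=["features"])
```

This works for a simple module; the difficulty is that the real model's forward returns a dataclass rather than plain tensors.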

# Questions:

  1. Is there a way to resolve the issue with the ClipConfig object? I would like to know how to successfully use torch.jit.script in this context.

  2. After creating the two ONNX models with torch.jit.trace, what is the best approach to convert the VideoLlavaCausalLMOutputWithPast output to ONNX? I would appreciate any guidance or examples that could help.

  3. Has anyone successfully converted Video-LLaVA or a similar model to ONNX? If so, could you share your approach or any code snippets that might help?

  4. Are there any existing tools or scripts specifically tailored to converting multi-modal models like Video-LLaVA to ONNX?
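For question 2, the workaround I am considering is wrapping the model so that forward() returns plain tensors instead of a VideoLlavaCausalLMOutputWithPast dataclass, which torch.onnx.export cannot serialize directly. The inner model and names below are hypothetical stand-ins:

```python
# Sketch of an export wrapper that unwraps a structured model output into
# plain tensors before ONNX export. The inner nn.Embedding is a hypothetical
# stand-in for the real causal-LM model.
import torch
import torch.nn as nn

class ExportWrapper(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids):
        out = self.model(input_ids)
        # For a real HF model one would call it with return_dict=False so the
        # forward returns a plain tuple; here we just unwrap defensively.
        return out[0] if isinstance(out, tuple) else out

inner = nn.Embedding(10, 4)  # stand-in for the real model
wrapped = ExportWrapper(inner).eval()
ids = torch.tensor([[1, 2, 3]])
print(tuple(wrapped(ids).shape))  # (1, 3, 4)
```

Is a wrapper like this the right direction for the dataclass output, or is there a cleaner mechanism I am missing?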

Thank you in advance for any help or pointers you can provide!

Best regards,