Hello,

I am facing some challenges while trying to convert the Video-LLaVA model (LanguageBind/Video-LLaVA-7B-hf) to ONNX, and I would appreciate some assistance. I'm not sure whether I'm going about it the right way.
# What I've Tried:
1. Conversion using torch.jit.script:
To handle the if-else conditions inside the forward function, I first attempted to convert the model with torch.jit.script.
However, scripting failed due to an issue with the ClipConfig object in the CLIP modeling code; a minimal reproduction of the attempt is below.
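For context, this is roughly what the failing attempt looks like (a minimal sketch; the class name follows the current transformers API, and the failure is the one described above):

```python
import torch
from transformers import VideoLlavaForConditionalGeneration

model = VideoLlavaForConditionalGeneration.from_pretrained(
    "LanguageBind/Video-LLaVA-7B-hf"
)
model.eval()

# Fails inside the CLIP modeling code with an error about the
# ClipConfig object (the issue described above).
scripted = torch.jit.script(model)
```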
2. Conversion using torch.jit.trace:
I split the forward function according to those conditions and created two separate ONNX models with torch.jit.trace.
The split is based on whether pixel_value_image_values (the video input) is None:
- a model for when pixel_value_image_values is not None
- a model for when pixel_value_image_values is None

While I managed to handle the inputs through debugging, I am uncertain how to correctly convert the output (VideoLlavaCausalLMOutputWithPast) to ONNX; a sketch of the export setup I have in mind follows below.
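One pattern I am considering (a hedged sketch, not a verified solution) is to wrap the model in a small torch.nn.Module whose forward unpacks VideoLlavaCausalLMOutputWithPast into plain tensors before export, since an ONNX graph can only carry tensors. The wrapper name, input names, and dummy shapes below are illustrative assumptions; the keyword pixel_values_videos follows the current transformers signature rather than my pixel_value_image_values naming:

```python
import torch
from transformers import VideoLlavaForConditionalGeneration


class VideoBranchWrapper(torch.nn.Module):
    """Hypothetical wrapper for the "video input is not None" branch."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask, pixel_values_videos):
        out = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pixel_values_videos=pixel_values_videos,
            return_dict=True,
        )
        # Return plain tensors instead of VideoLlavaCausalLMOutputWithPast.
        return out.logits


model = VideoLlavaForConditionalGeneration.from_pretrained(
    "LanguageBind/Video-LLaVA-7B-hf"
).eval()
wrapper = VideoBranchWrapper(model)

# Illustrative dummy shapes (batch=1, 8 frames, 224x224). Real input_ids
# must contain the model's video placeholder tokens so that the video
# features can be merged into the sequence.
input_ids = torch.ones(1, 16, dtype=torch.long)
attention_mask = torch.ones(1, 16, dtype=torch.long)
pixel_values_videos = torch.randn(1, 8, 3, 224, 224)

torch.onnx.export(
    wrapper,
    (input_ids, attention_mask, pixel_values_videos),
    "video_llava_video_branch.onnx",
    input_names=["input_ids", "attention_mask", "pixel_values_videos"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {1: "seq"}, "attention_mask": {1: "seq"}},
    opset_version=17,
)
```

The text-only branch would get an analogous wrapper without pixel_values_videos, and the None/not-None decision could then live in plain Python at inference time.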
# Questions:
1. Is there a way to resolve the issue with the ClipConfig object? I would like to know how to use torch.jit.script successfully in this context.
2. After creating the two ONNX models with torch.jit.trace, what is the best approach to converting the VideoLlavaCausalLMOutputWithPast output to ONNX? I would appreciate any guidance or examples.
3. Has anyone successfully converted Video-LLaVA or a similar model to ONNX? If so, could you share your approach or any code snippets that might help?
4. Are there any existing tools or scripts specifically tailored to converting multi-modal models like Video-LLaVA to ONNX?
Thank you in advance for any help or pointers you can provide!
Best regards,