RenShuhuai-Andy / TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
https://arxiv.org/abs/2312.02051
BSD 3-Clause "New" or "Revised" License

Inference with audio #29

Closed by lakshya-frontera 2 months ago

lakshya-frontera commented 4 months ago

Thank you for this amazing work.

I have been trying to run the inference notebook (demo.ipynb), but none of its functions accepts an ASR transcript along with the video. Could you point me to a function that also takes the ASR transcript for answer generation, or provide such a script?

tiesanguaixia commented 4 months ago

Same question.

RenShuhuai-Andy commented 4 months ago

Hi, thanks for your interest.

We currently have code for ASR available for pre-processing purposes (see https://github.com/RenShuhuai-Andy/TimeChat/blob/master/docs/DATA.md#automatic-speech-transcription).

I agree that it would be beneficial to integrate this into a function for easier use. I plan to add this feature when I have some free time. Alternatively, if you're interested, you could contribute to adding this feature. Let me know if you're interested!
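Until such a function exists, one possible workaround is to fold the transcript into the text prompt before passing it to the demo notebook. The following is only a minimal sketch under stated assumptions: the segment layout `(start_sec, end_sec, text)`, the helper names `format_transcript` and `build_prompt_with_asr`, and the prompt wording are all hypothetical and not part of the TimeChat code base. Whether the model makes good use of a transcript supplied this way has not been verified here.

```python
from typing import Iterable, Tuple

# Hypothetical segment layout: (start_sec, end_sec, text).
# Adapt this to whatever your ASR pre-processing step actually emits.
Segment = Tuple[float, float, str]


def format_transcript(segments: Iterable[Segment]) -> str:
    """Render ASR segments as one timestamped line per non-empty segment."""
    lines = []
    for start, end, text in segments:
        text = text.strip()
        if not text:
            continue  # skip segments with no recognized speech
        lines.append(f"{start:.1f}s - {end:.1f}s: {text}")
    return "\n".join(lines)


def build_prompt_with_asr(question: str, segments: Iterable[Segment]) -> str:
    """Prepend the formatted transcript to the user question.

    The "Transcribed speech:" header is an assumed wording, not a
    format defined by TimeChat.
    """
    transcript = format_transcript(segments)
    if not transcript:
        return question  # nothing to add; fall back to the plain question
    return f"Transcribed speech:\n{transcript}\n\n{question}"
```

The resulting string can then be used wherever the notebook expects the user's text query, for example `build_prompt_with_asr("What is the speaker demonstrating?", segments)`.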