YuanGongND / ltu

Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".

Maximum Length for LTU-AS Audio Input #24

Open dingdongwang opened 9 months ago

dingdongwang commented 9 months ago

Hi, may I ask what the maximum allowable length is for audio input? Would a 1-minute WAV file be within the acceptable range?

Thank you!

YuanGongND commented 7 months ago

Hi there,

It really depends on your GPU, but in general, 1 minute would be fine.

Our code supports 10 seconds of audio (hard-coded) at 3.2 Hz, i.e. 32 audio tokens. On top of that we have about 100-200 text tokens, so roughly 200 tokens in total.

For 1 minute, you would need 192 audio tokens (60 s x 3.2 Hz). Adding the 100-200 text tokens gives about 400 tokens in total, which roughly doubles our cost. You would also need some engineering effort to change the hard-coded 10-second limit.
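
The token arithmetic above can be sketched in a few lines of Python. This is only an illustration of the numbers in this comment, not code from the LTU repository: the function names `audio_tokens` and `total_tokens` are hypothetical, and the 3.2 Hz rate and 100-200 text-token range are taken from the figures quoted here.

```python
# Illustrative sketch only; these helpers are hypothetical and are not part of the LTU code.

AUDIO_TOKEN_RATE_HZ = 3.2  # audio tokens per second, as stated in this thread


def audio_tokens(duration_s: float, rate_hz: float = AUDIO_TOKEN_RATE_HZ) -> int:
    """Number of audio tokens for a clip of the given duration in seconds."""
    return round(duration_s * rate_hz)


def total_tokens(duration_s: float, text_tokens: int) -> int:
    """Audio tokens plus text tokens for one input."""
    return audio_tokens(duration_s) + text_tokens


if __name__ == "__main__":
    for duration in (10, 60):
        low = total_tokens(duration, 100)   # lower end of the text-token range
        high = total_tokens(duration, 200)  # upper end of the text-token range
        print(f"{duration:>3} s: {audio_tokens(duration)} audio tokens, "
              f"{low}-{high} tokens in total")
```

With these assumptions, a 10-second clip gives 32 audio tokens and 132-232 tokens in total, while a 60-second clip gives 192 audio tokens and 292-392 in total, which is where the "roughly double" estimate comes from.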

-Yuan