facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

The large version will crash out even on Colab A100. Maybe try using the medium version. #433

Closed developeranalyser closed 2 months ago

developeranalyser commented 2 months ago
          The large version will crash out even on Colab A100. Maybe trying using the medium version.

_Originally posted by @zrthxn in https://github.com/facebookresearch/seamless_communication/issues/421#issuecomment-2070129853_

Does anyone have a better idea? I want to use the V2 features and functions for just 3 languages. Can you kindly guide me on how to do this?

v-tuenv commented 2 months ago

If you use the ASR pipeline or the S2T pipeline, you will hit the OOM. The paper mentions streaming, but the implementation does not actually stream: the model accumulates all frames of speech seen so far each time a new chunk arrives. For example: first chunk (320 ms) -> the model uses 320 ms for the prediction; second chunk (320 ms) -> the model uses the previous 320 ms plus the current 320 ms, i.e. prediction is 2x slower than for the first chunk, and so on. So there is no upper bound on CUDA memory that keeps the model safe. You should break the audio into small chunks and run inference on each separately. I didn't look at the T2S part of the model, but it likely has the same issue.
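The chunking workaround above can be sketched roughly as below. This is a minimal illustration, not the seamless_communication API: `transcribe_chunk` is a hypothetical callable standing in for whatever per-chunk predict call you use (e.g. a SeamlessM4T inference wrapper), and the waveform is treated as a plain 1-D sequence of samples.

```python
def split_into_chunks(waveform, sample_rate=16000, chunk_seconds=10.0):
    """Split a 1-D waveform into fixed-length chunks so that each
    inference call sees a bounded amount of audio, which keeps GPU
    memory bounded instead of growing with the audio length."""
    chunk_len = int(sample_rate * chunk_seconds)
    return [waveform[i:i + chunk_len] for i in range(0, len(waveform), chunk_len)]

def transcribe_long_audio(waveform, transcribe_chunk,
                          sample_rate=16000, chunk_seconds=10.0):
    """Run a per-chunk inference callable over the audio independently
    per chunk and join the text results.

    `transcribe_chunk` is a placeholder (an assumption, not a real
    library function) for e.g. a SeamlessM4T ASR/S2T predict call.
    """
    chunks = split_into_chunks(waveform, sample_rate, chunk_seconds)
    pieces = [transcribe_chunk(chunk) for chunk in chunks]
    return " ".join(p for p in pieces if p)
```

Note that naive fixed-size splitting can cut words at chunk boundaries; in practice you may want to split on silence or overlap chunks slightly, but the memory argument is the same.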