facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

The large version will crash out even on Colab A100. Maybe try using the medium version. #433

Closed developeranalyser closed 2 months ago

developeranalyser commented 2 months ago
          The large version will crash out even on Colab A100. Maybe trying using the medium version.

_Originally posted by @zrthxn in https://github.com/facebookresearch/seamless_communication/issues/421#issuecomment-2070129853_

Does anyone have a better idea? I want to use the V2 features and functions for just 3 languages. Can you kindly guide me on how to do this?

v-tuenv commented 2 months ago

If you use the ASR pipeline or the S2T pipeline, you will hit the OOM. The paper mentions streaming, but the implementation does not actually stream: the model accumulates all frames of speech seen so far each time a new chunk arrives. For example: first chunk (320 ms) -> the model uses 320 ms for the prediction; second chunk (320 ms) -> the model uses the previous 320 ms plus the current 320 ms, i.e. prediction is 2x slower than for the first chunk, and so on. So there is no upper bound on CUDA memory that keeps the model safe. You should break the audio into small chunks and run inference on each separately. I didn't look at the T2S part of the model, but it likely has the same issue.
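The chunking workaround above can be sketched roughly as below. This is a minimal illustration, not the seamless_communication API: `transcribe_chunk` is a hypothetical callable standing in for whatever per-chunk predict call you use (e.g. a SeamlessM4T inference wrapper), and the waveform is treated as a plain 1-D sequence of samples.

```python
def split_into_chunks(waveform, sample_rate=16000, chunk_seconds=10.0):
    """Split a 1-D waveform into fixed-length chunks so that each
    inference call sees a bounded amount of audio, which keeps GPU
    memory bounded instead of growing with the audio length."""
    chunk_len = int(sample_rate * chunk_seconds)
    return [waveform[i:i + chunk_len] for i in range(0, len(waveform), chunk_len)]

def transcribe_long_audio(waveform, transcribe_chunk,
                          sample_rate=16000, chunk_seconds=10.0):
    """Run a per-chunk inference callable over the audio independently
    per chunk and join the text results.

    `transcribe_chunk` is a placeholder (an assumption, not a real
    library function) for e.g. a SeamlessM4T ASR/S2T predict call.
    """
    chunks = split_into_chunks(waveform, sample_rate, chunk_seconds)
    pieces = [transcribe_chunk(chunk) for chunk in chunks]
    return " ".join(p for p in pieces if p)
```

Note that naive fixed-size splitting can cut words at chunk boundaries; in practice you may want to split on silence or overlap chunks slightly, but the memory argument is the same.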