facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

Real-Time Implementation of Seamless Expressive #310

Open HardikJain02 opened 10 months ago

HardikJain02 commented 10 months ago

Can anyone help me with how to implement SeamlessExpressive in real time with no latency? Please also suggest some code references for the implementation. I am also interested in learning how to make this kind of technology run in real time with the lowest possible latency. How does one know that a given latency is the minimum achievable?

What's the latency and accuracy difference between Direct Speech-to-Speech Translation & Speech-to-Text followed by Text-to-Speech Translation?

My main goal is to implement the best possible real-time speech-to-speech translation. Suggestions other than SeamlessExpressive are welcome too.

annasun28 commented 10 months ago

@HardikJain02

> how to implement seamless expressive in real-time with no latency?

SeamlessStreaming is the real-time model, and "Seamless" is the unified model that combines SeamlessStreaming and SeamlessExpressive. You can check out https://huggingface.co/spaces/facebook/seamless-streaming/blob/main/README.md for an example implementation of the streaming demo on Hugging Face.

You can also check out and run the colab notebook at https://fb.me/mt-neurips for an example of standalone inference (which simulates passing chunks of input audio to the streaming model).
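To illustrate what "simulating chunks of input audio" means, here is a minimal sketch in plain Python. It is not the notebook's code and does not use the seamless_communication API: `iter_chunks`, `simulate_streaming`, and the `process_chunk` callable are hypothetical names, and `process_chunk` only stands in for whatever streaming model call you end up using. The 16 kHz sample rate and 320 ms chunk size in the usage note are likewise just assumed values.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(
    samples: Sequence[float], sample_rate: int, chunk_ms: int
) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-duration chunks; the last one may be shorter."""
    # Number of samples per chunk (at least 1 to avoid a zero step).
    chunk_size = max(1, sample_rate * chunk_ms // 1000)
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def simulate_streaming(
    samples: Sequence[float],
    sample_rate: int,
    chunk_ms: int,
    process_chunk: Callable[[Sequence[float]], T],
) -> List[T]:
    """Feed a complete recording to `process_chunk` one chunk at a time.

    `process_chunk` is a hypothetical stand-in for a streaming model step.
    """
    return [process_chunk(c) for c in iter_chunks(samples, sample_rate, chunk_ms)]
```

For example, `simulate_streaming([0.0] * 16000, 16000, 320, len)` splits one second of silence into three 5120-sample chunks plus a final 640-sample chunk. A smaller `chunk_ms` lets the model start emitting output sooner but gives it less context per step, so the chunk size is one of the knobs that trades latency against quality.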

> What's the latency and accuracy difference between Direct Speech-to-Speech Translation & Speech-to-Text followed by Text-to-Speech Translation?

We don't directly compare a cascaded S2T + TTS system to a direct S2ST system on latency in the paper, but in earlier experiments we found that a baseline cascaded system had higher inference delays and worse quality, which degraded both the naturalness of the streaming S2ST output and the overall system latency.
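If you want to check this kind of difference on your own setup, one rough approach is to time each stage of a pipeline and compare the totals. The sketch below is an assumption-laden illustration, not the methodology from the paper: `time_stages` is a hypothetical helper, and the stage functions you pass in are placeholders for your own S2T, TTS, or S2ST calls.

```python
import time
from typing import Any, Callable, Dict, Sequence, Tuple


def time_stages(
    stages: Sequence[Tuple[str, Callable[[Any], Any]]], x: Any
) -> Tuple[Any, Dict[str, float]]:
    """Run `x` through each named stage in order and record wall-clock seconds.

    The stage callables are hypothetical placeholders for real model calls.
    """
    timings: Dict[str, float] = {}
    for name, fn in stages:
        start = time.perf_counter()
        x = fn(x)  # output of one stage is the input of the next
        timings[name] = time.perf_counter() - start
    return x, timings


def total_delay(timings: Dict[str, float]) -> float:
    """Sum of per-stage delays for a sequential pipeline."""
    return sum(timings.values())
```

A cascaded system would be passed as two stages, for example `[("s2t", s2t_fn), ("tts", tts_fn)]`, and a direct system as one, `[("s2st", s2st_fn)]`. Note that this only captures computation time per call. In a streaming setting, end-to-end latency also includes the time the model waits for more input before emitting output, which this harness does not measure.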