facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License
20.17k stars 2.01k forks

Encodec low-latency streamable setup configuration #411

Open kawukawai opened 5 months ago

kawukawai commented 5 months ago

The EnCodec paper describes both a non-streamable and a streamable inference setup. The streamable one is described as follows:

Streamable. For the streamable setup, all padding is put before the first time step. For a transposed convolution with stride s, we output the s first time steps, and keep the remaining s steps in memory for completion when the next frame is available, or discarding it at the end of a stream. Thanks to this padding scheme, the model can output 320 samples (13 ms) as soon as the first 320 samples (13 ms) are received. We replace the layer normalization with statistics computed over the time dimension with weight normalization (Salimans & Kingma, 2016), as the former is ill-suited for a streaming setup ...
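To make the quoted scheme concrete, here is a minimal pure-Python sketch of a streaming transposed convolution. It is a toy illustration and not the audiocraft implementation: the names `transposed_conv_offline` and `StreamingTransposedConv` are hypothetical, it handles a single channel only, and it assumes a kernel size of twice the stride so that exactly `stride` partial sums stay in memory. The latency line assumes the 24 kHz model, which the quoted 13 ms figure implies.

```python
def transposed_conv_offline(x, w, stride):
    """Full 1-D transposed convolution (single channel, no bias)."""
    k = len(w)
    y = [0.0] * ((len(x) - 1) * stride + k)
    for t, xt in enumerate(x):
        for j in range(k):
            y[t * stride + j] += xt * w[j]
    return y


class StreamingTransposedConv:
    """Emits `stride` samples per input step and keeps the overlap in memory."""

    def __init__(self, w, stride):
        self.w = list(w)
        self.stride = stride
        # Partial sums that still need the contribution of the next input step.
        self.tail = [0.0] * (len(w) - stride)

    def step(self, xt):
        acc = self.tail + [0.0] * self.stride  # length == len(w)
        for j, wj in enumerate(self.w):
            acc[j] += xt * wj
        # The first `stride` samples are final; the rest wait for the next step.
        out, self.tail = acc[: self.stride], acc[self.stride:]
        return out


stride = 2
w = [1.0, 2.0, 3.0, 4.0]  # kernel size 2 * stride (assumption, see above)
x = [1.0, -1.0, 0.5]

layer = StreamingTransposedConv(w, stride)
streamed = [s for xt in x for s in layer.step(xt)]
offline = transposed_conv_offline(x, w, stride)

# The streamed output equals the offline output with the trailing overlap
# dropped, which corresponds to discarding `tail` at the end of the stream.
assert streamed == offline[: len(x) * stride]

# Latency for one 320-sample frame, assuming a 24 kHz sample rate.
latency_ms = 320 / 24_000 * 1000  # about 13.3 ms
```

Each call to `step` returns finished samples immediately, so the per-layer output never has to be revised once emitted. Whatever is left in `layer.tail` after the last input is simply dropped.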

To the best of my knowledge, EnCodec in this repo operates by default in the non-streamable setup with overlapping chunks. It's not clear to me (a) whether the streamable setup is implemented and (b) how to run the streamable setup without audible discontinuities between frames.
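To illustrate what "no discontinuities" would mean for (b), here is a toy sketch of a causal convolution processed frame by frame. It is plain Python with hypothetical names (`causal_conv_offline`, `StreamingCausalConv`), not the audiocraft API, and it assumes a single channel with stride 1. The idea is that carrying the last `k - 1` input samples across frames makes the chunked output identical to the one-shot output, so no overlap or crossfade is needed.

```python
def causal_conv_offline(x, w):
    """Causal 1-D convolution (cross-correlation, as in most DL libraries).

    All padding, len(w) - 1 zeros, is placed before the first time step.
    """
    k = len(w)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


class StreamingCausalConv:
    """Processes frames of any size, carrying the last k - 1 inputs as context."""

    def __init__(self, w):
        self.w = list(w)
        # The initial zeros play the role of the left padding.
        self.context = [0.0] * (len(w) - 1)

    def process(self, frame):
        k = len(self.w)
        padded = self.context + list(frame)
        out = [
            sum(self.w[j] * padded[t + j] for j in range(k))
            for t in range(len(frame))
        ]
        # Keep the most recent k - 1 inputs for the next frame.
        self.context = padded[len(padded) - (k - 1):]
        return out


w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
frames = [x[0:3], x[3:6], x[6:7]]  # uneven frame sizes on purpose

conv = StreamingCausalConv(w)
chunked = [y for frame in frames for y in conv.process(frame)]

# Frame boundaries leave no trace in the output.
assert chunked == causal_conv_offline(x, w)
```

Whether the released checkpoints can be driven this way end to end is exactly what is being asked; the sketch only shows the property a streamable layer would need to have.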

Is the streamable model available?