facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

Throughput for autoregressive vs non-autoregressive #416

Open rubencart opened 5 months ago

rubencart commented 5 months ago

The paper states the following: "While the nonautoregressive model throughput is bounded to ∼ 2.8 samples/second for batch sizes bigger than 64, the autoregressive model throughput is linear in batch size, only limited by the GPU memory".

Could you explain why? Why would the throughput for the AR model be linear w.r.t. batch size, but the throughput for the non-AR model more or less constant?

I would expect more or less such a relation between throughput and sequence length, but I don't immediately see what causes this connection between throughput and batch size.

yukara-ikemiya commented 3 months ago

First of all, throughput in this case means "how many audio samples the model can generate per second". So, naively, if we double the batch size, the model can generate twice as many samples in parallel, which is what happens in the AR case in Figure 2.
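As a toy illustration of this point (the numbers below are hypothetical, not measurements): if the wall-clock time to run one generation pass is roughly independent of batch size (the GPU is under-utilized at small batches), then throughput in samples/second grows linearly with batch size until memory runs out.

```python
def throughput(batch_size: int, seconds_per_batch: float) -> float:
    """Samples generated per second = batch size / time to generate the batch."""
    return batch_size / seconds_per_batch

# Hypothetical: 10 s to generate one batch of clips, regardless of batch size.
t = 10.0
assert throughput(64, t) == 2 * throughput(32, t)  # linear scaling in batch size
```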

The difference in inference speed (throughput) between AR and non-AR models arises for the following reasons.

  1. AR models can use the key-value caching technique at inference time. Thanks to this, the model only has to compute the key/value for the current position (a single token) at each step, so the per-step computation cost is roughly linear in the sequence length.
  2. In contrast, non-AR models have to compute attention over the full sequence at every inference step, which is much more computationally expensive. I think this is why the throughput in the non-AR case is bounded even when using a larger batch size.
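The two points above can be sketched in NumPy (a minimal single-head toy, not Audiocraft's implementation): the cached AR step only computes q/k/v for the newest token and attends over the accumulated cache, while the full-sequence path recomputes the whole T x T attention. The final cached output matches the last row of full causal attention.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 16  # toy head dimension and sequence length

def full_attention(X, Wq, Wk, Wv):
    """Causal attention over the full sequence: O(T^2 * d) per call."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((len(X), len(X)), dtype=bool), k=1)
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def cached_step(x_t, cache_K, cache_V, Wq, Wk, Wv):
    """One AR step with a KV cache: only the new token's q/k/v are computed."""
    q = x_t @ Wq
    cache_K.append(x_t @ Wk)  # append this position's key/value to the cache
    cache_V.append(x_t @ Wv)
    K, V = np.stack(cache_K), np.stack(cache_V)
    scores = K @ q / np.sqrt(d)            # attend over cached positions only
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
X = rng.standard_normal((T, d))

cache_K, cache_V = [], []
for t in range(T):
    out_t = cached_step(X[t], cache_K, cache_V, Wq, Wk, Wv)

# The cached AR result at the last step equals the last row of full attention.
assert np.allclose(out_t, full_attention(X, Wq, Wk, Wv)[-1])
```

The point of the sketch: the cached step touches t keys at step t (linear work per token), whereas the full-sequence path pays the quadratic cost again on every call, which is why it saturates the GPU at much smaller batch sizes.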