facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

Throughput for autoregressive vs non-autoregressive #416

Open rubencart opened 5 months ago

rubencart commented 5 months ago

The paper states the following: "While the nonautoregressive model throughput is bounded to ∼ 2.8 samples/second for batch sizes bigger than 64, the autoregressive model throughput is linear in batch size, only limited by the GPU memory".

Could you explain why? Why would the throughput for the AR model be linear w.r.t. batch size, but the throughput for the non-AR model more or less constant?

I would expect more or less such a relation between throughput and sequence length, but I don't immediately see what causes this connection between throughput and batch size.

yukara-ikemiya commented 3 months ago

First of all, throughput in this case means "how many audio samples the model can generate per second". So, naively, if we double the batch size, the model can generate twice as many samples in parallel, which is what happens in the AR case in Figure 2.
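As a toy illustration of this point (the numbers below are hypothetical, not measurements): if the wall-clock time to run one generation pass is roughly independent of batch size (the GPU is under-utilized at small batches), then throughput in samples/second grows linearly with batch size until memory runs out.

```python
def throughput(batch_size: int, seconds_per_batch: float) -> float:
    """Samples generated per second = batch size / time to generate the batch."""
    return batch_size / seconds_per_batch

# Hypothetical: 10 s to generate one batch of clips, regardless of batch size.
t = 10.0
assert throughput(64, t) == 2 * throughput(32, t)  # linear scaling in batch size
```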

The difference in inference speed (throughput) between AR and non-AR models arises for the following reasons.

  1. AR models can use the key-value caching technique at inference time. Thanks to this, the model only has to compute the key/value for the current position (a single token) at each step, so the per-step computation cost is roughly linear in the sequence length.
  2. In contrast, non-AR models have to compute attention over the full sequence at every inference step, which is much more computationally expensive. I think this is why the throughput in the non-AR case is bounded even when using a larger batch size.
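The two points above can be sketched in NumPy (a minimal single-head toy, not Audiocraft's implementation): the cached AR step only computes q/k/v for the newest token and attends over the accumulated cache, while the full-sequence path recomputes the whole T x T attention. The final cached output matches the last row of full causal attention.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 16  # toy head dimension and sequence length

def full_attention(X, Wq, Wk, Wv):
    """Causal attention over the full sequence: O(T^2 * d) per call."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((len(X), len(X)), dtype=bool), k=1)
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def cached_step(x_t, cache_K, cache_V, Wq, Wk, Wv):
    """One AR step with a KV cache: only the new token's q/k/v are computed."""
    q = x_t @ Wq
    cache_K.append(x_t @ Wk)  # append this position's key/value to the cache
    cache_V.append(x_t @ Wv)
    K, V = np.stack(cache_K), np.stack(cache_V)
    scores = K @ q / np.sqrt(d)            # attend over cached positions only
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
X = rng.standard_normal((T, d))

cache_K, cache_V = [], []
for t in range(T):
    out_t = cached_step(X[t], cache_K, cache_V, Wq, Wk, Wv)

# The cached AR result at the last step equals the last row of full attention.
assert np.allclose(out_t, full_attention(X, Wq, Wk, Wv)[-1])
```

The point of the sketch: the cached step touches t keys at step t (linear work per token), whereas the full-sequence path pays the quadratic cost again on every call, which is why it saturates the GPU at much smaller batch sizes.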