facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

Never understood the architecture of the model, can someone explain it to me? #410

Closed tanggang1997 closed 5 months ago

tanggang1997 commented 5 months ago

My current understanding is that the model is a text encoder + transformer + EnCodec audio codec with a fairly involved codebook interleaving scheme. The core idea is that the transformer converts the encoded text into audio codebook tokens, which are then decoded back into audio by the EnCodec decoder.

DEBIHOOD commented 5 months ago

First of all, MusicGen uses EnCodec for "audio tokenization", so we first need a trained EnCodec model. The authors of the MusicGen paper trained an EnCodec autoencoder that takes 1 second of 32kHz audio as input and compresses it down to 4 parallel streams (codebooks) of 50 tokens each; the tokens are selected from a dictionary of size 2048 (I believe the dictionary is shared between the 4 streams). The decoder part of EnCodec can then reconstruct the raw waveform from these tokens, so we get that same 1 second of audio back in the end. [Image: 1 second of audio after compression with the EnCodec 32kHz encoder: 4 streams of 50 tokens, each token an ID between 1 and 2048, colored because it looks nice that way.]
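To make the shapes concrete, here is a minimal sketch of tokenizing audio with the pretrained 32kHz EnCodec model shipped with Audiocraft. The `CompressionModel.get_pretrained` call and the exact return values follow my reading of the library and may differ slightly between versions.

```python
import torch
from audiocraft.models import CompressionModel

# Load the pretrained 32kHz EnCodec used by MusicGen (checkpoint name as used in the Audiocraft docs).
compression_model = CompressionModel.get_pretrained('facebook/encodec_32khz')
compression_model.eval()

# 1 second of (dummy) mono audio at 32kHz: shape [batch, channels, samples].
wav = torch.randn(1, 1, 32000)

with torch.no_grad():
    codes, scale = compression_model.encode(wav)  # codes: [B, K, T] integer token IDs
print(codes.shape)  # expected: torch.Size([1, 4, 50]) -> 4 codebooks x 50 tokens per second

# Round-trip back to a waveform with the decoder part.
with torch.no_grad():
    reconstructed = compression_model.decode(codes, scale)
```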


Now we just take our music dataset and encode it into these tokens, i.e. "tokenize" the audio, using the EnCodec model we just trained. Then we train an almost-vanilla decoder-only transformer on these tokens, with the task "given all of the previous tokens, what is the next one?". If the model is text-conditioned (all released MusicGen models are), we also run the text description of the song through a T5 model, take the resulting embeddings, and feed them into the transformer via cross-attention. I say almost-vanilla because we somehow also need to handle all 4 streams, and a vanilla transformer is "1-dimensional" in that sense (I hope my intuition here is accurate). What the authors used for this is called interleaving, and the paper skips the details of how it works inside the transformer, so I have no idea how it does what it does. If anyone could explain exactly how interleaving works, that would be awesome! The pattern they used for interleaving the 4 codebook streams is the delay pattern; see the sketch below and the paper for more details.
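For what it's worth, the layout of the delay pattern itself (how the 4 streams are shifted on the time axis before they reach the transformer) is easy to sketch. Audiocraft's real implementation lives in `audiocraft/modules/codebooks_patterns.py`; the snippet below is only a simplified illustration of the layout, not that code, and the special-token value is just a placeholder.

```python
import torch

def delay_pattern(codes: torch.Tensor, special_token: int = 2048) -> torch.Tensor:
    """Lay out [K, T] codebook tokens with the 'delay' pattern.

    Codebook k is shifted k steps to the right, and the holes are filled
    with a special token, so at step t the model sees codebook 0 for frame t,
    codebook 1 for frame t-1, and so on.
    """
    K, T = codes.shape
    out = torch.full((K, T + K - 1), special_token, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out

# Toy example with K=4 codebooks and T=5 frames (the frame index stands in for the token ID).
codes = torch.arange(5).repeat(4, 1)
print(delay_pattern(codes))
# tensor([[   0,    1,    2,    3,    4, 2048, 2048, 2048],
#         [2048,    0,    1,    2,    3,    4, 2048, 2048],
#         [2048, 2048,    0,    1,    2,    3,    4, 2048],
#         [2048, 2048, 2048,    0,    1,    2,    3,    4]])
```

Roughly speaking, at each position the transformer predicts all K tokens of that column in parallel (one output head per codebook), which is how the delay pattern lets MusicGen generate the 4 streams in a single autoregressive pass.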


After training the transformer, we just run it autoregressively at inference time and decode the generated tokens with the EnCodec decoder.
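In practice, the released checkpoints wrap this whole pipeline, so inference looks roughly like the snippet below (API names as shown in the Audiocraft README; the model size, prompt, and duration are just example choices).

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load a pretrained text-conditioned MusicGen checkpoint.
model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=8)  # seconds of audio to generate

# The transformer generates EnCodec tokens conditioned on the T5-encoded text;
# generate() decodes them back to a waveform internally.
wav = model.generate(['upbeat acoustic folk with hand claps'])  # [B, C, samples]

for idx, one_wav in enumerate(wav):
    # Save with loudness normalization, as suggested in the README.
    audio_write(f'generated_{idx}', one_wav.cpu(), model.sample_rate, strategy='loudness')
```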

tanggang1997 commented 5 months ago

Thank you very much. So the core is a vanilla transformer? I've always found the paper unclear and never understood what this transformer is all about, and the nested Dora setup in the code makes it look like a pain in the arse to me.

DEBIHOOD commented 5 months ago

Yeah, it's pretty much a vanilla transformer trained on tokenized audio. You can also look up OpenAI's Jukebox blog post, where they explain how this all works; the only major difference is that Jukebox first generates coarse audio and then upsamples it with separate transformer models, while MusicGen generates everything in one go.

tanggang1997 commented 5 months ago

Ok, thank you!!!