Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
I have three discrepancies between what is described in the paper versus what I see in the code/blog posts.
The recent publication of MMD included a figure showing a concatenation operation between the audio embeddings and the output of the cross attention. I cannot find this operation in the code for the LM.
There is also no linear layer after the cross-attention block that I can see in the code. A sketch of the path I understand the figure to describe is below.
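To make the question concrete, here is a minimal PyTorch sketch of what the figure appears to show. All names here are mine, not Audiocraft's, and this is not a claim about the actual implementation:

```python
# A minimal sketch of what I understand the figure to describe.
# `ConcatCrossAttentionBlock` and all names below are hypothetical.
import torch
import torch.nn as nn

class ConcatCrossAttentionBlock(nn.Module):
    """Cross attention, then concat with the audio stream, then a linear
    projection back to the model dimension, as the figure appears to show."""
    def __init__(self, dim: int = 1024, num_heads: int = 16):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # The linear layer I cannot find in the code: projects the
        # concatenated [audio ; cross-attn output] back down to `dim`.
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # audio: (B, T, dim) token embeddings; cond: (B, S, dim) text conditioning
        attn_out, _ = self.cross_attn(audio, cond, cond)
        # Per the figure: concatenation followed by a linear layer.
        return self.proj(torch.cat([audio, attn_out], dim=-1))

# By contrast, the code (as far as I can tell) just does the usual
# additive residual, audio + cross_attn(audio, cond, cond), with no
# concatenation and no extra linear layer.
```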
The config for the small model calls for 24 layers, dim 1024, 16 heads, which, when initialized, comes to ~420M parameters (a rough estimate is below). Is the config incorrect?
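Here is my back-of-the-envelope count. The assumptions are mine, not read out of the code: a cross-attention sublayer in every block, the standard 4x FFN expansion, and 4 codebooks of cardinality 2048 for the embedding tables and output heads (per the paper/blog):

```python
# Rough parameter count for the "small" config (24 layers, dim 1024).
# The head count does not affect the total, since the q/k/v/out
# projections have the same combined shape regardless of heads.
dim, layers = 1024, 24
n_q, card = 4, 2048  # codebooks and cardinality, per the paper

self_attn = 4 * dim * dim       # q, k, v, out projections
cross_attn = 4 * dim * dim      # assumed same shape as self attention
ffn = 2 * dim * (4 * dim)       # up and down projections, 4x expansion
per_layer = self_attn + cross_attn + ffn

embeddings = n_q * card * dim   # input embedding tables
heads = n_q * dim * card        # output projections

total = layers * per_layer + embeddings + heads
print(f"{total / 1e6:.0f}M parameters")  # ~419M, matching what I see
```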
Thanks!