facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

Ambiguity in MusicGen architecture #468

Open rtavasso1 opened 3 weeks ago

rtavasso1 commented 3 weeks ago

I have three discrepancies between what is described in the paper versus what I see in the code/blog posts.

  1. The recent publication of MMD included a figure showing a concatenation operation between the audio embeddings and the output of the cross-attention. I cannot find this operation in the code for the LM.

  2. There is no linear layer after the cross-attention block that I can see in the code.

  3. The config for the small model calls for 24 layers, dim 1024, and 16 heads, which when initialized gives ~420M parameters. Is the config incorrect? Thanks!
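For what it's worth, the ~420M figure in item 3 can be reproduced with a back-of-the-envelope count. This is only a sketch, not the actual audiocraft code: it assumes an FFN width of 4×dim, a cross-attention block in every layer, 4 codebooks with a vocabulary of 2048 each (standard for MusicGen's EnCodec tokens), tied-size input embeddings and output heads, and it ignores biases and norm parameters.

```python
# Rough parameter estimate for a transformer decoder matching the
# quoted "small" config: 24 layers, dim 1024, 16 heads.
# Assumptions (not read from the audiocraft source): FFN mult 4,
# cross-attention in every layer, 4 codebooks x 2048 vocab,
# biases and layer norms ignored.

def estimate_params(layers=24, dim=1024, ffn_mult=4,
                    codebooks=4, vocab=2048):
    self_attn = 4 * dim * dim        # Wq, Wk, Wv, Wo projections
    cross_attn = 4 * dim * dim       # same shapes, conditioning source
    ffn = 2 * ffn_mult * dim * dim   # two linear layers, 4x expansion
    per_layer = self_attn + cross_attn + ffn
    embeddings = codebooks * vocab * dim  # input token embeddings
    heads = codebooks * vocab * dim       # per-codebook output heads
    return layers * per_layer + embeddings + heads

print(f"{estimate_params() / 1e6:.0f}M")  # prints "419M"
```

Under these assumptions the attention and FFN blocks alone account for ~403M of the total, so a 24-layer / dim-1024 config cannot land at the 300M usually quoted for the small model; the discrepancy is in the config values, not the embedding bookkeeping.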

DBraun commented 1 week ago
  1. I think the EnCodec-MMD is still on this branch: https://github.com/jmlemercier/audiocraft/blob/encodec-mmd/docs/MMD.md
  2. Look at this stack trace:

  3. I have a similar question about the small model here: https://github.com/facebookresearch/audiocraft/issues/169#issuecomment-2184402138