Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
I have three discrepancies between what is described in the paper versus what I see in the code/blog posts.
The recent publication of MMD included a figure showing a concatenation operation between the audio embeddings and the output of the cross attention. I cannot find this operation in the code for the LM.
There is also no linear layer after the cross-attention block that I can see in the code. A sketch of the path I understand the figure to describe is below.
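To make the question concrete, here is a minimal PyTorch sketch of what the figure appears to show. All names here are mine, not Audiocraft's, and this is not a claim about the actual implementation:

```python
# A minimal sketch of what I understand the figure to describe.
# `ConcatCrossAttentionBlock` and all names below are hypothetical.
import torch
import torch.nn as nn

class ConcatCrossAttentionBlock(nn.Module):
    """Cross attention, then concat with the audio stream, then a linear
    projection back to the model dimension, as the figure appears to show."""
    def __init__(self, dim: int = 1024, num_heads: int = 16):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # The linear layer I cannot find in the code: projects the
        # concatenated [audio ; cross-attn output] back down to `dim`.
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # audio: (B, T, dim) token embeddings; cond: (B, S, dim) text conditioning
        attn_out, _ = self.cross_attn(audio, cond, cond)
        # Per the figure: concatenation followed by a linear layer.
        return self.proj(torch.cat([audio, attn_out], dim=-1))

# By contrast, the code (as far as I can tell) just does the usual
# additive residual, audio + cross_attn(audio, cond, cond), with no
# concatenation and no extra linear layer.
```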
The config for the small model calls for 24 layers, dim 1024, 16 heads, which, when initialized, comes to ~420M parameters (a rough estimate is below). Is the config incorrect?
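Here is my back-of-the-envelope count. The assumptions are mine, not read out of the code: a cross-attention sublayer in every block, the standard 4x FFN expansion, and 4 codebooks of cardinality 2048 for the embedding tables and output heads (per the paper/blog):

```python
# Rough parameter count for the "small" config (24 layers, dim 1024).
# The head count does not affect the total, since the q/k/v/out
# projections have the same combined shape regardless of heads.
dim, layers = 1024, 24
n_q, card = 4, 2048  # codebooks and cardinality, per the paper

self_attn = 4 * dim * dim       # q, k, v, out projections
cross_attn = 4 * dim * dim      # assumed same shape as self attention
ffn = 2 * dim * (4 * dim)       # up and down projections, 4x expansion
per_layer = self_attn + cross_attn + ffn

embeddings = n_q * card * dim   # input embedding tables
heads = n_q * dim * card        # output projections

total = layers * per_layer + embeddings + heads
print(f"{total / 1e6:.0f}M parameters")  # ~419M, matching what I see
```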
Thanks!