facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

Inquiry on Audio Prompts Implementation in musicgen Model #451

Open LiuZH-19 opened 2 months ago

LiuZH-19 commented 2 months ago

I am currently exploring the musicgen model and have some questions about how audio prompts are used within the model's architecture, particularly in relation to the cross_attention layers:

  1. Role of Audio Prompts: Is the audio prompt used as a cross-attention signal within the cross_attention layers of the musicgen model?

    musicgen:
    (transformer): StreamingTransformer(
      (layers): ModuleList(
        (0-47): 48 x StreamingTransformerLayer(
          (self_attn): StreamingMultiheadAttention(
            (out_proj): Linear(in_features=2048, out_features=2048, bias=False)
          )
          (linear1): Linear(in_features=2048, out_features=8192, bias=False)
          (dropout): Dropout(p=0.0, inplace=False)
          (linear2): Linear(in_features=8192, out_features=2048, bias=False)
          (norm1): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.0, inplace=False)
          (dropout2): Dropout(p=0.0, inplace=False)
          (layer_scale_1): Identity()
          (layer_scale_2): Identity()
          (cross_attention): StreamingMultiheadAttention(
            (out_proj): Linear(in_features=2048, out_features=2048, bias=False)
          )
          (dropout_cross): Dropout(p=0.0, inplace=False)
          (norm_cross): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
          (layer_scale_cross): Identity()
        )
      )
    )
  2. Request for Training Code: Could you provide examples or documentation on how to properly use audio prompts as model inputs during training?
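
To make the question about the role of audio prompts concrete, here is a toy sketch of the two possibilities I have in mind. It is plain Python and not audiocraft code: `attend`, `random_frames`, the frame counts, and the embedding size are all hypothetical and only illustrate the difference between feeding the prompt as a prefix of the decoder sequence and feeding it as a cross-attention source. I do not know which of the two (if either) the model actually uses.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(queries[0])
    out = []
    for i, q in enumerate(queries):
        # With a causal mask, position i may only look at positions <= i.
        limit = i + 1 if causal else len(keys)
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys[:limit]
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values[:limit]))
            for d in range(len(values[0]))
        ])
    return out


def random_frames(n, dim, rng):
    """Hypothetical stand-in for n embedded audio frames of size dim."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
dim = 8
prompt = random_frames(5, dim, rng)     # embedded audio-prompt frames (assumed)
generated = random_frames(3, dim, rng)  # embedded frames generated so far (assumed)

# Option A: the prompt is a prefix of the decoder sequence,
# so it is seen through causal self-attention (self_attn).
sequence = prompt + generated
prefix_out = attend(sequence, sequence, sequence, causal=True)

# Option B: the prompt is a separate conditioning source,
# so it is seen through cross-attention (cross_attention).
cross_out = attend(generated, prompt, prompt, causal=False)

print(len(prefix_out), len(cross_out))  # prints "8 3"
```

In option A the output covers the prompt and the generated frames together, while in option B only the generated frames produce outputs and the prompt acts purely as keys and values.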

Thank you for your time and assistance.