Inquiry on Audio Prompts Implementation in musicgen Model

I am currently exploring the musicgen model and have some questions about how audio prompts are used within the model's architecture, particularly in relation to the cross_attention layers:

Role of Audio Prompts: Is the audio prompt passed to the cross_attention layers of the musicgen model as their conditioning input, or does it enter the model somewhere else?

musicgen (printed module structure):
(transformer): StreamingTransformer(
  (layers): ModuleList(
    (0-47): 48 x StreamingTransformerLayer(
      (self_attn): StreamingMultiheadAttention(
        (out_proj): Linear(in_features=2048, out_features=2048, bias=False)
      )
      (linear1): Linear(in_features=2048, out_features=8192, bias=False)
      (dropout): Dropout(p=0.0, inplace=False)
      (linear2): Linear(in_features=8192, out_features=2048, bias=False)
      (norm1): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(p=0.0, inplace=False)
      (dropout2): Dropout(p=0.0, inplace=False)
      (layer_scale_1): Identity()
      (layer_scale_2): Identity()
      (cross_attention): StreamingMultiheadAttention(
        (out_proj): Linear(in_features=2048, out_features=2048, bias=False)
      )
      (dropout_cross): Dropout(p=0.0, inplace=False)
      (norm_cross): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
      (layer_scale_cross): Identity()
    )
  )
)
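To make the question concrete, here is a minimal sketch that pulls the attention sub-modules out of the printed structure. It only parses the text of the printout (abbreviated here to the attention entries) and does not call any audiocraft API; the names `MODEL_REPR` and `summarize_attention` are hypothetical and used for illustration only.

```python
import re

# Abbreviated copy of the printed structure: only the attention
# sub-modules of the repeated layer are kept (assumption: the omitted
# entries are irrelevant to the question).
MODEL_REPR = """\
(transformer): StreamingTransformer(
  (layers): ModuleList(
    (0-47): 48 x StreamingTransformerLayer(
      (self_attn): StreamingMultiheadAttention(
        (out_proj): Linear(in_features=2048, out_features=2048, bias=False)
      )
      (cross_attention): StreamingMultiheadAttention(
        (out_proj): Linear(in_features=2048, out_features=2048, bias=False)
      )
    )
  )
)
"""


def summarize_attention(repr_text):
    """Return (number of repeated layers, attention sub-module names per layer)."""
    # "(0-47): 48 x ..." is how PyTorch prints identical, repeated children.
    repeat = re.search(r"\(\d+-\d+\): (\d+) x ", repr_text)
    num_layers = int(repeat.group(1)) if repeat else 1
    # Collect the attribute names of every StreamingMultiheadAttention entry.
    names = re.findall(r"\((\w+)\): StreamingMultiheadAttention\(", repr_text)
    return num_layers, names


num_layers, attn_names = summarize_attention(MODEL_REPR)
print(num_layers, attn_names)
```

Each of the 48 layers therefore has both a self_attn and a cross_attention module, so an audio prompt could in principle reach the model through either path; which one is actually used is what I am asking about.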

Request for Training Code: Could you provide examples or documentation on how to properly use audio prompts as model inputs during training?

Thank you for your time and assistance.

facebookresearch/audiocraft, issue #451