facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

MAGNeT inference failing with ValueError: Invalid shape for attention bias #390

Closed nateraw closed 5 months ago

nateraw commented 5 months ago

Hey there! :) Thank you for the awesome work here.

Really looking forward to playing with the MAGNeT models, but I'm getting the following error when running inference with the new MAGNeT model code from the main branch.

ValueError: Invalid shape for attention bias: torch.Size([498, 498]) (expected (6, 16, 498, 498))
  query.shape: torch.Size([6, 498, 16, 64])
  key.shape  : torch.Size([6, 498, 16, 64])
  value.shape: torch.Size([6, 498, 16, 64])

Reproducible example in Colab (including full stack trace) here: https://gist.github.com/nateraw/4b5fb6d62a37df83e3528f410ad438c9

CC @lonzi, as you were the one who merged this.

xpeng commented 5 months ago

Confirmed when using the magnet-medium-30secs model.

ValueError: Invalid shape for attention bias: torch.Size([1500, 1500]) (expected (4, 24, 1500, 1500))
  query.shape: torch.Size([4, 1500, 24, 64])
  key.shape  : torch.Size([4, 1500, 24, 64])
  value.shape: torch.Size([4, 1500, 24, 64])
akashicMarga commented 5 months ago

Seems like an issue with all checkpoints. I tried all the models and none of them worked; all of them threw the same shape error.

mkfold commented 5 months ago

I haven't signed the CLA, but here's a patch to audiocraft/modules/transformer.py that should get you up and running:

diff --git a/audiocraft/modules/transformer.py b/audiocraft/modules/transformer.py
index 818e98c..1565ec8 100644
--- a/audiocraft/modules/transformer.py
+++ b/audiocraft/modules/transformer.py
@@ -405,6 +405,11 @@ class StreamingMultiheadAttention(StreamingModule):
                     seq_len = query.shape[1]
                     attn_mask = attn_mask.to(q.dtype)
                     attn_mask = attn_mask[:seq_len, :seq_len]
+                    if time_dim == 1:
+                        n, _, h, *_ = q.shape
+                    else:
+                        n, h, *_ = q.shape
+                    attn_mask = attn_mask.expand(n, h, -1, -1)
                 p = self.dropout if self.training else 0
                 if _efficient_attention_backend == 'torch':
                     x = torch.nn.functional.scaled_dot_product_attention(

I'm not sure whether the attention mask is meant to be broadcast across examples/heads like this, but training and inference work now.
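For anyone trying to understand what the patch does: this is not AudioCraft code, just a standalone PyTorch sketch of the shape mismatch. Newer attention backends expect the additive bias to carry explicit batch and head dimensions, `(batch, heads, seq, seq)`, while the original code passed a plain `(seq, seq)` mask; the `expand` in the patch adds those leading dimensions without copying data. The dimensions below are made up for illustration.

```python
import torch

# Illustrative shapes only: batch, heads, sequence length, head dim.
n, h, seq, d = 2, 4, 8, 16

q = torch.randn(n, h, seq, d)
k = torch.randn(n, h, seq, d)
v = torch.randn(n, h, seq, d)

# A 2D additive bias of shape (seq, seq), as in the original code.
mask_2d = torch.zeros(seq, seq)

# Expand to (n, h, seq, seq) -- a broadcasting view, no memory copy.
mask_4d = mask_2d.expand(n, h, -1, -1)

# With the 4D mask, the attention call matches the expected bias shape.
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask_4d)
assert out.shape == (n, h, seq, d)
```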

akashicMarga commented 5 months ago

How does the output sound after this change? @mkfold

lonzi commented 5 months ago

Hi @nateraw, @xpeng, @singhaki, @mkfold. Thanks for pointing out this issue! We've reproduced it and will fix it ASAP.

joe-none416 commented 5 months ago

Another temporary workaround: I followed the link and rebuilt xformers at version v0.0.20, and it works.
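If a prebuilt wheel is available for your platform, pinning the version may be enough, without rebuilding from source (a hedged sketch; version availability depends on your Python/CUDA setup):

```shell
# Pin the xformers release that the current code works with.
# Falls back to building from source if no wheel matches your platform.
pip install xformers==0.0.20
```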

lonzi commented 5 months ago

Yes, this should indeed be a valid workaround. I can confirm the current code works with xformers 0.0.20 and does not work with xformers 0.0.22.

We will try to make it compatible for both

lonzi commented 5 months ago

@nateraw @xpeng @singhaki @mkfold - see our fix in: https://github.com/facebookresearch/audiocraft/commit/7dece43a4d186e47e5e1c67983ed10a99f225948

It should be compatible with both xformers 0.0.20 and 0.0.22. This is still under testing and review, and hasn't been merged to the main branch yet.

lonzi commented 5 months ago

Fixed in: https://github.com/facebookresearch/audiocraft/commit/2a5c5e971915aa03bd99defe44bde2a4bebb4361

Closing the issue.