facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

Cardinalities of LM and compression model don't match #235

Closed jbmaxwell closed 11 months ago

jbmaxwell commented 1 year ago

I'm trying to test training from scratch, using 48khz stereo audio, and I'm hitting the error: AssertionError: ("Cardinalities of the LM and compression model don't match: ", 'LM cardinality is 2048 vs ', 'compression model cardinality is 1024')

Where (or how) can I set the LM cardinality?

jbmaxwell commented 1 year ago

Okay, I found it in a yaml:

transformer_lm:
  n_q: 4
  card: 1024       # <-- Here!
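
As an aside, a minimal sketch of what that failing check amounts to (illustrative only, not the solver's actual code): the LM predicts tokens from a vocabulary whose size must equal the compression model's codebook size.

```python
def cardinalities_match(lm_card: int, compression_bins: int) -> bool:
    # The LM predicts tokens from a vocabulary of size `lm_card`; each of the
    # compression model's codebooks has `compression_bins` entries. Training
    # only makes sense when the two agree.
    return lm_card == compression_bins

assert not cardinalities_match(2048, 1024)  # the original mismatch
assert cardinalities_match(1024, 1024)      # after setting transformer_lm.card
```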

Fixing this got me to a new error about the number of codebooks, which, after some more digging, turned out to be connected to n_q. I fixed that and moved on to a new error:

File "/home/james/src/somms/audiocraft/audiocraft/modules/codebooks_patterns.py", line 334, in __init__
    assert len(self.delays) == self.n_q
AssertionError

I see this is connected to codebooks_patterns.py, but I'm not sure how to set this. More generally, does anybody know how to get an LM that will work with 48khz stereo from Facebook/encodec_48khz?
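
For what it's worth, here is a toy sketch of the invariant that assertion enforces (illustrative only, not audiocraft's actual DelayedPatternProvider): a delayed pattern needs exactly one delay per codebook, so `len(delays)` must equal `n_q`.

```python
def build_delayed_pattern(n_q, delays, timesteps):
    """For each step t and codebook q, reference the token originally at
    t - delays[q], or None where the delayed position precedes the start."""
    assert len(delays) == n_q  # the invariant behind the AssertionError above
    return [
        [t - delays[q] if t - delays[q] >= 0 else None for q in range(n_q)]
        for t in range(timesteps)
    ]

pattern = build_delayed_pattern(4, [0, 1, 2, 3], 5)
# At t=0 only codebook 0 has a valid token; by t=3 all four codebooks do.
```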

jbmaxwell commented 1 year ago

Okay, this seems to be solved with:

codebooks_pattern:
  modeling: parallel
  delays: 16

New error: ValueError: The input length is not properly padded for batched chunked decoding. Make sure to pad the input correctly.

Is padding handled by the data provider or in the LM?

jbmaxwell commented 1 year ago

Okay, my audio is in ~44s segments (2**21 samples), and it gets me one step (maybe) further using:

dataset:
  batch_size: 12  # 2 GPUs
  sample_on_weight: false  # Uniform sampling all the way
  sample_on_duration: false  # Uniform sampling all the way
  segment_duration: 1.0     # <-- this
  min_audio_duration: 1.0   # <-- and this
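
As a quick sanity check on the numbers above (a sketch; the 48000 Hz rate is assumed from the model name):

```python
# Segment arithmetic for the 48 kHz setup described above.
sample_rate = 48_000        # assumed from facebook/encodec_48khz
raw_segment = 2 ** 21       # the ~44 s source segments

print(raw_segment / sample_rate)   # 2**21 samples is about 43.69 seconds
print(int(1.0 * sample_rate))      # segment_duration: 1.0 -> 48000 samples
```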

I suppose I probably don't need to use such a short segment, so I may bump the segment_duration up.

but... new error: AssertionError: Scaled compression model not supported with LM.

UPDATE: I don't think this was a great solution, since it only works if I indicate segment_duration: 1.0, which doesn't make much sense. So, I'm back at the previous error: ValueError: The input length is not properly padded for batched chunked decoding. Make sure to pad the input correctly.

adefossez commented 12 months ago

I think the model you are trying to use won't work (I'm guessing the public stereo EnCodec model?)

Regarding this message: "The input length is not properly padded for batched chunked decoding" — it is a bit strange. Can you provide a more complete traceback?

jbmaxwell commented 12 months ago

Yes, I've been informed that the likely reason is that it uses a "scaled" EnCodec model, where each token also has a normalization factor. So you can't just grab the sequence of tokens; you also need the normalizations, which the LM isn't learning.
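
To illustrate the point (a toy sketch with made-up names and a toy scalar quantizer, not EnCodec's real API): a "scaled" codec returns a per-frame scale alongside the codes, and decoding needs both, so an LM that models only the codes can't reconstruct audio.

```python
import numpy as np

def quantize(x, bins=1024):
    # Toy scalar quantizer over [-1, 1] (illustrative stand-in for RVQ).
    return np.clip(((x + 1) / 2 * (bins - 1)).round(), 0, bins - 1).astype(int)

def dequantize(codes, bins=1024):
    return codes / (bins - 1) * 2 - 1

def encode_scaled(frame, bins=1024):
    scale = float(np.abs(frame).max()) + 1e-8  # per-frame normalization factor
    return quantize(frame / scale, bins), scale

def decode_scaled(codes, scale, bins=1024):
    return dequantize(codes, bins) * scale     # codes alone are not enough

frame = np.array([0.5, -2.0, 3.0])
codes, scale = encode_scaled(frame)
# Reconstruction needs `scale`; an LM trained only on `codes` never sees it.
recon = decode_scaled(codes, scale)
```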

Regarding the padding error, it's related to the overlapping of chunks in the 48khz encodec model. I noticed the 32khz model doesn't have any overlaps. Printing the input_length, chunk_length, and stride from encodec I get:

32khz model: input_length: 960000, chunk_length: 960000, stride: 960000
48khz model: input_length: 1440000, chunk_length: 48000, stride: 47520

There's no way (that I can find) of manually setting the chunk_length and stride. Of course, I'm assuming I wouldn't want to anyway, since they're obviously set that way for a reason (the 48khz model uses an overlap, whereas the 32khz does not).
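
For reference, here is the tiling arithmetic I assume sits behind that padding check (a sketch, not the library's actual code): the padded input must split exactly into chunks of `chunk_length` advancing by `stride`.

```python
import math

def padded_length(input_length, chunk_length, stride):
    """Smallest length >= input_length that tiles exactly into overlapping
    chunks of `chunk_length` samples advancing by `stride` samples."""
    if input_length <= chunk_length:
        return chunk_length
    n_steps = math.ceil((input_length - chunk_length) / stride)
    return n_steps * stride + chunk_length

# 32khz model: chunk == stride == input, so no padding is ever needed.
print(padded_length(960000, 960000, 960000))    # 960000

# 48khz model: 1440000 does not tile into 48000-sample chunks with
# stride 47520, so the input would need padding up to the next valid length.
print(padded_length(1440000, 48000, 47520))     # 1473600
```

This would explain why the 32khz model never triggers the error while the 48khz one does: with overlap, most input lengths fall between valid tilings.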