Closed jbmaxwell closed 11 months ago
Okay, I found it in a yaml:
transformer_lm:
n_q: 4
card: 1024 # <-- Here!
Fixing this got me to a new error about the number of codebooks which, digging around further, was connected to n_q.
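To make the two knobs concrete, here is a toy sketch (my own illustration, not audiocraft's code) of what `card` and `n_q` have to agree on between the LM and the compression model:

```python
# Toy sketch: for an RVQ codec like EnCodec, `card` is the size of each
# residual codebook (distinct tokens per codebook) and `n_q` is how many
# codebooks are stacked per audio frame. The LM config must match both,
# otherwise the cardinality / codebook-count assertions fire.

transformer_lm = {"n_q": 4, "card": 1024}     # values from the yaml above
compression_model = {"n_q": 4, "card": 1024}  # EnCodec's codebooks hold 1024 entries

assert transformer_lm["card"] == compression_model["card"], "cardinality mismatch"
assert transformer_lm["n_q"] == compression_model["n_q"], "codebook-count mismatch"
```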
I fixed that and moved on to a new error:
File "/home/james/src/somms/audiocraft/audiocraft/modules/codebooks_patterns.py", line 334, in __init__
assert len(self.delays) == self.n_q
AssertionError
I see this is connected to codebooks_patterns.py, but I'm not sure how to set this. More generally, does anybody know how to get an LM that will work with 48kHz stereo from Facebook/encodec_48khz?
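For anyone hitting the same assertion: it just says the delays list needs exactly one entry per codebook. A toy sketch of a delayed interleaving (my own illustration, not audiocraft's actual implementation):

```python
# Toy sketch of a "delay" codebook pattern: codebook q at output step t reads
# the token from input frame t - delays[q]. One delay per codebook, hence the
# assertion len(delays) == n_q.

def delayed_layout(n_q, n_frames, delays):
    assert len(delays) == n_q, "need exactly one delay per codebook"
    steps = n_frames + max(delays)
    # None marks positions the real model fills with a special token.
    return [
        [t - d if 0 <= t - d < n_frames else None for d in delays]
        for t in range(steps)
    ]

layout = delayed_layout(4, 3, [0, 1, 2, 3])
assert len(layout) == 6                      # delays stretch the sequence
assert layout[0] == [0, None, None, None]    # only codebook 0 fires at step 0
assert layout[3] == [None, 2, 1, 0]
```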
Okay, this seems to be solved with:
codebooks_pattern:
modeling: parallel
delays: 16
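As I understand it (again, a toy illustration rather than audiocraft's code), parallel modeling is the degenerate case where every delay is zero, so all n_q codebooks for a frame are predicted at the same step:

```python
# Toy sketch of a "parallel" pattern: all codebooks are emitted together,
# equivalent to a delay pattern with every delay = 0, so the sequence length
# equals the number of frames.

def parallel_layout(n_q, n_frames):
    return [[t] * n_q for t in range(n_frames)]

layout = parallel_layout(4, 3)
assert len(layout) == 3           # no extra steps added by delays
assert layout[1] == [1, 1, 1, 1]  # all 4 codebooks read frame 1 at once
```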
New error: ValueError: The input length is not properly padded for batched chunked decoding. Make sure to pad the input correctly.
Is padding handled by the data provider or in the LM?
Okay, my audio is in ~44s segments (2**21 samples), and it gets me one step (maybe) further using:
dataset:
batch_size: 12 # 2 GPUs
sample_on_weight: false # Uniform sampling all the way
sample_on_duration: false # Uniform sampling all the way
segment_duration: 1.0 # <--- this
min_audio_duration: 1.0 # <-- and this
I suppose I probably don't need to use such a short segment, so I may bump the segment_duration up.
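A quick sanity check on the numbers above, assuming 48 kHz audio:

```python
# Sanity arithmetic for the segment sizes mentioned above, assuming 48 kHz.
sample_rate = 48_000

full_clip = 2**21                                   # the ~44 s source segments
assert round(full_clip / sample_rate, 1) == 43.7    # 2**21 / 48000 ≈ 43.69 s

segment_duration = 1.0                              # from the yaml
samples_per_segment = int(segment_duration * sample_rate)
assert samples_per_segment == 48_000                # per channel; stereo doubles the data
```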
But... a new error: AssertionError: Scaled compression model not supported with LM.
UPDATE: I don't think this was a great solution, since it only works if I indicate segment_duration: 1.0, which doesn't make much sense. So I'm back at the previous error: ValueError: The input length is not properly padded for batched chunked decoding. Make sure to pad the input correctly.
I think the model you are trying to use won't work (I'm guessing the public stereo EnCodec model?)
Regarding this message: "The input length is not properly padded for batched chunked decoding", it is a bit strange; can you provide a more complete traceback?
Yes, I've been informed that the likely reason is because it uses a "scaled" EnCodec model, where each token also has a normalization factor. So you can't just grab the sequence of tokens, but also need the normalizations, which the LM isn't learning.
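To spell out why that breaks things, here is a toy sketch (illustrative only) of how a scaled codec decodes, and why an LM that only predicts tokens can't drive it:

```python
# Toy sketch: in a "scaled" EnCodec model each encoded chunk is effectively
# (codes, scale) -- the waveform chunk is normalized by `scale` before
# quantization, and decoding multiplies the scale back in.

def decode_chunk(codes, scale, dequantize):
    return [scale * x for x in dequantize(codes)]

# An identity "dequantizer", just to make the point runnable.
dequantize = lambda codes: codes

codes, scale = [0.5, -0.25, 1.0], 2.0
assert decode_chunk(codes, scale, dequantize) == [1.0, -0.5, 2.0]
# An LM that only predicts `codes` has no way to produce `scale`, hence:
# "Scaled compression model not supported with LM."
```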
Regarding the padding error, it's related to the overlapping of chunks in the 48khz encodec model. I noticed the 32khz model doesn't have any overlaps. Printing the input_length, chunk_length, and stride from encodec I get:
input_length: 960000, chunk_length: 960000, stride: 960000
Whereas from the 48khz model I get:
input_length: 1440000, chunk_length: 48000, stride: 47520
with no way (I can find) of manually setting the chunk_length and stride. Of course, I'm assuming I wouldn't want to anyway, since it's obviously set that way for a reason (the 48khz model uses an overlap, whereas the 32khz does not).
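Given those numbers, the constraint can be checked directly. This is my own reconstruction of the check (not encodec's exact code), assuming a batched chunked decode needs (input_length - chunk_length) to be a whole number of strides:

```python
# Hedged reconstruction of the padding constraint for overlapping chunked
# processing: the input must tile exactly into chunks of `chunk_length`
# advanced by `stride`, i.e. (input_length - chunk_length) % stride == 0.

def is_properly_padded(input_length, chunk_length, stride):
    return input_length >= chunk_length and (input_length - chunk_length) % stride == 0

def padded_length(input_length, chunk_length, stride):
    """Smallest valid length >= input_length."""
    if input_length <= chunk_length:
        return chunk_length
    n_strides = -(-(input_length - chunk_length) // stride)  # ceiling division
    return chunk_length + n_strides * stride

# 32khz-style numbers: one chunk covers the whole input, always fine.
assert is_properly_padded(960_000, 960_000, 960_000)

# 48khz-style numbers: overlapping chunks, and 1_440_000 does not tile.
assert not is_properly_padded(1_440_000, 48_000, 47_520)
assert padded_length(1_440_000, 48_000, 47_520) == 1_473_600  # pad by 33_600
```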
I'm trying to test training from scratch, using 48khz stereo audio, and I'm hitting the error:
AssertionError: ("Cardinalities of the LM and compression model don't match: ", 'LM cardinality is 2048 vs ', 'compression model cardinality is 1024')
Where (or how) can I set the LM cardinality?