Open jbmaxwell opened 10 months ago
Please do share audio samples if you're willing to 👀
Hmm... looks like I'd have to convert to mp4 or something... I'm actually thinking maybe it is generative throughout. My previous samples were all techno/house with no vocals, but I just checked one with a "vocal" and it's pretty clearly not actually saying anything, just kind of mimicking rhythmic patterns of word-like sounds. From there it just falls into a short loop (which I'm pretty sure is under-fitting, since I'm only on epoch 25). But if anyone knows for sure whether it's generating the entire clip, it would be good to know.
After 50 epochs I'm getting the same thing—roughly 8 seconds of music, then some kind of stasis or short loop:
https://github.com/facebookresearch/audiocraft/assets/15166432/f54cf050-a1c8-449f-8fe6-170e10471c7a https://github.com/facebookresearch/audiocraft/assets/15166432/33ec1c1b-00a4-405f-9b35-4d25c907e893 https://github.com/facebookresearch/audiocraft/assets/15166432/c032f308-7fe2-468d-8e9f-9b12a02f6795
This is from my solver config (which is just tweaked from musicgen_base_32khz.yaml):
generate:
  every: 25
  num_workers: 5
  path: samples
  audio:
    format: wav
    strategy: loudness
    sample_rate: ${sample_rate}
    loudness_headroom_db: 14
  lm:
    prompted_samples: true
    unprompted_samples: true
    gen_gt_samples: false
    prompt_duration: null # if not set, will use dataset.generate.segment_duration / 4
    gen_duration: null # if not set, will use dataset.generate.segment_duration
    remove_prompts: false
    # generation params
    use_sampling: false
    temp: 1.0
    top_k: 0
    top_p: 0.0
I see here that "prompt_duration" defaults to segment_duration / 4, which I guess would explain the 8-ish second intro of reasonable music. How many epochs are people training for to get some kind of reasonable output? I'm just fine-tuning on fma_small for now, btw.
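To make the fallback in the config comment concrete, here is a minimal sketch of the arithmetic. The function name `resolve_prompt_duration` is hypothetical and is not part of audiocraft; it only mirrors the comment "if not set, will use dataset.generate.segment_duration / 4", and the segment durations are illustrative.

```python
from typing import Optional


def resolve_prompt_duration(segment_duration: float,
                            prompt_duration: Optional[float] = None) -> float:
    """Hypothetical helper (not an audiocraft function).

    Mirrors the config comment: when prompt_duration is unset (None),
    fall back to a quarter of the segment duration.
    """
    if prompt_duration is None:
        return segment_duration / 4
    return prompt_duration


# Illustrative segment durations, in seconds.
for seg in (10.0, 30.0):
    print(f"segment={seg}s -> prompt={resolve_prompt_duration(seg)}s")
```

Under that assumption, a 30-second segment gives a 7.5-second prompt, which is in the range of the "8-ish seconds" of good audio described above.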
What duration are you training on? 30-second samples?
Yeah, 30 seconds, so 30 / 4 = 7.5 explains my 8-ish seconds of good output. But I think the real problem is that I was inadvertently training from scratch rather than fine-tuning. I updated the defaults section of my solver config to include - override /model/lm/model_scale: medium and it seems much better now. I guess I'd mistakenly assumed that the "default" training mode would be fine-tuning, so I hadn't specifically set the base model? That's a guess, since I haven't seen the intended behaviour documented anywhere.
Do you know where I can find documentation of all the solver config parameters/fields?
I'm also not clear on what is being indicated with the strings "unprompted_description" and "prompted_description" in the sample generations. Is that explained anywhere? (I don't see anything clear in the docs, the paper, or the comments in the musicgen.py solver code.)
I'm interested in this too. In gen_unprompted_outputs, the prompt duration is set with prompt_duration = dataset_duration / 4 rather than None.
If I understand this correctly, "unprompted" samples are still being prompted, hence the high-quality first few seconds.
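One way to check this by ear is to cut a generated sample at the prompt boundary and listen to the two parts separately. The sketch below assumes that, with remove_prompts: false, the prompt audio sits at the start of the saved sample, and that the sample rate is 32 kHz (taken from the musicgen_base_32khz.yaml file name). The helper `split_prompt_and_continuation` is hypothetical, and the clip is placeholder silence rather than real output.

```python
def split_prompt_and_continuation(samples, sample_rate, prompt_duration):
    """Hypothetical helper: split a 1-D sequence of audio samples
    at the prompt boundary (given in seconds)."""
    boundary = int(round(prompt_duration * sample_rate))
    # Clamp so a prompt longer than the clip does not raise an error.
    boundary = max(0, min(boundary, len(samples)))
    return samples[:boundary], samples[boundary:]


# Placeholder data: 30 s of silence at an assumed 32 kHz sample rate.
sample_rate = 32_000
clip = [0.0] * (30 * sample_rate)

# Assumed prompt length: 30 / 4 = 7.5 s.
prompt, continuation = split_prompt_and_continuation(clip, sample_rate, 7.5)
print(len(prompt) / sample_rate, "s prompt,",
      len(continuation) / sample_rate, "s continuation")
```

If the good-sounding part of a sample ends right at the boundary, that would be consistent with the first segment being prompt audio rather than model output.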
I'm fine-tuning MusicGen on my dataset and noticed that the samples generated during training tend to have a very good first 10 seconds, followed by a not-so-great 20 seconds (often very repetitive). Since the underlying algorithm is autoregressive I could see this being a result of under-fitting, but since the first 10 seconds are already surprisingly good, I wondered whether that was just original audio from the eval set. Does the generation run during training include a 10-second seed? I haven't seen this mentioned anywhere.
Thanks in advance for any insights.