facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

Do samples generated during training include a "real" seed? #271

Open jbmaxwell opened 10 months ago

jbmaxwell commented 10 months ago

I'm fine-tuning MusicGen on my dataset and noticed that the samples generated during training tend to have a very good first 10 seconds, followed by a not-so-great 20 seconds (often very repetitive). Since the underlying algorithm is autoregressive I could see this being a result of under-fitting, but since first 10 seconds is already surprisingly good, I wondered whether that was just original audio from the eval set? Does the generation run during training include a 10-second seed? I haven't seen this mentioned anywhere.

Thanks in advance for any insights.

0xlws commented 10 months ago

Please do share audio samples if you're willing to 👀

jbmaxwell commented 10 months ago

Hmm... looks like I'd have to convert to mp4 or something... I'm actually thinking maybe it is generative throughout. My previous samples were all techno/house with no vocals, but I just checked one with a "vocal" and it's pretty clearly not actually saying anything, just kind of mimicking rhythmic patterns of word-like sounds. From there it just falls into a short loop (which I'm pretty sure is under-fitting; I'm only on epoch 25). Still, if anyone knows for sure that it's generating the entire clip, it would be good to have that confirmed.

jbmaxwell commented 10 months ago

After 50 epochs I'm getting the same thing—roughly 8 seconds of music, then some kind of stasis or short loop:

https://github.com/facebookresearch/audiocraft/assets/15166432/f54cf050-a1c8-449f-8fe6-170e10471c7a
https://github.com/facebookresearch/audiocraft/assets/15166432/33ec1c1b-00a4-405f-9b35-4d25c907e893
https://github.com/facebookresearch/audiocraft/assets/15166432/c032f308-7fe2-468d-8e9f-9b12a02f6795

This is from my solver config (which is just tweaked from musicgen_base_32khz.yaml):

generate:
  every: 25
  num_workers: 5
  path: samples
  audio:
    format: wav
    strategy: loudness
    sample_rate: ${sample_rate}
    loudness_headroom_db: 14
  lm:
    prompted_samples: true
    unprompted_samples: true
    gen_gt_samples: false
    prompt_duration: null   # if not set, will use dataset.generate.segment_duration / 4
    gen_duration: null      # if not set, will use dataset.generate.segment_duration
    remove_prompts: false
    # generation params
    use_sampling: false
    temp: 1.0
    top_k: 0
    top_p: 0.0

I see here that "prompt_duration" is segment_duration / 4, which I guess would explain the 8-ish second intro of reasonable music. How many epochs are people training on to get some kind of reasonable output? I'm just fine-tuning on fma_small for now, btw.
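As a quick check of that arithmetic, here is a minimal sketch of the fallback rule described in the config comments above. The function name resolve_durations is hypothetical and this is not audiocraft's own code; it only mirrors the two "if not set" comments on prompt_duration and gen_duration.

```python
from typing import Optional, Tuple


def resolve_durations(
    segment_duration: float,
    prompt_duration: Optional[float] = None,
    gen_duration: Optional[float] = None,
) -> Tuple[float, float]:
    """Apply the defaults described in the config comments (hypothetical helper).

    - prompt_duration falls back to segment_duration / 4
    - gen_duration falls back to segment_duration
    """
    if prompt_duration is None:
        prompt_duration = segment_duration / 4
    if gen_duration is None:
        gen_duration = segment_duration
    return prompt_duration, gen_duration


# With 30-second training segments and both fields left as null:
prompt_s, gen_s = resolve_durations(30.0)
print(prompt_s, gen_s)  # 7.5 30.0
```

With 30-second segments this gives a 7.5-second prompt, which lines up with the "8-ish seconds" of good audio at the start of each sample.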

adefossez commented 10 months ago

On what duration are you training? 30-second samples?

jbmaxwell commented 10 months ago

Yeah, 30 seconds, so 30 / 4 = 7.5 explains my 8-ish seconds of good output. But I think the real problem is that I was inadvertently training from scratch, rather than fine-tuning. I updated the defaults section of my solver config to include "- override /model/lm/model_scale: medium" and it seems much better now. I guess I'd mistakenly assumed that the "default" training mode would be fine-tuning, so I hadn't specifically set the base model? That's a guess, since I haven't seen the intended behaviour documented anywhere.

Do you know where I can find documentation of all the solver config parameters/fields?

jbmaxwell commented 10 months ago

I'm also not clear on what is being indicated with the strings "unprompted_description" and "prompted_description" in the sample generations. Is that explained anywhere? (I don't see anything clear in the docs, the paper, or the comments in the musicgen.py solver code.)

eonglints commented 9 months ago

I'm interested in this too. In gen_unprompted_outputs, the prompt_duration is set to prompt_duration = dataset_duration / 4 rather than None. If I understand this correctly, "unprompted" samples are still being prompted, hence the high-quality first few seconds.
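If that reading is right, and the prompt is kept in the saved audio (remove_prompts: false in the config above), one way to judge the model's own output is to cut the prompt off before listening. The sketch below is illustrative only: split_prompt is a hypothetical helper, not part of audiocraft, and the 32 kHz sample rate is an assumption taken from the musicgen_base_32khz config name.

```python
from typing import List, Sequence, Tuple


def split_prompt(
    samples: Sequence[float],
    sample_rate: int,
    prompt_duration: float,
) -> Tuple[List[float], List[float]]:
    """Split a mono waveform into (prompt, continuation) at prompt_duration seconds.

    Hypothetical helper: it assumes the prompt occupies the first
    prompt_duration seconds of the saved sample.
    """
    cut = int(round(prompt_duration * sample_rate))
    cut = max(0, min(cut, len(samples)))  # clamp to the waveform length
    return list(samples[:cut]), list(samples[cut:])


# Assumed values: 32 kHz audio, 30-second sample, 30 / 4 = 7.5-second prompt.
sample_rate = 32000
waveform = [0.0] * (30 * sample_rate)  # placeholder for a loaded sample
prompt, continuation = split_prompt(waveform, sample_rate, 7.5)
print(len(prompt), len(continuation))  # 240000 720000
```

Listening to the continuation on its own should make it easier to tell whether the looping starts right where the model takes over from the prompt.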