Stability-AI / stable-audio-tools

Generative models for conditional audio generation
MIT License
2.67k stars, 252 forks

Loss about vae and diffusion #63

Open 980202006 opened 5 months ago

980202006 commented 5 months ago

I am training the VAE and Stable Audio 2 models from scratch. What values should the VAE and diffusion losses reach? My current VAE loss is about 0.4, and the diffusion MSE loss is 0.53.

BingliangLi commented 5 months ago

Your VAE loss is really low. May I ask how many steps you trained the VAE for? I'm currently training the VAE, at about 300000 steps, and the loss is about 3.5. I'm using quite a large dataset.

980202006 commented 5 months ago

Sorry, I remembered it wrong. VAE loss is about 1.0, but the effect is still not good enough. [image] Can you tell me about your data volume and distribution? Maybe it's a coding problem?

BingliangLi commented 5 months ago

I'm using the stable_audio_2_0_vae.json config; I didn't change the model. Maybe your dataset is not large enough? I'm using Audioset + VGGSound + mtg-jamendo + BBCSoundEffect + CommonVoice + free_to_use_sound and a bunch of other smaller datasets. Here is my sample:

https://github.com/Stability-AI/stable-audio-tools/assets/49446651/267ec9b9-3b93-474f-ae75-3a30963491d6

980202006 commented 5 months ago

It is true that your data set is more complex and the loss should be larger. Is there a demo of the reconstruction effect?
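One reason raw loss values are hard to compare across runs: assuming the logged VAE loss is a weighted sum of several terms (for example spectral, adversarial, feature-matching and KL terms), its absolute value depends on the weights as much as on reconstruction quality. A minimal sketch with made-up term values and weights, none of which come from this thread or from any real config:

```python
def total_loss(terms: dict, weights: dict) -> float:
    """Weighted sum of individual loss terms (hypothetical helper)."""
    return sum(weights[name] * value for name, value in terms.items())

# Made-up per-term values for one training step (illustration only).
terms = {"spectral": 0.9, "adversarial": 0.5, "feature_matching": 0.1, "kl": 100.0}

# Two hypothetical weightings that differ only in the spectral weight.
weights_a = {"spectral": 1.0, "adversarial": 0.1, "feature_matching": 5.0, "kl": 1e-4}
weights_b = {"spectral": 2.0, "adversarial": 0.1, "feature_matching": 5.0, "kl": 1e-4}

loss_a = total_loss(terms, weights_a)
loss_b = total_loss(terms, weights_b)
print(f"weights_a total: {loss_a:.2f}")
print(f"weights_b total: {loss_b:.2f}")
```

The same underlying terms give different totals under the two weightings, so comparing individual logged terms (and listening to reconstructions) is likely more informative than comparing the single total.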

BingliangLi commented 5 months ago

> It is true that your data set is more complex and the loss should be larger. Is there a demo of the reconstruction effect?

The video in my comment above is the demo; I think it's quite good.

980202006 commented 5 months ago

Yes! This reconstruction works very well. What is the loss of the diffusion model you trained?

BingliangLi commented 5 months ago

I will start training the diffusion model in the next few days and will post an update in this issue :)

980202006 commented 5 months ago

Waiting for your good news!

nateraw commented 5 months ago

Thank you for sharing @BingliangLi ❤️

PeiwenSun2000 commented 5 months ago

Would you kindly share your model config and dataset config for beginners' reference? I would really appreciate it!!

xianshenglee commented 4 months ago

> Sorry, I remembered it wrong. VAE loss is about 1.0, but the effect is still not good enough. [image] Can you tell me about your data volume and distribution? Maybe it's a coding problem?

My diffusion MSE loss is 0.8 and I have the same concern as you. Could you please also share your diffusion loss curve? Thank you.
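For a rough reference point when reading diffusion MSE values like the 0.53 and 0.8 mentioned in this thread, here is a minimal sketch of the loss a trivial predictor would get. It assumes a v-prediction objective with alpha = cos(t·π/2) and sigma = sin(t·π/2), and latents with roughly unit variance; neither assumption is confirmed in this thread, and the function names are illustrative.

```python
import math
import random

def v_target(x: float, noise: float, t: float) -> float:
    """v-prediction target for one scalar value (assumed cosine schedule)."""
    alpha = math.cos(t * math.pi / 2)
    sigma = math.sin(t * math.pi / 2)
    return alpha * noise - sigma * x

def zero_predictor_mse(num_samples: int = 20000, seed: int = 0) -> float:
    """MSE of a model that always outputs 0, on unit-variance data and noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)      # stand-in for a unit-variance latent value
        noise = rng.gauss(0.0, 1.0)  # Gaussian noise
        t = rng.random()             # timestep sampled uniformly in [0, 1)
        total += v_target(x, noise, t) ** 2
    return total / num_samples

baseline = zero_predictor_mse()
print(f"zero-predictor MSE: {baseline:.3f}")
```

Under these assumptions the target has unit variance at every timestep (alpha² + sigma² = 1), so a model that has learned nothing sits near 1.0. If the real latents are not unit variance, or a different objective is used, this baseline shifts and the comparison no longer holds.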

Alidaling commented 4 months ago

> I'm using the stable_audio_2_0_vae.json config; I didn't change the model. Maybe your dataset is not large enough? I'm using Audioset + VGGSound + mtg-jamendo + BBCSoundEffect + CommonVoice + free_to_use_sound and a bunch of other smaller datasets. Here is my sample:
>
> 侦察_00326001.mp4

This reconstruction works very well. How many GPUs did you use for training, and what was the batch size?

BingliangLi commented 4 months ago

> This reconstruction works very well. How many GPUs did you use for training, and what was the batch size?

I trained the model for 1000000 steps on 8 A100 80G GPUs, with batch size set to 16; however, the reconstruction result is already good enough at around 500000 steps. I haven't started training the diffusion model yet, but I will do it before July.
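To put those numbers in perspective, the step count can be converted into the number of training examples seen. It is not stated here whether the batch size of 16 is per GPU or global, so this sketch computes both; it also assumes no gradient accumulation, and the function name is illustrative.

```python
def examples_seen(steps: int, batch_size: int, num_gpus: int, per_gpu: bool) -> int:
    """Training examples processed after `steps` optimizer steps."""
    # If the batch size is per GPU, each step consumes batch_size * num_gpus examples.
    effective_batch = batch_size * num_gpus if per_gpu else batch_size
    return steps * effective_batch

for steps in (500_000, 1_000_000):
    as_global = examples_seen(steps, batch_size=16, num_gpus=8, per_gpu=False)
    as_per_gpu = examples_seen(steps, batch_size=16, num_gpus=8, per_gpu=True)
    print(f"{steps:>9,} steps: {as_global:,} examples (global batch) "
          f"or {as_per_gpu:,} examples (per-GPU batch)")
```

Anyone reproducing this with fewer GPUs or a different batch size can use the same arithmetic to estimate how many steps correspond to a similar amount of data.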

cvillela commented 2 months ago

Hey there @BingliangLi, did you end up training the diffusion model? Great results! I would love to know what parameters you used.