Stability-AI / stable-audio-tools

Generative models for conditional audio generation
MIT License
2.57k stars · 240 forks

Possibility of generating songs like Suno #65

Open Xiaoyiyong555 opened 4 months ago

Xiaoyiyong555 commented 4 months ago

Great work! I'd also like to ask whether you've tried using lyrics as input to generate the corresponding singing, similar to Suno's approach. Does the DiT structure support this form of generation?

GoombaProgrammer commented 4 months ago

Exactly what I came here for

I am currently installing this and trying it out. If I find a way (or maybe there already is a way) I will tell you

xianshenglee commented 4 months ago

Interesting! Looking forward to your good news!

qiao131 commented 3 months ago

Looking forward to your experimental results.

z592694590 commented 2 months ago

Yes. I trained my own model on a small song dataset. The demo is here. The BGM doesn't sound bad, but the vocals are totally wrong, which confuses me. The lyric is "Because maybe \t You're gonna be the one that saves me \t And after all \t You're my wonderwall \t Because maybe \t You're gonna be the one that saves me \t And after all \t You're my wonderwall". English: https://github.com/user-attachments/assets/3b75632d-1652-4e67-93ce-c71218265a0c

https://github.com/user-attachments/assets/06f7d3fc-b728-4e7b-98f0-6bdef48cd846

Chinese lyric: 明月几时有 \t 把酒问青天 \t 不知天上宫阙 \t 今夕是何年 \t 我欲乘风归去 \t 又恐琼楼玉宇 \t 高处不胜寒 \t 起舞弄清影 \t 何似在人间

https://github.com/user-attachments/assets/e02e81bc-d09b-41f2-b4e9-ea85746703f4

Xiaoyiyong555 commented 2 months ago

It's so great! Does the model size reach 1B? How did you introduce the lyrics condition into the model?

z592694590 commented 2 months ago

Yes, the model size is the same as Stability AI's. The lyric condition and CLAP are included via cross-attention.
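For readers following along: a minimal sketch of what lyric/CLAP conditioning via cross-attention can look like in a DiT block. This is an illustrative PyTorch toy with hypothetical module and variable names, not the commenter's actual code.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """One transformer block where noisy audio latents attend to
    conditioning embeddings (e.g. lyric tokens plus a CLAP vector)."""

    def __init__(self, dim: int, cond_dim: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(
            dim, n_heads, kdim=cond_dim, vdim=cond_dim, batch_first=True
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (batch, audio_len, dim)      -- noisy audio latents
        # cond: (batch, cond_len, cond_dim)  -- lyric/CLAP embeddings
        attn_out, _ = self.attn(self.norm(x), cond, cond)
        return x + attn_out  # residual connection

# Toy usage: a 1B-scale model stacks many such blocks; here just one.
block = CrossAttentionBlock(dim=64, cond_dim=32)
latents = torch.randn(2, 100, 64)   # fake audio latents
lyrics = torch.randn(2, 24, 32)     # fake conditioning sequence
out = block(latents, lyrics)
print(out.shape)  # torch.Size([2, 100, 64])
```

The key point is that the conditioning sequence length is independent of the audio length; cross-attention lets every latent frame look at every lyric token.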

Xiaoyiyong555 commented 2 months ago

How did you tokenize the lyric words? Did you feed the lyrics into CLAP, or use T5 like an LLM does?
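For context on this question: the two routes differ in what the conditioner emits. CLAP's text branch collapses the whole lyric into a single summary embedding, while a T5-style encoder keeps one embedding per token, which gives cross-attention something to align with. A toy illustration of that contrast, using a made-up vocabulary and random embeddings in place of the real pretrained encoders:

```python
import torch

# Toy contrast: "T5-style" per-token sequence vs "CLAP-style" pooled vector.
lyric = "you're gonna be the one that saves me"
vocab = {w: i for i, w in enumerate(sorted(set(lyric.split())))}
ids = torch.tensor([vocab[w] for w in lyric.split()])

embed = torch.nn.Embedding(len(vocab), 16)
per_token = embed(ids)           # T5-style: (n_tokens, 16), one vector per word
pooled = per_token.mean(dim=0)   # CLAP-style: (16,), one vector for the whole text

print(per_token.shape, pooled.shape)
```

With only a pooled vector, the model knows roughly *what* is sung but has no per-word sequence to attend over, which is one plausible reason pooled-only conditioning struggles with intelligible vocals.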

qiao131 commented 2 months ago

Great try! May I ask approximately how many songs are in the small dataset you mentioned? I've found that the lyrics are not entirely incorrect, but there is an issue with their positioning. Have you added positional information for each phoneme?

GoombaProgrammer commented 2 months ago

Yoooo, this is really cool. I would like to finetune on that! Do you have a repo, or is it your own closed-source thing?

z592694590 commented 2 months ago

I use the RoPE implemented in this repo. Maybe I should use absolute positional embeddings.
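For reference, the absolute positional embedding being considered is the standard fixed sinusoidal one from "Attention Is All You Need". A self-contained sketch (not this repo's code):

```python
import math
import torch

def sinusoidal_positions(seq_len: int, dim: int) -> torch.Tensor:
    """Classic fixed absolute positional embedding (sin on even dims, cos on odd)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))                        # (dim/2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Adding absolute position info to (hypothetical) lyric-token embeddings:
tokens = torch.randn(32, 64)                    # 32 lyric tokens, dim 64
tokens = tokens + sinusoidal_positions(32, 64)
print(tokens.shape)  # torch.Size([32, 64])
```

Unlike RoPE, which rotates query/key pairs and only encodes relative offsets, this injects each token's absolute index directly, which is one reason it is worth trying when lyric *placement* within the song is the failure mode.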

z592694590 commented 2 months ago

I used the code from this repository but made some changes. My code is quite chaotic. Once I resolve the pronunciation issues, I'll consider open-sourcing it. Overall, it's quite similar to Stability AI's repo.

wbs2788 commented 2 months ago

So cooool! I'm really curious about your training dataset format and size. Did you just use txt files, or more detailed info (like timestamps, etc.)? And what dataset size allowed you to achieve such interesting quality? Again, so amazing.

Xiaoyiyong555 commented 2 months ago

I'm very interested in your work. In fact, I'm doing similar work. Can you leave your email address? We can share related work with each other.

z592694590 commented 2 months ago

My dataset is very small, about 100 hours, including several open-source datasets. I used the CLAP condition, the lyric condition, and the other conditions discussed in Stability AI's paper.
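To make the dataset question above concrete: stable-audio-tools attaches conditioning to each example via per-song metadata. The record below is an illustrative guess at what such metadata might contain for lyric-conditioned training; the key names and values are hypothetical, not the commenter's actual schema.

```python
import json

# Hypothetical per-song metadata record for lyric-conditioned training.
# "seconds_start"/"seconds_total" mirror the timing conditions described
# in Stability AI's paper; the rest is illustrative.
record = {
    "path": "songs/example_track.flac",
    "lyrics": "Because maybe \t You're gonna be the one that saves me",
    "prompt": "acoustic rock ballad, male vocals",  # text fed to CLAP
    "seconds_start": 0,
    "seconds_total": 285,
}
print(json.dumps(record, indent=2))
```

At roughly 100 hours, a dataset like this is tiny by text-to-music standards, which is consistent with the demos getting timbre and BGM right while vocal pronunciation lags.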

wbs2788 commented 2 months ago

Thanks!

wujian752 commented 2 months ago

Which version do you use, 1.0 or 2.0?

z592694590 commented 2 months ago

I used version 2.0.

wujian752 commented 2 months ago

Thanks. How many epochs was the model trained for? The result sounds really cool.

z592694590 commented 2 months ago

I didn't train it for very long, about 20-30 epochs. In fact, the demo doesn't sound good enough yet.

GoombaProgrammer commented 2 months ago

I am trying to add lyrics support too, but I don't really understand how to add anything. The codebase is big compared to what I usually work on.

z592694590 commented 2 months ago

I used version 2.0. The diffusion transformer has a cross-attention input.
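For anyone (like the commenter above) trying to figure out where lyric conditioning would plug in: stable-audio-tools wires conditioners to the DiT's cross-attention input through the model's JSON config. The fragment below is a rough sketch of the kind of conditioning section involved; the exact keys and types should be checked against the repo's own example configs rather than taken from here.

```python
import json

# Sketch of a conditioning section for a lyric-conditioned DiT config.
# Structure modeled loosely on stable-audio-tools example configs;
# treat every key here as a guess to verify against the real configs.
conditioning = {
    "configs": [
        {"id": "lyrics", "type": "t5", "config": {"t5_model_name": "t5-base"}},
        {"id": "prompt", "type": "clap_text", "config": {}},
    ],
    "cond_dim": 768,
}
# IDs listed here would be routed into the DiT's cross-attention input.
cross_attention_cond_ids = ["lyrics", "prompt"]

print(json.dumps(conditioning, indent=2))
```

The design point is that each conditioner produces embeddings projected to a shared `cond_dim`, and the model only cross-attends to the conditioner IDs you route to it, so adding lyrics is a config-plus-conditioner change rather than a rewrite of the transformer itself.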

vereon-utb commented 1 month ago

Great job! How do you align text with pronunciation? Traditional singing synthesis requires this alignment step, but it seems that version 2.0 doesn't implement this logic. Is it necessary to implement an alignment module?

wujian752 commented 1 month ago

Did you train it from scratch or finetune it based on the released model?

z592694590 commented 1 month ago

I trained these models from scratch, including CLAP, the autoencoder, and the DiT.

z592694590 commented 1 month ago

I don't have alignment information; I just use the lyric. Maybe I should try it.

GoombaProgrammer commented 17 hours ago

I have got text-to-speech to kinda work with CLAP in cross-attention. No music yet.