Stability-AI / stable-audio-tools

Generative models for conditional audio generation
MIT License
2.57k stars · 240 forks

Possibility of generating songs like Suno #65

Open Xiaoyiyong555 opened 4 months ago

Xiaoyiyong555 commented 4 months ago

Great work! I'd also like to ask whether you've tried using lyrics as input to generate the corresponding singing, similar to Suno's approach. Does the DiT structure support this form of generation?

GoombaProgrammer commented 4 months ago

Exactly what I came here for

I am currently installing this and trying it out. If I find a way (or maybe there already is a way) I will tell you

xianshenglee commented 4 months ago

Interesting! Looking forward to your good news!

qiao131 commented 3 months ago

Looking forward to your experimental results.

z592694590 commented 2 months ago

Yes. I trained my own model on a small song dataset. The demo is here. The BGM doesn't sound bad, but the vocals are totally wrong, which confuses me. The lyric is "Because maybe \t You're gonna be the one that saves me \t And after all \t You're my wonderwall \t Because maybe \t You're gonna be the one that saves me \t And after all \t You're my wonderwall". English: https://github.com/user-attachments/assets/3b75632d-1652-4e67-93ce-c71218265a0c

https://github.com/user-attachments/assets/06f7d3fc-b728-4e7b-98f0-6bdef48cd846

Chinese lyric: 明月几时有 \t 把酒问青天 \t 不知天上宫阙 \t 今夕是何年 \t 我欲乘风归去 \t 又恐琼楼玉宇 \t 高处不胜寒 \t 起舞弄清影 \t 何似在人间

https://github.com/user-attachments/assets/e02e81bc-d09b-41f2-b4e9-ea85746703f4

Xiaoyiyong555 commented 2 months ago

It's so great! Does the model size reach 1B? How did you introduce the lyrics condition into the model?

z592694590 commented 2 months ago

Yes, the model size is the same as Stability AI's. The lyric condition and CLAP are included via cross-attention.
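For readers following along: a minimal sketch of what lyric/CLAP conditioning via cross-attention can look like in a DiT block. This is an illustrative PyTorch toy with hypothetical module and variable names, not the commenter's actual code.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """One transformer block where noisy audio latents attend to
    conditioning embeddings (e.g. lyric tokens plus a CLAP vector)."""

    def __init__(self, dim: int, cond_dim: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(
            dim, n_heads, kdim=cond_dim, vdim=cond_dim, batch_first=True
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (batch, audio_len, dim)      -- noisy audio latents
        # cond: (batch, cond_len, cond_dim)  -- lyric/CLAP embeddings
        attn_out, _ = self.attn(self.norm(x), cond, cond)
        return x + attn_out  # residual connection

# Toy usage: a 1B-scale model stacks many such blocks; here just one.
block = CrossAttentionBlock(dim=64, cond_dim=32)
latents = torch.randn(2, 100, 64)   # fake audio latents
lyrics = torch.randn(2, 24, 32)     # fake conditioning sequence
out = block(latents, lyrics)
print(out.shape)  # torch.Size([2, 100, 64])
```

The key point is that the conditioning sequence length is independent of the audio length; cross-attention lets every latent frame look at every lyric token.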

Xiaoyiyong555 commented 2 months ago

How did you tokenize the lyric words? Did you feed the lyrics into CLAP, or use T5 like an LLM does?
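For context on this question: the two routes differ in what the conditioner emits. CLAP's text branch collapses the whole lyric into a single summary embedding, while a T5-style encoder keeps one embedding per token, which gives cross-attention something to align with. A toy illustration of that contrast, using a made-up vocabulary and random embeddings in place of the real pretrained encoders:

```python
import torch

# Toy contrast: "T5-style" per-token sequence vs "CLAP-style" pooled vector.
lyric = "you're gonna be the one that saves me"
vocab = {w: i for i, w in enumerate(sorted(set(lyric.split())))}
ids = torch.tensor([vocab[w] for w in lyric.split()])

embed = torch.nn.Embedding(len(vocab), 16)
per_token = embed(ids)           # T5-style: (n_tokens, 16), one vector per word
pooled = per_token.mean(dim=0)   # CLAP-style: (16,), one vector for the whole text

print(per_token.shape, pooled.shape)
```

With only a pooled vector, the model knows roughly *what* is sung but has no per-word sequence to attend over, which is one plausible reason pooled-only conditioning struggles with intelligible vocals.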

qiao131 commented 2 months ago

Great try! May I ask approximately how many songs are in the small dataset you mentioned? I've found that the lyrics are not entirely incorrect, but there is an issue with their positioning. Have you added positional information for each phoneme?

GoombaProgrammer commented 2 months ago

Yoooo, this is really cool. I would like to finetune on that! Do you have a repo, or is it your own closed-source thing?

z592694590 commented 2 months ago

I use the RoPE implemented in this repo. Maybe I should use absolute positional embeddings.
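For reference, the absolute positional embedding being considered is the standard fixed sinusoidal one from "Attention Is All You Need". A self-contained sketch (not this repo's code):

```python
import math
import torch

def sinusoidal_positions(seq_len: int, dim: int) -> torch.Tensor:
    """Classic fixed absolute positional embedding (sin on even dims, cos on odd)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))                        # (dim/2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Adding absolute position info to (hypothetical) lyric-token embeddings:
tokens = torch.randn(32, 64)                    # 32 lyric tokens, dim 64
tokens = tokens + sinusoidal_positions(32, 64)
print(tokens.shape)  # torch.Size([32, 64])
```

Unlike RoPE, which rotates query/key pairs and only encodes relative offsets, this injects each token's absolute index directly, which is one reason it is worth trying when lyric *placement* within the song is the failure mode.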

z592694590 commented 2 months ago

I used the code from this repository but made some changes. My code is quite chaotic. Once I resolve the pronunciation issues, I'll consider open-sourcing it. Overall, it's quite similar to Stability AI's repo.

wbs2788 commented 2 months ago

So cooool! I'm really curious about your training dataset format and size. Did you just use txt files, or more detailed info (like timestamps, etc.)? And what dataset size allowed you to achieve such interesting quality? Again, so amazing.

Xiaoyiyong555 commented 2 months ago

I'm very interested in your work. In fact, I'm doing similar work. Can you leave your email address? We can share related work with each other.

z592694590 commented 2 months ago

My dataset is very small, about 100 hours, including several open-source datasets. I used the CLAP condition, the lyric condition, and the other conditions discussed in Stability AI's paper.
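To make the dataset question above concrete: stable-audio-tools attaches conditioning to each example via per-song metadata. The record below is an illustrative guess at what such metadata might contain for lyric-conditioned training; the key names and values are hypothetical, not the commenter's actual schema.

```python
import json

# Hypothetical per-song metadata record for lyric-conditioned training.
# "seconds_start"/"seconds_total" mirror the timing conditions described
# in Stability AI's paper; the rest is illustrative.
record = {
    "path": "songs/example_track.flac",
    "lyrics": "Because maybe \t You're gonna be the one that saves me",
    "prompt": "acoustic rock ballad, male vocals",  # text fed to CLAP
    "seconds_start": 0,
    "seconds_total": 285,
}
print(json.dumps(record, indent=2))
```

At roughly 100 hours, a dataset like this is tiny by text-to-music standards, which is consistent with the demos getting timbre and BGM right while vocal pronunciation lags.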

wbs2788 commented 2 months ago

Thanks!

wujian752 commented 2 months ago

Which version do you use, 1.0 or 2.0?

z592694590 commented 2 months ago

I used version 2.0.

wujian752 commented 2 months ago

Thanks. How many epochs was the model trained for? The result sounds really cool.

z592694590 commented 2 months ago

I didn't train it for very long, about 20-30 epochs. In fact, the demo doesn't sound good enough yet.

GoombaProgrammer commented 2 months ago

I am trying to add lyrics support too, but I don't really understand how to add anything. The codebase is big compared to what I usually work on.

z592694590 commented 2 months ago

I used version 2.0. The diffusion transformer has a cross-attention input.
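For anyone (like the commenter above) trying to figure out where lyric conditioning would plug in: stable-audio-tools wires conditioners to the DiT's cross-attention input through the model's JSON config. The fragment below is a rough sketch of the kind of conditioning section involved; the exact keys and types should be checked against the repo's own example configs rather than taken from here.

```python
import json

# Sketch of a conditioning section for a lyric-conditioned DiT config.
# Structure modeled loosely on stable-audio-tools example configs;
# treat every key here as a guess to verify against the real configs.
conditioning = {
    "configs": [
        {"id": "lyrics", "type": "t5", "config": {"t5_model_name": "t5-base"}},
        {"id": "prompt", "type": "clap_text", "config": {}},
    ],
    "cond_dim": 768,
}
# IDs listed here would be routed into the DiT's cross-attention input.
cross_attention_cond_ids = ["lyrics", "prompt"]

print(json.dumps(conditioning, indent=2))
```

The design point is that each conditioner produces embeddings projected to a shared `cond_dim`, and the model only cross-attends to the conditioner IDs you route to it, so adding lyrics is a config-plus-conditioner change rather than a rewrite of the transformer itself.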

vereon-utb commented 1 month ago

Great job! How do you align text with pronunciation? Traditional singing synthesis requires this alignment step, but it seems that version 2.0 doesn't implement this logic. Is it necessary to implement an alignment module?

wujian752 commented 1 month ago

Did you train it from scratch or finetune it based on the released model?

z592694590 commented 1 month ago

I trained these models from scratch, including CLAP, the autoencoder, and the DiT.

z592694590 commented 1 month ago

I don't have alignment information; I just use the lyric. Maybe I should try it.

GoombaProgrammer commented 17 hours ago

I have got text-to-speech to kinda work with CLAP in cross-attention. No music yet.