Open Xiaoyiyong555 opened 4 months ago
Exactly what I came here for
I am currently installing this and trying it out. If I find a way (or maybe there already is a way) I will tell you
Interesting! Looking forward to your good news!
Looking forward to your experimental results.
Yes. I trained my own model on a small song dataset. The demo is here. The BGM sounds not bad, but the vocals are totally wrong, which confuses me. The lyric is "Because maybe \t You're gonna be the one that saves me \t And after all \t You're my wonderwall \t Because maybe \t You're gonna be the one that saves me \t And after all \t You're my wonderwall" English: https://github.com/user-attachments/assets/3b75632d-1652-4e67-93ce-c71218265a0c
https://github.com/user-attachments/assets/06f7d3fc-b728-4e7b-98f0-6bdef48cd846
Chinese lyric: 明月几时有 \t 把酒问青天 \t 不知天上宫阙 \t 今夕是何年 \t 我欲乘风归去 \t 又恐琼楼玉宇 \t 高处不胜寒 \t 起舞弄清影 \t 何似在人间
https://github.com/user-attachments/assets/e02e81bc-d09b-41f2-b4e9-ea85746703f4
It's so great! Does the model size reach 1B? How did you introduce the lyrics condition into the model?
Yes. The model size is the same as Stability AI's. The lyric condition and CLAP are included via cross-attention.
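For readers following along, a minimal sketch of what "lyric condition and CLAP via cross-attention" could look like in PyTorch. This is not the author's actual code; all module names, dimensions, and the choice to concatenate the CLAP embedding with lyric tokens into one context sequence are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossAttnConditioning(nn.Module):
    """One DiT-style block's cross-attention over lyric tokens + a CLAP embedding (sketch)."""
    def __init__(self, dim=256, n_heads=4, lyric_vocab=256, clap_dim=512):
        super().__init__()
        self.lyric_emb = nn.Embedding(lyric_vocab, dim)   # lyric token embeddings
        self.clap_proj = nn.Linear(clap_dim, dim)         # project CLAP embedding into model dim
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x, lyric_ids, clap_emb):
        # x: (B, T, dim) latent audio tokens; lyric_ids: (B, L); clap_emb: (B, clap_dim)
        ctx = torch.cat([self.clap_proj(clap_emb).unsqueeze(1),
                         self.lyric_emb(lyric_ids)], dim=1)  # (B, 1+L, dim) context
        out, _ = self.attn(query=x, key=ctx, value=ctx)      # latents attend to conditions
        return x + out  # residual connection

x = torch.randn(2, 100, 256)
y = CrossAttnConditioning()(x, torch.randint(0, 256, (2, 32)), torch.randn(2, 512))
print(y.shape)  # torch.Size([2, 100, 256])
```

The output keeps the latent sequence shape, so such a block can be dropped between the self-attention and MLP stages of each transformer layer.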
How did you tokenize the lyric words? Did you input the lyrics into CLAP, or use T5 like an LLM?
Great try! May I ask approximately how many songs are in the small dataset you mentioned? I've found that the lyrics are not entirely incorrect, but there is an issue with their positioning. Have you added positional information for each phoneme?
yoooo this is really cool, I would like to finetune on that! Do you have a repo, or is it your own closed-source thing?
I use the RoPE implemented in this repo. Maybe I should use absolute positional embeddings.
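To make the trade-off concrete, here is a hedged sketch of the two schemes being compared: RoPE rotates query/key feature pairs by a position-dependent angle, while absolute positional embeddings are simply added to the input. This is a generic illustration, not the repo's implementation; the pairing of features and the base frequency are standard but still assumptions here.

```python
import torch

def rope(x, base=10000.0):
    """Rotary positional embedding (sketch). x: (B, T, D) with D even."""
    B, T, D = x.shape
    half = D // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # rotate each (x1, x2) feature pair by its position's angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def abs_pos_emb(x, table):
    """Absolute positional embedding (sketch). table: learnable (max_len, D)."""
    return x + table[: x.shape[1]]

x = torch.randn(1, 8, 16)
print(rope(x).shape)  # same shape as input: torch.Size([1, 8, 16])
```

Note that RoPE leaves position 0 unchanged (zero rotation) and encodes only relative offsets in the attention scores, which is one plausible reason it may behave differently from absolute positions for lyric placement.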
I used the code from this repository but made some changes. My code is quite chaotic. Once I resolve the pronunciation issues, I'll consider open-sourcing it. Overall, it's quite similar to Stability AI's repo.
So cool! I'm really curious about your training dataset format and size. Did you just use txt files, or more detailed info (like timestamps)? And what dataset size let you achieve such interesting quality? Again, so amazing.
I'm very interested in your work. In fact, I'm doing similar work. Could you leave your email address? We could share related work with each other.
My dataset is very small, about 100 hours, including several open-source datasets. I used the CLAP condition, the lyric condition, and the other conditions discussed in Stability AI's paper.
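For anyone assembling a similar dataset, here is one plausible per-song metadata record matching the conventions visible in this thread (a text prompt for the CLAP condition, lyrics with "\t" as the line separator). The field names and layout are illustrative assumptions, not the author's actual format.

```python
import json

# One hypothetical training record; the "\t" characters below are literal tabs,
# mirroring the line-separator convention used in the demos above.
record = {
    "audio": "songs/example_song.flac",               # path to the waveform
    "prompt": "acoustic rock, male vocals, guitar",   # text fed to CLAP
    "lyrics": "Because maybe \t You're gonna be the one that saves me "
              "\t And after all \t You're my wonderwall",
    "duration_sec": 47.0,
}

line = json.dumps(record, ensure_ascii=False)  # e.g. one line of a JSONL manifest
print(json.loads(line)["lyrics"].count("\t"))  # 3 line separators in this snippet
```

A JSONL manifest like this keeps each condition (prompt, lyrics, duration) attached to its audio file without requiring word-level timestamps.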
Thanks!
Which version do you use, 1.0 or 2.0?
I used version 2.0.
Thanks. How many epochs was the model trained for? The result sounds really cool.
I didn't train it for very long, about 20-30 epochs. In fact, the demo doesn't sound good enough.
I am trying to add lyrics support too, but I don't really understand where to add anything. The codebase is big compared to what I usually work on.
I used version 2.0. The diffusion transformer has a cross-attention input.
Great job! How do you align text with pronunciation? Traditional singing synthesis requires this alignment step, but it seems that version 2.0 doesn't implement this logic. Is it necessary to implement an alignment module?
Did you train it from scratch, or finetune it based on the released model?
I trained it from scratch.
I trained these models from scratch, including CLAP, the autoencoder, and the DiT.
I don't have alignment information; I just use the lyrics. Maybe I should try it.
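If alignment were available (e.g. from a forced aligner), one simple way to use it is to expand phoneme timestamps into a frame-level condition sequence the model could attend to. This is a suggestion sketched from scratch, not the author's pipeline; the frame rate, ids, and function names are all assumptions.

```python
def phonemes_to_frames(alignment, frame_rate=50, total_sec=2.0):
    """alignment: list of (phoneme_id, start_sec, end_sec) tuples.
    Returns one phoneme id per latent frame (0 = silence/padding)."""
    n_frames = int(total_sec * frame_rate)
    frames = [0] * n_frames
    for pid, start, end in alignment:
        # stamp this phoneme's id onto every frame it covers
        for t in range(int(start * frame_rate), min(int(end * frame_rate), n_frames)):
            frames[t] = pid
    return frames

# e.g. two phonemes covering 0.0-0.5 s and 0.5-1.0 s at 50 frames/s
frames = phonemes_to_frames([(7, 0.0, 0.5), (12, 0.5, 1.0)])
print(len(frames), frames[10], frames[30], frames[60])  # 100 7 12 0
```

With frame-level ids like these, positioning is given to the model explicitly instead of being left for cross-attention to discover, which is the issue with misplaced lyrics discussed above.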
I've gotten text-to-speech to kind of work with CLAP in cross-attention. Not music yet.
Great work! I'd also like to ask if you've tried using lyrics as input to generate the corresponding singing, similar to the Suno approach. Does the DiT structure support this form of generation?