KdaiP / StableTTS

Next-generation TTS model using flow-matching and DiT, inspired by Stable Diffusion 3
MIT License
348 stars 39 forks

Audio quality at inference? #6

Closed: juntaosun closed this issue 4 weeks ago

juntaosun commented 6 months ago

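Great project. After training I can run inference normally, and the pronunciation is fine. But compared with the training material, the output does not sound very bright or crisp (I have confirmed that this is not a problem with the quality of the training material).

I checked that the sample rate of the training audio matches the config: 44100. How can I improve the audio quality at inference? Thanks again.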
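
For anyone who wants to repeat the sample-rate check on their own data, here is a minimal sketch using only the Python standard library. The function name and the directory argument are illustrative and not part of StableTTS, and the `wave` module only reads PCM WAV files, so other formats would need a different reader.

```python
import wave
from pathlib import Path


def find_sample_rate_mismatches(audio_dir, expected_rate=44100):
    """Return {path: rate} for WAV files whose sample rate differs from expected_rate."""
    mismatches = {}
    for path in sorted(Path(audio_dir).rglob("*.wav")):
        # wave.open reads only the header here; no audio data is decoded.
        with wave.open(str(path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected_rate:
            mismatches[str(path)] = rate
    return mismatches
```

Calling `find_sample_rate_mismatches("path/to/dataset")` returns an empty dict when every file matches the expected rate. A matching sample rate only rules out one cause; it says nothing about the bandwidth actually present in the audio.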

lucasjinreal commented 5 months ago

How is the inference speed?

GavinZhao19 commented 5 months ago

The inference speed is quite good. I tested it on an ordinary T4 GPU: a 32-word sentence comes out as 15 s of speech, and inference takes 250 ms. Very fast.
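
The two figures above (250 ms of compute for 15 s of speech) can be expressed as a real-time factor. This is only a worked calculation; the function name is illustrative, and the comment does not say whether the 250 ms includes the vocoder.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Real-time factor: compute time divided by duration of the generated audio."""
    return synthesis_seconds / audio_seconds


# Figures quoted in the comment: 250 ms to synthesize 15 s of speech on a T4.
rtf = real_time_factor(0.25, 15.0)
print(f"RTF = {rtf:.4f}, about {1 / rtf:.0f}x faster than real time")
# prints: RTF = 0.0167, about 60x faster than real time
```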

juntaosun commented 5 months ago

The inference speed is indeed good. The only problem is the audio quality: although it is 44100, it actually sounds a bit like telephone-quality audio. Is it the same for you?

GavinZhao19 commented 5 months ago

It may also be related to the step setting. The architecture is a diffusion transformer; in my tests 15 steps is better than 5, but more steps makes inference slower. At 15 steps the audio still does sound somewhat telephone-like. You could try a higher step count, or wait for the author to release a better base model.
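
To illustrate the general trade-off between step count and accuracy, here is a toy sketch of fixed-step Euler integration, the simplest way to integrate an ODE over a fixed time interval. Everything in it is an assumption made for illustration: the velocity field is a one-dimensional toy with a known exact solution, not the StableTTS network, and the solver StableTTS actually uses is not described in this thread.

```python
import math


def euler_integrate(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)  # one velocity evaluation per step
    return x


def toy_velocity(x, t):
    # Toy field with a known answer: dx/dt = x gives x(1) = x0 * e.
    return x


errors = {}
for steps in (5, 15, 50):
    approx = euler_integrate(toy_velocity, 1.0, steps)
    errors[steps] = abs(approx - math.e)
    print(f"{steps:>2} steps: error {errors[steps]:.4f}")
```

The error shrinks as the step count grows, while the number of velocity evaluations grows linearly with it. In a real model each evaluation is a full network forward pass, which is why more steps means slower inference. Whether extra steps remove the telephone-like quality reported here is a separate question that this toy cannot answer.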

GavinZhao19 commented 5 months ago

I trained one today as well, and the audio quality at inference is indeed worse. I guess we will have to wait for a better pretrained model.

KdaiP commented 5 months ago

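A new model is already training ╮(╯▽╰)╭ I am also trying other architectures to see whether the audio quality can be improved without increasing the parameter count o(~▽~)d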

GavinZhao19 commented 5 months ago

Right now the inference performance is very good. Maybe you could add some parameters and offer small, medium, and large sizes, then see how much the quality improves at the cost of slower inference.

KdaiP commented 5 months ago

@GavinZhao19 I tried scaling up to 78M parameters. Training it overnight gave noticeably better results than training the 10M-parameter model for 5 days. Training several versions with different parameter counts is definitely an option going forward.

juntaosun commented 5 months ago

Any updates recently?

KdaiP commented 5 months ago

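Two days ago I found that the linear spectrogram transform was written incorrectly, which caused the poor sound. After fixing it, the audio quality improved a lot.

Because the error was in preprocessing, I am currently retraining the acoustic model and the vocoder, which will take roughly another 1-2 weeks.

After the fix, the spectrogram parameters will be exactly the same as the vocoder's.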
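
As a rough way to check that two sets of spectrogram parameters agree, the sketch below diffs two plain dictionaries. The key names and values are hypothetical and chosen only for illustration; they are not taken from the repository's config.py or from the vocoder's config.

```python
def diff_spectrogram_params(acoustic_cfg, vocoder_cfg, keys):
    """Return {key: (acoustic_value, vocoder_value)} for every key that differs."""
    return {
        key: (acoustic_cfg.get(key), vocoder_cfg.get(key))
        for key in keys
        if acoustic_cfg.get(key) != vocoder_cfg.get(key)
    }


# Hypothetical key names and values, for illustration only.
KEYS = ("sample_rate", "n_fft", "hop_length", "win_length", "n_mels")
acoustic = {"sample_rate": 44100, "n_fft": 2048, "hop_length": 512,
            "win_length": 2048, "n_mels": 128}
vocoder = {"sample_rate": 44100, "n_fft": 2048, "hop_length": 512,
           "win_length": 2048, "n_mels": 100}

mismatches = diff_spectrogram_params(acoustic, vocoder, KEYS)
print(mismatches)
# prints: {'n_mels': (128, 100)}
```

A check like this only catches mismatched settings. It would not reveal a bug inside the transform code itself, which is what the comment above describes.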

juntaosun commented 5 months ago

I will test it once you have updated.

xinkez commented 4 months ago

Hi, I compared config.py in the repository root with the corresponding parameters in the vocos_pytorch directory, and I could not see the error you mentioned. Could you explain? Thanks.

juntaosun commented 1 month ago

Is this project still active?

ILG2021 commented 1 month ago

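The approach is very advanced, but the quality still needs improvement. I am looking forward to an open-source TTS that is on par with elevenlabs.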

albluc24 commented 1 month ago

Hi, I saw that a bug was discovered in the linear conversion of the spectrogram. There isn't any PR, commit, or issue explaining it further. I tried comparing the implementations and parameters with the provided link, but I saw nothing amiss. @KdaiP Could you shed some light on this?

juntaosun commented 1 month ago

Will this configuration fix the sound quality issue?

albluc24 commented 1 month ago

TBH I have no idea. I am not a Chinese speaker at all, so I translated what the author said and tried to piece something together. I don't think my fix is what they intended, as it is too shallow. I was planning to use this architecture, but I have limited machine resources ATM, and with no guarantee that the fix works I will probably look at something else. If you are willing to test it and are in a better situation than me, maybe you could try something, even on a small scale?

KdaiP commented 3 weeks ago

After four months of experiments, the model has been updated.

KdaiP commented 3 weeks ago

Hi, the new model has been updated!