KakaruHayate / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Apache License 2.0

ConvFFT #1

Open · KakaruHayate opened this issue 4 weeks ago

KakaruHayate commented 4 weeks ago

See the branch: https://github.com/KakaruHayate/DiffSinger/tree/ConvFFT

Code: https://github.com/KakaruHayate/DiffSinger/blob/ConvFFT/modules/commons/common_layers.py

Based on: Xiaoicesing 2: A High-Fidelity Singing Voice Synthesizer Based on Generative Adversarial Network

Reference implementation: https://github.com/CODEJIN/XiaoiceSing2/blob/master/Modules/Modules_Paper.py#L391


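1. The FastSpeech 2 encoder implementations in DiffSinger and XiaoiceSing already differ: DiffSinger uses pre-norm while XiaoiceSing uses post-norm (and 1102 produced NaN with post-norm). The overall ConvFFT implementation therefore differs from the one described in the paper. Because the Conv Block has a norm at both ends, the difference should be small; whether the sum of the Conv Block and MHSA outputs also needs a norm remains to be tested.

2. In CODEJIN's implementation the Conv Block is connected in series with MHSA, but according to the paper's description it should run in parallel with MHSA. The official implementation does not include this part.

3. The current default parameters are likely to overfit.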
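To make the difference in point 2 concrete, here is a minimal NumPy sketch of the two wirings inside a pre-norm layer (the DiffSinger style from point 1). It is not the code in the ConvFFT branch or in CODEJIN's repository. `attn` and `conv` are placeholder callables standing in for MHSA and the Conv Block, LayerNorm has no learned scale or shift, and reading `conv_block_add_norm` as "normalize the branch sum before the residual add" is an assumption.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize each frame (row) over channels; no learned scale/shift in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def serial_layer(x, attn, conv):
    # Series wiring (as in CODEJIN's code): the Conv Block consumes the MHSA
    # residual output, each sub-layer with its own pre-norm and residual add.
    h = x + attn(layer_norm(x))
    return h + conv(layer_norm(h))


def parallel_layer(x, attn, conv, add_norm=False):
    # Parallel wiring (the paper's description): both branches read the same
    # normalized input, and their outputs are summed before one residual add.
    y = layer_norm(x)
    s = attn(y) + conv(y)
    if add_norm:
        # Assumed meaning of conv_block_add_norm: normalize the branch sum.
        s = layer_norm(s)
    return x + s
```

With a zero Conv Block, both wirings reduce to the same attention-only layer. Once both branches are active they generally give different outputs, because in the series form the Conv Block sees features that attention has already mixed across frames.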

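```yaml
# ConvFFT args:
use_conv_block: true # whether to use ConvFFT
conv_block_kernel_size: 5 # kernel size of the Conv Block convolution; tunable
conv_block_dropout_rate: 0.1 # dropout rate of the Conv Block; tunable
conv_block_dilate: 4 # channel expansion factor of the Conv Block; tunable, probably should be lowered
conv_block_layer: 2 # number of Conv Blocks; tunable (the paper says the encoder needs two)
conv_block_add_norm: false # whether to normalize the sum with the MHSA output
```
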
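The following NumPy sketch shows one possible reading of how these options map onto a Conv Block. The layer layout is an assumption, not the code in `common_layers.py`. It uses LayerNorm at the input, then a convolution over time with kernel size `conv_block_kernel_size` that expands channels by `conv_block_dilate`, then ReLU, a pointwise projection back, dropout, and a closing LayerNorm (matching the "norm at both ends" remark above). `ConvBlock`, `build_conv_blocks` and `num_params` are hypothetical names, and biases are omitted.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Per-frame LayerNorm over channels, without learned scale/shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def conv1d_same(x, w):
    # x: (T, C_in); w: (K, C_in, C_out) with odd K. Zero-pad in time so the
    # output keeps T frames.
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((x.shape[0], w.shape[2]))
    for t in range(x.shape[0]):
        out[t] = np.einsum("kc,kco->o", xp[t:t + k], w)
    return out


class ConvBlock:
    """Hypothetical Conv Block: LN -> conv(K, dim -> dim*dilate) -> ReLU
    -> conv(1, dim*dilate -> dim) -> dropout -> LN. No biases."""

    def __init__(self, dim, kernel_size=5, dilate=4, dropout_rate=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        hidden = dim * dilate
        self.w_expand = self.rng.standard_normal((kernel_size, dim, hidden)) / np.sqrt(kernel_size * dim)
        self.w_project = self.rng.standard_normal((1, hidden, dim)) / np.sqrt(hidden)
        self.dropout_rate = dropout_rate

    def __call__(self, x, training=False):
        h = layer_norm(x)
        h = np.maximum(conv1d_same(h, self.w_expand), 0.0)
        h = conv1d_same(h, self.w_project)
        if training and self.dropout_rate > 0:
            # Inverted dropout: zero a fraction of activations, rescale the rest.
            keep = self.rng.random(h.shape) >= self.dropout_rate
            h = h * keep / (1.0 - self.dropout_rate)
        return layer_norm(h)


def num_params(block):
    # Weight count of the sketch (no biases).
    return block.w_expand.size + block.w_project.size


def build_conv_blocks(config, dim):
    # One block per conv_block_layer; none when use_conv_block is false.
    if not config["use_conv_block"]:
        return []
    return [
        ConvBlock(
            dim,
            kernel_size=config["conv_block_kernel_size"],
            dilate=config["conv_block_dilate"],
            dropout_rate=config["conv_block_dropout_rate"],
            seed=i,
        )
        for i in range(config["conv_block_layer"])
    ]


config = {
    "use_conv_block": True,
    "conv_block_kernel_size": 5,
    "conv_block_dropout_rate": 0.1,
    "conv_block_dilate": 4,
    "conv_block_layer": 2,
}
```

In this sketch both weight tensors grow linearly with `conv_block_dilate`. Lowering it from 4 to 2 therefore halves the per-block weight count (1,572,864 to 786,432 at `dim=256`, `kernel_size=5`). That is one concrete way to read the note that the expansion factor should probably be reduced to curb overfitting.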
KakaruHayate commented 3 weeks ago

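1. Post-norm produces NaN.

2. The original implementation (pre-norm) produces severely slurred pronunciation.

3. The DiffSinger encoder plus the Conv Block seems to improve naturalness; this still needs to be tested.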
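For reference, the two normalization placements discussed in this thread are written out below. $\mathrm{LN}$ is LayerNorm and $F$ is a sub-layer (MHSA, the FFN, or the Conv Block). Post-norm puts LN on the residual path itself, while pre-norm leaves an un-normalized identity path from input to output. Pre-norm is widely reported to train more stably in deep Transformers, which fits the post-norm NaN observation, but that has not been confirmed as its cause here.

```latex
\begin{aligned}
\text{post-norm:}\quad & x_{l+1} = \mathrm{LN}\big(x_l + F(x_l)\big) \\
\text{pre-norm:}\quad  & x_{l+1} = x_l + F\big(\mathrm{LN}(x_l)\big)
\end{aligned}
```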