deepvk / emospeech


PositionWise FeedForward conditioning bug #1

Closed d-cota closed 11 months ago

d-cota commented 11 months ago

Hi,

I may have spotted a bug in the PositionWise FF layer. At the line linked below, the output of the layer normalization is never used, because the `output` variable is immediately overwritten by the transposed self-attention output. In practice, the model never sees the CCA output.

https://github.com/deepvk/emospeech/blob/4a25129eb1635edd27d6d8cf053695e6fdd08563/src/models/acoustic_model/transformer/layers.py#L30
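For clarity, the pattern looks roughly like this. This is an illustrative sketch with assumed names and an assumed signature, not a copy of the repository code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PositionWiseFeedForward(nn.Module):
    """Minimal sketch of a FastSpeech2-style position-wise FF sub-layer."""

    def __init__(self, d_model: int, d_hidden: int, kernel_size: int = 9):
        super().__init__()
        padding = kernel_size // 2
        self.w_1 = nn.Conv1d(d_model, d_hidden, kernel_size, padding=padding)
        self.w_2 = nn.Conv1d(d_hidden, d_model, kernel_size, padding=padding)
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, slf_attn_output: torch.Tensor, cca_output: torch.Tensor) -> torch.Tensor:
        # Both tensors: (batch, time, d_model). Names are hypothetical.
        residual = slf_attn_output

        # Intended path: normalize the CCA output and feed it forward.
        output = self.layer_norm(cca_output)

        # BUG: `output` is immediately overwritten with the transposed
        # self-attention output, so the normalized CCA tensor above is
        # discarded and the feed-forward never sees the CCA output.
        output = slf_attn_output.transpose(1, 2)

        # Position-wise convolutions over the channel dimension.
        output = self.w_2(F.relu(self.w_1(output)))
        output = output.transpose(1, 2)
        return output + residual
```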

dariadiatlova commented 11 months ago

Hi @d-cota,

This mistake occurred while we were preparing the code for the public release and refactoring it for security reasons; fortunately, the internal version did see the CCA output. We're already working on reconciling the two versions and will thoroughly double-check that the results obtained with the fixed public code match those reported in the paper. Additionally, we'll be adding a table with metrics to this issue soon, so please stay tuned!

Thanks for noticing!

d-cota commented 11 months ago

Thanks @dariadiatlova!

dariadiatlova commented 11 months ago

Hi @d-cota,

The bug is fixed now, please find the updated version of the code in the main branch.
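Schematically, the fix keeps the normalized tensor flowing into the feed-forward instead of re-reading the raw self-attention output. In terms of the sketch from the issue description (illustrative names, not a literal diff of the repository):

```python
# Fixed flow: the normalized CCA tensor is what gets transposed and
# passed through the position-wise convolutions.
output = self.layer_norm(cca_output)
output = output.transpose(1, 2)
output = self.w_2(F.relu(self.w_1(output)))
output = output.transpose(1, 2)
return output + residual
```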

We've retrained all the models to compare the results reported with our internal model against the current public version of the code. Each model was trained for 50,000 steps, and for evaluation we selected the checkpoint with the best NISQA TTS score on the validation set. Here are the results:

| Model | Reported NISQA TTS | Reproduced NISQA TTS |
|---|---|---|
| Original | 4.17 ± 0.57 | 4.17 ± 0.53 |
| Reconstructed | 4.11 ± 0.58 | 4.11 ± 0.55 |
| Baseline | 3.77 ± 0.74 | 3.85 ± 0.67 |
| Model # 1 | 3.71 ± 0.76 | 3.86 ± 0.67 |
| Model # 2 | 3.93 ± 0.66 | 3.92 ± 0.62 |
| Model # 3 | 3.95 ± 0.66 | 3.89 ± 0.64 |
| EmoSpeech | 4.1 ± 0.58 | 4.09 ± 0.57 |

While the reported and reproduced metrics do not align precisely, the differences remain within the standard deviation boundaries. Note that in the reported metrics, each modification increased NISQA TTS, except for the transition from the Baseline to Model # 1. In the reproduced setup, we observed the same overall trend, with the only exception being that NISQA TTS drops when transitioning from Model # 2 to Model # 3, which is expected.

As we mention in the paper, adding CCA along with the eGeMAPS Predictor and CLN increases the emotionality of the audio, but it also causes quality degradation and voice artifacts, which we address by introducing adversarial training to smooth them out. NISQA TTS primarily assesses the naturalness of speech, with a strong emphasis on detecting voice artifacts, sometimes at the expense of evaluating emotional expressiveness.

To ensure precise reproducibility, we will update the metrics in the next version of the paper. We appreciate your input in identifying this problem. Thank you!