Closed steven850 closed 1 year ago
As training progresses, the model's synthesis ability will improve and the synthesized mels will suffer less from over-smoothness. 100k steps (batch size 64) is still an early stage of training.
And yes, this data_utils.py
(which follows the logic of my uncleaned code) uses much shorter segments; that is why I wrote the old, problematic data_utils.py
(which concatenates short segments so that a batch contains more speech data). However, several days ago I found that this concatenation is harmful to speech quality, even though it converges faster (reduces the over-smoothness problem sooner) because each batch contains more data.
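For clarity, here is a minimal sketch of what that concatenation step could look like. This is a hypothetical illustration, not the repo's actual code: `concat_short_segments` and `max_len` are made-up names, and segments are represented as plain lists of frames.

```python
def concat_short_segments(segments, max_len):
    # Hypothetical sketch of the old data_utils.py behavior:
    # greedily join short mel segments until each packed item
    # approaches max_len frames, so a batch carries more speech
    # data per training step.
    packed, current = [], []
    for seg in segments:
        # Start a new packed item once adding this segment would
        # exceed the frame budget.
        if current and len(current) + len(seg) > max_len:
            packed.append(current)
            current = []
        current = current + seg
    if current:
        packed.append(current)
    return packed

# Four short segments of 30, 50, 40, and 20 frames packed into
# items of at most 100 frames each.
segments = [[0] * 30, [0] * 50, [0] * 40, [0] * 20]
packed = concat_short_segments(segments, max_len=100)
print([len(x) for x in packed])  # -> [80, 60]
```

The trade-off described above follows from this packing: each batch item carries more frames per step (faster convergence), but the joined segments introduce artificial boundaries inside one training example, which can hurt speech quality.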
Since you made the changes to data_utils.py the other day, I have noticed that it seems to be training on much shorter segments.
The eval audios are also all under one second long, most of them just 40 frames. I am also noticing obvious gaps/blurs in the generated/GT mels. You can see the "blurry spots" in these generated mels, where the formants simply disappear.
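As a quick sanity check on the "40 frames, under one second" observation: with a common mel configuration (22050 Hz sample rate, hop size 256 — assumptions here, not values confirmed from this repo), 40 frames do indeed cover well under one second.

```python
# Assumed mel parameters (not taken from this repo's config):
sample_rate = 22050  # Hz
hop_size = 256       # samples per mel frame
frames = 40

# Each frame advances by hop_size samples, so duration is
# frames * hop_size / sample_rate seconds.
duration_s = frames * hop_size / sample_rate
print(round(duration_s, 3))  # -> 0.464
```

So under these assumptions a 40-frame eval clip is roughly half a second of audio, consistent with the segments getting much shorter after the data_utils.py change.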