How to check speaker disentanglement during training?

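What I have done: during testing, I deliberately replace the speaker embedding with an all-zero vector, then look at both the output as an image and the loss (MSE; I assume a higher value is better, since it would mean the output depends on the speaker embedding).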

As a result, I clearly observe a large MSE (around 33) after a few days of training. However, when I run actual voice conversion (from one speaker to another), the model only reconstructs the source speech; no voice conversion takes place.
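To make the two checks above concrete, here is a minimal sketch of how I compare them. It is not the SpeechSplit code: `generator(mel, emb)` is a hypothetical stand-in for the model's forward pass, and the names `embedding_sensitivity`, `src_emb`, and `tgt_emb` are my own assumptions. NumPy arrays are used only to keep the example self-contained.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two equally shaped arrays."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def embedding_sensitivity(generator, mel, src_emb, tgt_emb):
    """Compare outputs under the true, zeroed, and swapped speaker embedding.

    `generator(mel, emb)` is a hypothetical callable standing in for the
    model; it is assumed to return an array with the same shape as `mel`.
    """
    recon = generator(mel, src_emb)                   # normal reconstruction
    zeroed = generator(mel, np.zeros_like(src_emb))   # all-zero embedding test
    converted = generator(mel, tgt_emb)               # conversion attempt
    return {
        "recon_mse": mse(recon, mel),
        "zero_emb_mse": mse(zeroed, mel),
        # Near zero here means the output barely reacts to the target speaker.
        "swap_vs_recon_mse": mse(converted, recon),
    }

if __name__ == "__main__":
    # Toy generator that ignores the embedding entirely (assumption for demo).
    toy = lambda mel, emb: mel.copy()
    mel = np.ones((4, 80))
    print(embedding_sensitivity(toy, mel, np.ones(8), -np.ones(8)))
```

In my case `zero_emb_mse` is large while something like `swap_vs_recon_mse` appears to stay small, which may mean an all-zero vector is simply an input the model never saw during training rather than evidence of disentanglement.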

If possible, I would really appreciate knowing whether there are other ways to test voice conversion during training.

Many thanks.

auspicious3000 / SpeechSplit