andabi / voice-vector

Deep neural networks for text-independent speaker embeddings, written in TensorFlow

Text-independent voice vectors

Subtitle: which Hollywood star's voice is most similar to mine?

Prologue

Everyone has a unique voice. No two people sound exactly alike, though some voices are more similar than others. This project aims to learn an individual voice vector for each speaker using the VoxCeleb dataset, which contains 145,379 utterances from 1,251 Hollywood stars. The voice vectors are text-independent: any pair of utterances from the same speaker yields similar vectors, regardless of what is said. The closer two vectors are, the more similar the voices.
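For intuition, here is a minimal sketch (not part of this repo) of comparing two speaker embeddings with cosine similarity. The vectors are toy values standing in for real model output; higher similarity means more similar voices.

```python
import numpy as np

def cosine_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; higher means more similar voices."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

# Toy embeddings: two utterances from the same speaker should score
# higher than utterances from different speakers.
same_speaker = cosine_similarity(np.array([0.2, 0.9, 0.1]),
                                 np.array([0.25, 0.85, 0.05]))
diff_speaker = cosine_similarity(np.array([0.2, 0.9, 0.1]),
                                 np.array([0.9, 0.1, 0.4]))
print(same_speaker > diff_speaker)  # True for these toy vectors
```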

Architectures

The architecture is based on a classification model: the input utterance is classified as one of the Hollywood stars. The objective function is simply the cross entropy between the ground-truth speaker labels and the predictions. The activation of the last layer then serves as the speaker embedding.

The model architecture is structured as follows (see the sketch after this list).

  1. memory cell
    • A CBHG module from Tacotron captures hidden features from the sequential input.
  2. embedding
    • The memory cell's last output is projected to the size of the embedding vector.
  3. softmax
    • The embedding serves as the logits for each class.
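A simplified sketch of these three stages in tf.keras is below. It is illustrative only: a plain GRU stands in for the CBHG module, and the feature size and layer widths are assumptions, not the repo's settings. Following the list above, the embedding itself doubles as the per-class logits, so its size equals the number of speakers.

```python
import tensorflow as tf

NUM_SPEAKERS = 1251   # VoxCeleb speaker count
NUM_MELS = 40         # assumed per-frame feature size, not from this README

inputs = tf.keras.Input(shape=(None, NUM_MELS))  # variable-length (time, mel) frames

# 1. memory cell: a plain GRU as a stand-in for the CBHG module;
#    keeping only the last output of the sequence.
hidden = tf.keras.layers.GRU(256)(inputs)

# 2. embedding: project the memory cell's last output to the embedding size.
# 3. softmax: the embedding serves as the logits for each class,
#    so the embedding size equals the number of speakers.
embedding = tf.keras.layers.Dense(NUM_SPEAKERS, name="embedding")(hidden)

model = tf.keras.Model(inputs, embedding)
model.compile(
    optimizer="adam",
    # cross entropy between ground-truth speaker labels and predictions
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```

After training, the same graph yields speaker embeddings directly: running an utterance through `model` produces the embedding vector, which can then be compared with the cosine-similarity sketch above.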

Training

Embedding

How to run?

Requirements

Future works

References