flrngel / understanding-ai


Neural Voice Cloning with a Few Samples #8

Open flrngel opened 6 years ago

flrngel commented 6 years ago

https://arxiv.org/abs/1802.06006 Paper from Baidu Research

Abstract

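The paper studies two approaches:

Speaker adaptation
- fine-tuning a multi-speaker generative model with a few cloning samples
Speaker encoding
- training a separate model to infer a new speaker embedding directly from cloning audio, which is then used with a multi-speaker generative model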

1. Introduction

Text carries linguistic information
Speaker representation captures speaker's characteristics (pitch, speech rate, accent)
This paper focuses on voice cloning
Compares the two approaches on speech naturalness, speaker similarity, cloning/inference time, and model footprint

2. Voice Cloning

Paper Notations

f: multi-speaker generative model
g: speaker encoding function
t: text
s: speaker
a: audio
S: speaker set
A: audio set

2.1. Speaker adaptation

Speaker adaptation function
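
To make the idea concrete, here is a minimal sketch of embedding-only adaptation: the shared weights W stay frozen and only the new speaker's embedding is fitted to a few (text, audio) pairs. The toy model `generate`, the squared-error loss, and the names `adapt_embedding`, `lr`, and `steps` are all assumptions for illustration; they are not the paper's model or training setup.

```python
def generate(text, embedding, weights):
    """Toy stand-in for f(t, s; W, e_s): one output value per embedding dimension."""
    return [w * text + e for w, e in zip(weights, embedding)]


def adapt_embedding(samples, weights, dim, lr=0.1, steps=200):
    """Fit only the speaker embedding by gradient descent; weights are never updated."""
    embedding = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for text, audio in samples:
            pred = generate(text, embedding, weights)
            for i in range(dim):
                # Gradient of the mean squared error with respect to embedding[i].
                grad[i] += 2.0 * (pred[i] - audio[i]) / len(samples)
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding


# Hypothetical data: a few cloning samples produced by an unseen speaker.
frozen_weights = [0.5, -1.0]
unseen_embedding = [0.3, -0.7]
cloning_samples = [(t, generate(t, unseen_embedding, frozen_weights)) for t in (0.0, 1.0, 2.0)]
learned_embedding = adapt_embedding(cloning_samples, frozen_weights, dim=2)
```

Whole-model adaptation would update `weights` in the same loop as well, which gives more capacity at the cost of a larger per-speaker footprint.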

2.2. Speaker encoding

Speaker encoding function

The paper avoids mode collapse by training the speaker encoder separately

Loss function (L1)
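
As a rough illustration of this loss, the sketch below compares an embedding predicted by the speaker encoder with the target embedding taken from the trained multi-speaker model. The function name `l1_embedding_loss`, the choice to sum rather than average over dimensions, and the example vectors are assumptions.

```python
def l1_embedding_loss(predicted, target):
    """Sum of absolute differences between predicted and target embeddings."""
    if len(predicted) != len(target):
        raise ValueError("embeddings must have the same dimension")
    return sum(abs(p - t) for p, t in zip(predicted, target))


# Hypothetical 3-dimensional embeddings.
loss = l1_embedding_loss([0.1, -0.2, 0.5], [0.0, 0.2, 0.5])
```

Because the targets are fixed in advance, the encoder can be trained as a plain regression problem, separately from the generative model.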

Architecture

Spectral processing
Temporal processing
Cloning sample attention
- uses multi-head self-attention from Transformer
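
The attention step can be sketched as follows: each cloning sample's vector attends to all the others, and the attended vectors are pooled into one speaker embedding. This is a simplified single-head version with identity projections and mean pooling, all of which are assumptions; the paper uses multi-head attention with learned parameters, and the names `self_attention` and `aggregate` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(vectors):
    """Single-head scaled dot-product self-attention with identity projections."""
    scale = math.sqrt(len(vectors[0]))
    attended = []
    for query in vectors:
        weights = softmax([dot(query, key) / scale for key in vectors])
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(len(query))]
        )
    return attended


def aggregate(vectors):
    """Mean-pool the attended vectors into one embedding for the speaker."""
    attended = self_attention(vectors)
    return [sum(v[i] for v in attended) / len(attended) for i in range(len(attended[0]))]


# Hypothetical per-sample vectors from three cloning audios.
speaker_embedding = aggregate([[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]])
```

The point of weighting is that some cloning samples carry more speaker information than others, so a plain average is not necessarily the best aggregate.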

2.3. Discriminative models for evaluation

Because human evaluation is expensive, the paper proposes two discriminative models for evaluation

2.3.1. Speaker Classification

Adds an extra embedding layer before the softmax function in the overall architecture

2.3.2. Speaker Verification

Binary classification of whether the test audio and the enrolled audio are from the same speaker
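
A minimal sketch of such a decision, assuming fixed-size embeddings are already available for each audio clip: the enrolled embeddings are averaged and compared to the test embedding by cosine similarity against a threshold. This is a swapped-in simplification, not the paper's verification model; `same_speaker` and the `threshold` value of 0.7 are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def same_speaker(test_embedding, enrolled_embeddings, threshold=0.7):
    """Return True if the test audio is judged to match the enrolled speaker."""
    count = len(enrolled_embeddings)
    dim = len(enrolled_embeddings[0])
    # Average the enrolled embeddings into a single reference vector.
    centroid = [sum(e[i] for e in enrolled_embeddings) / count for i in range(dim)]
    return cosine_similarity(test_embedding, centroid) >= threshold


# Hypothetical embeddings: two enrolled clips and two test clips.
enrolled = [[0.9, 0.1], [1.0, -0.1]]
accepted = same_speaker([1.0, 0.0], enrolled)
rejected = same_speaker([0.0, 1.0], enrolled)
```

Sweeping the threshold trades false accepts against false rejects, which is how a single summary error rate can be read off a verification model.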

3. Experiments

3.1. Datasets

LibriSpeech is used to train the multi-speaker generative model and the speaker encoder model
Voice cloning is performed on samples from VCTK