CSTR-Edinburgh / merlin

This is now the official location of the Merlin project.
http://www.cstr.ed.ac.uk/projects/merlin/
Apache License 2.0

Acoustic Decomposition #97

Open rmmal opened 7 years ago

rmmal commented 7 years ago

Can someone explain what exactly happens in the acoustic decomposition? As an example: I now have 180 MGC values (60 static, 60 delta, 60 acceleration). What exactly do I do to reduce them to just 60? And if I took only the first 60 values, would that make much of a difference? Also, where is the V/UV used?

Thanks in advance

seblemaguer commented 7 years ago

If you take only the first 60 values, you will keep only the static coefficients and the rendering is not going to be good.

Normally, when you get the coefficients, you should apply the MLPG generation algorithm. You can have a look at this file: https://github.com/CSTR-Edinburgh/merlin/blob/master/src/frontend/parameter_generation.py

Then you get the "final" mgc/lf0/bap. You use SPTK and the vocoder of your choice (STRAIGHT or WORLD, depending on what you used to extract your coefficients for training) to render the signal from them.
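To make the 180 → 60 step concrete, here is a minimal dense-matrix sketch of MLPG for a single coefficient dimension. The window coefficients ([-0.5, 0, 0.5] for delta, [1, -2, 1] for acceleration), the simple edge clamp, and the diagonal-covariance assumption are my own assumptions for illustration; Merlin's actual implementation in parameter_generation.py uses the bandmat package to exploit the banded structure, so treat this as a sketch of the maths rather than a drop-in replacement.

```python
# Minimal dense MLPG sketch (NumPy only), for ONE coefficient dimension.
import numpy as np

def mlpg(means, variances, windows=((0.0, 1.0, 0.0),    # static
                                     (-0.5, 0.0, 0.5),   # delta (assumed)
                                     (1.0, -2.0, 1.0))): # acceleration (assumed)
    """means, variances: (T, 3) arrays holding static/delta/acc predictions for
    one coefficient dimension. Returns the (T,) smoothed static trajectory."""
    T = means.shape[0]
    n_win = len(windows)
    # Build the (n_win*T, T) window matrix W mapping a static trajectory
    # to its stacked static/delta/acc observations (edges simply clamped).
    W = np.zeros((n_win * T, T))
    for w_idx, win in enumerate(windows):
        for t in range(T):
            for tap, coef in zip((-1, 0, 1), win):
                tt = min(max(t + tap, 0), T - 1)
                W[w_idx * T + t, tt] += coef
    # Stack means and precisions in the same window-major order.
    mu = means.T.reshape(-1)               # (3T,)
    prec = 1.0 / variances.T.reshape(-1)   # diagonal of Sigma^-1
    # Solve (W' P W) c = W' P mu for the static trajectory c.
    WtP = W.T * prec
    return np.linalg.solve(WtP @ W, WtP @ mu)
```

For a 180-dimensional MGC stream laid out as in the question (60 static, 60 delta, 60 acc), you would call this once per static dimension d, passing means[:, [d, d+60, d+120]] and the matching variances.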

bajibabu commented 7 years ago

You can use just the first 60 values for both training and testing, as in this paper: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43266.pdf; the final output layer should be an LSTM layer to get decent-quality speech.

And V/UV values are used to make voiced/unvoiced decisions for the F0 (fundamental frequency) values. In training, the F0 values are interpolated through unvoiced regions, so at synthesis time we need to revert that based on the V/UV values.
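A small sketch of that reverting step (an illustration of the idea above, not Merlin's exact code): the generated log-F0 is interpolated everywhere, and the V/UV stream is thresholded to mark unvoiced frames again before vocoding. The -1.0e+10 unvoiced marker follows the usual HTS/Merlin lf0 convention and is assumed here.

```python
import numpy as np

def apply_vuv(lf0, vuv, threshold=0.5, unvoiced_value=-1.0e10):
    """lf0: (T,) generated (interpolated) log-F0; vuv: (T,) values in [0, 1].
    Frames whose V/UV probability falls below the threshold are set back to
    the conventional 'unvoiced' value (assumption: your vocoder expects it)."""
    lf0 = np.asarray(lf0, dtype=np.float32).copy()
    lf0[np.asarray(vuv) < threshold] = unvoiced_value
    return lf0
```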

m-toman commented 7 years ago

Hi all.

I guess a final LSTM layer would be more performant at synthesis time than MLPG? Also, MLPG cannot be run on a single feature vector, right?

I've tried to compare the bandmat implementation in merlin with the MLPG implementations of flite and hts_engine - by chance, does anyone know the crucial differences? I'm trying to come up with a C/C++ implementation that works with the merlin features and variances. I assume the output features of the DNN are considered to be the mean values of a Gaussian, and the variances stored in the vars folder are... one variance per feature dimension?

bajibabu commented 7 years ago

> I guess a final LSTM layer would be more performant at synthesis time than MLPG?

No clear answer. It depends a lot on the duration of the utterances.

> Also, MLPG cannot be run on a single feature vector, right?

True.

> I've tried to compare the bandmat implementation in merlin with the MLPG implementations of flite and hts_engine - by chance, does anyone know the crucial differences?

I would like to know your observations; I haven't tried it myself.

> I assume the output features of the DNN are considered to be the mean values of a Gaussian, and the variances stored in the vars folder are... one variance per feature dimension?

Are you proposing something similar to this https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42020.pdf ?
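Regarding the vars folder question: as far as I understand, Merlin keeps one global variance per output dimension of each stream, computed over the training data and stored as raw little-endian float32. A quick way to check this against your own setup (the file name, path and dimension below are hypothetical examples, not guaranteed by Merlin):

```python
import numpy as np

def load_var(path, dim):
    # Read a raw float32 vector of global variances, one value per dimension.
    v = np.fromfile(path, dtype=np.float32)
    assert v.size == dim, "unexpected size: %d (expected %d)" % (v.size, dim)
    return v

mgc_var = load_var("var/mgc_180", 180)  # hypothetical path and dimension
print(mgc_var[:5])
```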

m-toman commented 7 years ago

I'm comparing the LSTM variants with a regular feedforward DNN. I've exported the weights from Theano to my own protobuf format and loaded them into a tiny-DNN (https://github.com/tiny-dnn/tiny-dnn) network. After some tinkering I got it to work and produce acoustic features that yield the desired speech when re-exported to merlin. Currently I'm transforming only the static features for WORLD and have started to look at the MLPG algorithm in Merlin...
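For anyone attempting a similar export, a very rough sketch of the idea, under stated assumptions: the per-layer weights have already been pulled out of the trained Theano model as NumPy arrays (e.g. via each shared variable's get_value()), and the flat binary layout below is made up purely for illustration; the protobuf format mentioned above is m-toman's own.

```python
import struct

def export_layers(layers, path):
    """layers: list of (W, b) NumPy float32 arrays, one pair per layer.
    Writes: layer count, then per layer the shape of W followed by W and b."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(layers)))
        for W, b in layers:
            rows, cols = W.shape
            f.write(struct.pack("<II", rows, cols))
            f.write(W.astype("<f4").tobytes())
            f.write(b.astype("<f4").tobytes())
```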

So my observations so far: tiny-DNN is really easy to integrate and runs on most platforms in 32 and 64 bit. But I fear it will be too slow for live synthesis. And it doesn't support RNNs/LSTMs yet. Adding MLPG now will make it even slower. There might be a few options to work around that, but... yeah.

I also ran the network with TensorFlow in Python and it seemed much faster. But TensorFlow is quite a beast to get into e.g. a SAPI DLL on Windows ;). Also, it typically comes only with 64-bit build files, while a lot of SAPI applications are 32-bit only. Integration on iOS and Android seems easy, on the other hand.

With the LSTM you could probably run it frame by frame and push each frame through the vocoder, doing a sort of streaming synthesis.