CSTR-Edinburgh / merlin

This is now the official location of the Merlin project.
http://www.cstr.ed.ac.uk/projects/merlin/
Apache License 2.0

Improve Synthesis Time By Multiprocessing #347

Open chazo1994 opened 6 years ago

chazo1994 commented 6 years ago

In my speech synthesis system built with the Merlin toolkit, it takes a long time to generate speech from text. Most of the time is spent in the WORLD vocoder and the DNN generation module. To reduce the delay, I propose the following idea: in the DNN decoding step (using Theano), generate the acoustic features frame by frame (for online streaming) and push the features of each frame to the vocoder to generate the speech signal for that frame, doing this in parallel on multiple cores. Finally, concatenate the pieces into the speech signal for the entire input text. In short, the idea is to integrate the WORLD vocoder and the acoustic DNN generation module so that frames are processed separately across multiple cores/processes. But how can I generate the features of each frame separately with Theano?

Is my idea feasible? How can I do it with Theano and the WORLD vocoder, and how can I generate the features of each frame separately with Theano? Can anyone suggest a way to implement this idea, or another way to improve the synthesis time?
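As a very rough illustration of the proposed parallelism, here is a minimal sketch using Python's multiprocessing; `synthesize_frame` is a hypothetical stand-in for the per-frame DNN-plus-vocoder work, not an actual Merlin or WORLD call, and the feature dimension is only an example.

```python
# Toy sketch of the proposed per-frame parallelism using Python's multiprocessing.
# `synthesize_frame` is a hypothetical stand-in for the real per-frame work
# (DNN forward pass + vocoder); it is NOT an existing Merlin or WORLD call.
import numpy as np
from multiprocessing import Pool

def synthesize_frame(frame_features):
    """Placeholder for 'acoustic features of one frame -> waveform samples'."""
    return np.tanh(frame_features).astype(np.float32)  # dummy DSP only

if __name__ == "__main__":
    # e.g. 500 frames of 187-dimensional acoustic features (dimensions are an example)
    frames = [np.random.randn(187) for _ in range(500)]
    with Pool(processes=4) as pool:
        chunks = pool.map(synthesize_frame, frames)     # frames processed in parallel
    waveform = np.concatenate(chunks)                   # concatenated in original order
```

Whether this actually helps depends on how expensive the per-frame work is compared to the inter-process overhead, which is part of what the discussion below addresses.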

trunglebka commented 6 years ago

I have the same problem.

maituan commented 6 years ago

This is exactly what I have been thinking about for about a month. I would greatly appreciate it if someone could give some suggestions.

m-toman commented 6 years ago

Are you using feedforward or recurrent networks?

A few things to consider: Theano, when running on the CPU, typically parallelizes the matrix multiplications (http://deeplearning.net/software/theano/tutorial/multi_cores.html).
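For reference, the multi-core behaviour mentioned above is controlled through OpenMP/BLAS settings and Theano flags, roughly as in the linked tutorial; the thread count below is just an example value, not a recommendation.

```python
# Example of the settings described in the Theano multi-core tutorial.
import os
os.environ["OMP_NUM_THREADS"] = "4"                               # OpenMP/BLAS threads (example)
os.environ["THEANO_FLAGS"] = "device=cpu,floatX=float32,openmp=True"

import theano                                                     # import after setting the flags
import theano.tensor as T

x = T.matrix("x")
w = T.matrix("w")
dot = theano.function([x, w], T.dot(x, w))                        # this matmul can use several cores
```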

You can run MLPG as well as WORLD in chunks and stream the audio back, but you have to be a bit careful.

Writing and reading intermediate files is costly.

The MGC-to-spectrum function of SPTK is not optimized for this scenario. You can find a couple of optimizations for SPTK in my blog posts here: http://www.neuratec.com/blog/?p=130 http://www.neuratec.com/blog/?p=99

chazo1994 commented 6 years ago

@m-toman I use a feedforward neural network. I will consider your suggestions carefully, thank you so much. I'm not sure whether converting the MGC of each chunk to a spectrum separately is correct; will it still generate the right spectrum for the speech? And following your suggestions, my idea (online streaming) can be implemented, right?

felipeespic commented 6 years ago

The WORLD vocoder and SPTK are not optimised for streaming. You may end up allocating and deallocating a lot of memory for every frame you process.

m-toman commented 6 years ago

WORLD has a streaming method implemented, but it's not really documented. You can find an example here: https://github.com/mmorise/World/blob/master/test/test.cpp#L428-L438 But it's true this won't help you very much when you just call it from the command line.

Similarly with SPTK: it's quite a pain to fix all the memory leaks, because it assumes it is run from the command line and then quits again (for example, a lot of memory for the windows is allocated, only pointers into the middle of the block are moved around, and the memory is never freed).

So what I do is:

- Generate durations for 1-5 phones.
- Generate acoustic features according to the durations.
- For this chunk, transform the features (mgc2sp etc.), run MLPG and then "realtime" WORLD.
- Stream back the waveform.
- Repeat.

That's a full C++ implementation, but perhaps you can find some middle-ground.
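A rough Python sketch of that loop, for readers who want to approximate it inside Merlin's scripts rather than in C++; every processing step is injected as a callable because the real implementations live in Merlin, SPTK and WORLD, so none of the parameter names below are existing APIs.

```python
# Sketch of the chunked streaming loop described above. All processing steps
# are passed in as callables; none of these names correspond to an existing
# Merlin/SPTK/WORLD function.
from typing import Callable, Iterable, Iterator, Sequence

def stream_synthesis(
    label_chunks: Iterable[Sequence[str]],   # HTS labels split into chunks of 1-5 phones
    dnn_forward: Callable,                   # labels -> raw acoustic features
    transform_and_mlpg: Callable,            # raw features -> (f0, spectrum, aperiodicity)
    vocode: Callable,                        # (f0, spectrum, aperiodicity) -> waveform samples
) -> Iterator:
    """Yield waveform pieces as soon as each chunk of phones has been processed."""
    for chunk in label_chunks:
        feats = dnn_forward(chunk)
        f0, sp, ap = transform_and_mlpg(feats)
        yield vocode(f0, sp, ap)             # stream this piece back to the caller
```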

chazo1994 commented 6 years ago

@m-toman I integrated Merlin, SPTK and WORLD into one program and transfer variables between them directly; I don't write any files. I also fixed the memory leaks in SPTK.

As I understand it, your solution streams 1-5 phones per processing step, right? You generate speech chunks corresponding to 1-5 phones and concatenate them (or stream the waveform back directly). But I'm not sure the concatenated speech will have good voice quality, especially at the junctures between chunks.

My idea is to generate the acoustic features frame by frame (in the acoustic generation phase) and then, for each frame, "transform features (mgc2sp etc.), MLPG and then 'realtime' WORLD, stream back the waveform". It is based on the fact that a trained feedforward DNN maps input features to output features frame by frame, following the statistical parametric speech synthesis papers. I want to improve the time performance by generating the speech signal of each frame corresponding to each DNN input frame and doing this in parallel (multicore). My idea is shown in the attached figure (parallel_tts).

m-toman commented 6 years ago

Well, I simplified the description a bit: I don't treat the 1-5 phones as a full utterance. I take the complete utterance and generate the HTS labels for it, then chunk them. So all contextual features (number of words in the utterance etc.) are correct. This also means the DNN input features are the same, chunked or not, and it doesn't matter whether you feed in an n×m matrix or a 5×m matrix at a time.

The steps that follow are more crucial and trickier to parallelize. MLPG is tricky, but you can run MLPG for MGC, F0 and BAP in parallel. The same principle applies here: you reset the MLPG parameters only when a new utterance begins (this is tricky with the Merlin implementation, but you can rather easily "pause" the SPTK implementation, return the results and then "resume" when the next features come in).

Converting the features can easily be parallelized. For WORLD itself, though, parallelization is probably tricky again.
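To illustrate the per-stream parallelism (MGC, LF0 and BAP), here is a small sketch; `run_mlpg` is only a placeholder with an invented signature, not Merlin's or SPTK's real MLPG routine, and only the parallelization pattern is the point.

```python
# Run MLPG for the MGC, LF0 and BAP streams concurrently. `run_mlpg` is a
# placeholder with an invented signature; only the parallelization pattern
# matters in this sketch.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def run_mlpg(features: np.ndarray) -> np.ndarray:
    """Placeholder: a real implementation would do ML parameter generation."""
    return features[:, : features.shape[1] // 3]   # pretend to keep only the static part

def mlpg_all_streams(streams: dict) -> dict:
    """streams maps a stream name ('mgc', 'lf0', 'bap') to its feature matrix."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {name: pool.submit(run_mlpg, feats) for name, feats in streams.items()}
        return {name: fut.result() for name, fut in futures.items()}
```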

But to be honest, I don't need that at all. A test sentence "This is a test" on my notebook requires:

- 50 ms for the DNN (a smaller DNN, though, so it runs on older devices and mobile)
- 24 ms BAP MLPG
- 15 ms LF0 MLPG
- 106 ms MGC MLPG
- 29 ms feature transformation
- 170 ms WORLD (48 kHz)

In total (with stuff in-between) 450 ms for a 1.2 second utterance. Actually I wonder if with an optimized DNN library (that's my own implementation of the forward pass) a BLSTM layer might even be faster than MLPG.

chazo1994 commented 6 years ago

@m-toman Actually, I already chunked the label file (after finishing the generation of the HTS labels) and generated speech for each chunk, without your trick for MLPG. But after concatenating the waveforms, I get bad voice quality, with noise at the junctures between speech chunks (the concatenation is not smooth). I'm not sure your way is better, because functions like the postfilter, the vocoder, etc. are nonlinear, so concatenating chunks of sound is still a big problem; the result may be distorted.

m-toman commented 6 years ago

Yeah, I don't run chunks in parallel; that way the quality is basically the same, because the last features of the previous chunk are available when I start processing the next one. I just wanted to state that I do a bit of parallel processing inside each chunk (so it works better when the chunks are a bit larger) and that, in general, this is fast enough on most mobile devices for live synthesis.

felipeespic commented 6 years ago

> WORLD has a streaming method implemented, but it's not really documented. You can find an example here: https://github.com/mmorise/World/blob/master/test/test.cpp#L428-L438 But it's true this won't help you very much when you just call it from the command line.

@m-toman Cool, I didn't know that WORLD had streaming mode implemented.

megazone87 commented 5 years ago

Hi @m-toman. I am facing the streaming issue and am trying to adopt the chunking method you mentioned above. However, I wonder how many features I should keep from the last chunk for the next chunk, so that the resulting speech sounds seamless. It would also be greatly appreciated if you could share your implementation.

npuichigo commented 5 years ago

@chazo1994 By the way, can you tell me which file has the memory leak problem in SPTK?