MycroftAI / mimic1

Mycroft's TTS engine, based on CMU's Flite (Festival Lite)
https://mimic.mycroft.ai

Compilation issue: Alan's voice introduces long compile times and failures on some machines. #61

Open aatchison opened 8 years ago

aatchison commented 8 years ago

I'm not sure what is going on here because some machines have no problem compiling, while others do. It seems to be due, in part, to heavy resource usage. I will return with error snippets as soon as I can.

zeehio commented 8 years ago

The large files in the Alan voice can each use about 300 MB of RAM while being compiled.

If make -j 4 or similar is used on a device with 1 GB of RAM, there is a chance that several of those files are compiled simultaneously, causing an out-of-memory situation.

It seems that the main benefit of embedding a voice is a shorter load time. Once @forslund has his pymimic module we will only need to load the voice once at startup, so the main advantage of embedding will disappear. We can then switch to not embedding by default.

forslund commented 8 years ago

@zeehio, no pressure then :)

zeehio commented 8 years ago

A possible workaround until we have a better solution:

  1. Disable the Alan voice at compile time: ./configure --disable-vid_gb_ap
  2. Copy the Mycroft voice file from the voices directory
  3. Use mimic with the voice from a file: mimic -voice /path/to/Mycroft.flitevox

Or maybe we could provide pre-built binaries?

aatchison commented 8 years ago

Hmm, the flitevox file is just huge and much slower... Pre-compiled binaries would be an option, but what about different architectures?

zeehio commented 8 years ago

Do you have a list of architectures/OS you would like to support?

zeehio commented 8 years ago

Maybe I should profile the voice loading to see where the bottleneck is and whether the performance can be improved.

aatchison commented 8 years ago

Hmm, that might be a good idea. Go ahead and release the build process though, if you like.

m-toman commented 8 years ago

I profiled the voice loading once and, if I remember correctly, the main issue was at https://github.com/MycroftAI/mimic/blob/master/src/cg/cst_cg_map.c#L93 where the mcep trees are read. There are many nested calls reading a lot of numbers value by value, with an error check per call. Reading larger chunks with a single fread could certainly help here.

The voice loading typically takes a few seconds on a background thread on a mobile device, once at startup, so this wasn't a huge problem (I work for VocaliD, in case you wonder).

zeehio commented 8 years ago

Thanks for the info @m-toman, I will try to do that.

Given that you work for VocaliD, do you know if it would be possible to train an HTS version of the Mycroft voice? Adding HTS support to mimic shouldn't be hard, as there is Flite+hts_engine out there.

As you probably already know, HTS voices have a much smaller footprint (<5 MB) and, in my limited experience (a Catalan speech synthesis demo), quite good quality, which is great for embedded apps.

(In case you wonder, I just collaborate with mimic in my spare time. I worked on speech synthesis in the past at the TALP-UPC group under the supervision of Antonio Bonafonte -great person- and now I just spend some time on it for fun.)

forslund commented 8 years ago

@zeehio if you like I can take a look at optimizing the flitevox-loading.

m-toman commented 8 years ago

Ah, I was at SSW 2013 (http://ssw8.talp.cat) in Barcelona :).

I also trained an HTS version but it turned out to be rather disappointing with the regular hts_engine MLSA vocoder (in research we always used STRAIGHT). Mixed excitation, as in flite, is much smoother (but we had to make some changes to the festvox training to get the 44.1 kHz version working). But yes, it is also much larger due to the random forest.

Our German voice model was also much better when trained using the regular HTS demo (3 samples here: http://m-toman.github.io/SALB/), I suppose because it was recorded in a studio setting with a professional speaker and had manually cleaned labels. We can talk about this by email if you like - m dot toman at neuratec dot com :).

zeehio commented 8 years ago

@forslund That would be great, thanks! I am thinking that if you can move forward with pymimic once we release a new mimic version, then maybe it is worth releasing right now and pushing on pymimic a bit more. The main drawback of the voice loading is not that it is slow (a few seconds); the main issue is that Mycroft loads the voice on each mimic call (on each sentence) instead of once per session. It would be great to have pymimic, as it would allow keeping the loaded voice in memory so we would not pay the several-second delay on each sentence. If you feel that it is easier to get pymimic working than to work on optimizations here, then I suggest that we release right now, focus on having a pymimic release too, and adapt Mycroft to use it. It is up to you :-)

@m-toman I will write you an email :-) I helped with the ssw8 organization (passing microphones, etc.). It is a pity that there is no better free-software vocoder implementation. I know that at TALP they have worked with both STRAIGHT and AHOcoder, both improving on the MLSA filter, but unfortunately neither has a free-software implementation. I believe they (at TALP) are also using SALB; I am sure they are thankful for it!

m-toman commented 8 years ago

Even if a bit off-topic, perhaps this discussion is interesting for others too: yes, the vocoder is a big bottleneck. I wrote a small tool to do feature extraction and resynthesis using the flite/mimic MLSA+ME vocoder and it was actually much better than regular MLSA, but still... If some other vocoder comes up, it would be interesting to integrate it, but I'm not sure how generic the flite parameter generation is. The festvox training scripts for clustergen voices are also a lot messier than the HTS demo training scripts and can hardly be parameterized (well, except by sed-replacing Scheme script contents).

I've also been thinking about hybrid synthesis, i.e. replacing the vocoder with a unit-selection search. In the end, a DNN will probably synthesize waveforms directly, I guess :).

Regarding SALB, yes I've been contacted with some questions on it. Back then I decided to build around flite instead of extending it because of https://sourceforge.net/p/at-flite/wiki/AddingNewLanguage/

I've also considered ICU but it seemed a bit huge and I wanted to keep the dependencies low, so I just added special treatment for UTF-8 characters in my small German text analysis. In SALB I've been using flite only for English text analysis, with hts_engine attached and abstractions in between. Probably if you build that into mimic, SALB becomes obsolete :).

The connection from flite to hts_engine is rather simple - there is a huge function converting the utterance structure to an HTS label, and a dummy voice without a synthesis function. But I guess discussion on that would belong in a new issue (like this whole thread, but I'm not sure where :)).

zeehio commented 8 years ago

Sorry for the off-topic discussion; if I could, I would split the issue.

After your comments I contacted Antonio Bonafonte and Asuncion Moreno, both from TALP, and Daniel Erro from AHOLAB, and Dani sent me not one but two possible alternatives:

AhoTTS is a GPL3 speech synthesis system for Basque and Spanish based on the aholab vocoder. Training voices requires aholab binaries, though; then again, if we are going to train HTS voices, HTK is also needed and is non-free...

The other solution is a free (BSD) implementation of something similar to the STRAIGHT vocoder, called WORLD. I believe it is worth looking into.

I will open a new issue and try to see if it is possible to move these vocoder comments there ;-)

aatchison commented 8 years ago

Thanks guys. We could really use a more optimized version :D

Shallowmallow commented 6 years ago

if some other vocoder comes up, it would be interesting to integrate it, but I'm not sure how generic the flite parameter generation is.

STRAIGHT is now open source: https://github.com/HidekiKawahara/legacy_STRAIGHT