Source code (recreate models from scratch)

zeehio commented 8 years ago

Some people, institutions and licenses define source code as the preferred format for modification.

In mimic we have some data that is shipped as C code. We distribute this C code, but it is auto-generated and is not the preferred format for modification. Examples include the lexicon (#15), the letter-to-sound rules and the voice models.

Far from being a simple licensing issue, this impacts the repository size and, more importantly, our ability to understand where the code comes from and to fix issues fast. On the other hand, having preprocessed data in the repository reduces the build dependencies (Festival speech synthesis would otherwise be required) and the build time (the voice models are already created).

For a fast build time and quick testing with the ability to fix bugs, we could include the cmulex in source format AND its auto-generated equivalent C source code (both the lexicon and the letter-to-sound rules created from it). But this would increase the repository size a lot whenever a correction is made (we are already at about 500 MB!).

Another option could be to keep all raw data under version control using git-lfs (Large File Storage) [1], but I don't have experience with it, so I don't know how easy it is to set up.

Ideas and personal preferences are very welcome.

[1] https://git-lfs.github.com/

forslund commented 8 years ago

I'm still reading up on the specifics of the mimic/flite voice format and pronunciation dictionary, so my opinions might be wrong, misguided, stupid or a combination of the above.

Personally I think it would be great if we could let users update the pronunciation dictionary themselves, but I would like mimic to stay as independent as possible, without too many dependencies. My first thought is to include a standard "precompiled" C-code lexdict, detect whether the user/dev has added an alternative dict, and use this new dict instead of the standard one.
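The fallback idea above could look roughly like the following sketch. Everything here is hypothetical: mimic has no function called select_lexicon_path or file_is_readable, and how a user-supplied dictionary would actually be located and loaded is left open. The sketch only shows the decision "use the user's dict if it exists, otherwise the compiled-in one".

```c
#include <stdio.h>

/* Hypothetical helper, not part of mimic: returns 1 if a file at
 * `path` can be opened for reading, 0 otherwise. */
int file_is_readable(const char *path)
{
    FILE *f;

    if (path == NULL || path[0] == '\0')
        return 0;
    f = fopen(path, "rb");
    if (f == NULL)
        return 0;
    fclose(f);
    return 1;
}

/* Hypothetical helper, not part of mimic: returns `user_path` if it
 * points at a readable file, or NULL to signal that the standard
 * precompiled lexicon should be used instead. */
const char *select_lexicon_path(const char *user_path)
{
    return file_is_readable(user_path) ? user_path : NULL;
}
```

A caller would pass whatever path the user configured (for instance from a command-line option or an environment variable) and fall back to the built-in lexicon when NULL comes back.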

A separate project (or subproject) could track the cmudict and include a simple build script to make it easier to contribute corrections to the mimic pronunciation dictionary. As I see it, the cmudict has a licence compatible with mimic/flite, so there is no real licensing issue in providing the C-code version.

My personal opinion is that the current repo size is a bit on the heavy side, even in this age of infinite bandwidth. I would maybe cut a couple of voices and limit the default selection to one or two, with the option to download additional voices. Moving the data files (like bellbird does) into a separate repository (using git-lfs perhaps!) might be an idea to consider.

rhdunn commented 8 years ago

For the CMU arctic voices -- awb, bdl, clb, jmk, ksp, rms and slt -- voice data is available, but the labeling is not very accurate. There is a range of errors: misplaced phone borders (alignment errors), incorrect phoneme assignment due to accent variation (e.g. American vs Canadian vs Scottish English), and incorrect phoneme assignment due to differences between the phrase and what is actually spoken.

The http://festvox.org/11752/packed/ directory contains various example scripts (build_cg_voice, build_clunits_voice, etc.) for building voices (including flite voices) from 100 of the recordings from awb and rms. The generated voices are not good compared with the flitevox files, nor are the LPC/RES diphone voices, due to the lack of decent alignment files and sufficient diphone coverage.

NOTE: cg is clustergen, an HMM-based synthesis model based on HTS (HMM-based Speech Synthesis System), and clunits generates LPC/RES (residual linear predictive coding) units based around diphones.

For the other voices, I don't believe the voice data is available.

LongBoolean commented 8 years ago

I have been playing around with the build process a bit. I moved all files into one directory (except for the files my system doesn't need) and compiled with gcc -g -O0 -o mimic *.c -lasound -lm -lpulse-simple. I'm getting a few insights from that.

I do think that those data files (lang/cmulex/cmu_lex_entries.c, lang/cmulex/cmu_lex_num_bytes.c, lang/cmulex/cmu_lex_phones_huff_table.c, lang/cmulex/cmu_lex_data_raw.c) should be renamed, changing the extension from .c to .txt or something. Those files do not contain valid C code (most are just comma-separated data) and will give compile errors when compiled without the appropriate makefiles.

forslund commented 8 years ago

I agree that we shouldn't leave them as they are. I think that by going through the build scripts we can make them produce valid C code and make the files easier to use without too much trouble.

Some of those can be changed just a bit. For example, cmu_lex_data.c includes cmu_lex_data_raw.c in the middle of a table. I would prefer to alter the scripts so that they generate cmu_lex_data.c complete with the data that is now in cmu_lex_data_raw.c. cmu_lex_data.c only contains four lines of code, so it wouldn't be hard at all.

cmu_lex_num_bytes.c only contains an integer. I'd rather call it cmu_lex_num_bytes.h, let the script generate #define LEX_NUM_BYTES [generated number], and use LEX_NUM_BYTES in cmu_lex_entries.c instead of including a C file in the middle of an assignment.
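To make the two proposals above concrete, here is a minimal sketch of what the generated output could look like, collapsed into one file so it compiles on its own. The macro name LEX_NUM_BYTES comes from the proposal; the variable names lex_num_bytes and lex_data, the accessor lex_data_size and all values are placeholders for illustration, not the real mimic symbols or lexicon data.

```c
#include <stddef.h>

/* Would live in a generated cmu_lex_num_bytes.h.
 * The value 8 is a placeholder, not the real number. */
#define LEX_NUM_BYTES 8

/* Would live in cmu_lex_entries.c: the macro replaces a C file
 * included in the middle of the assignment. (Hypothetical name.) */
const int lex_num_bytes = LEX_NUM_BYTES;

/* Would live in a fully generated cmu_lex_data.c: the table is
 * written out in place instead of including a raw data file inside
 * the braces. (Hypothetical name and placeholder bytes.) */
const unsigned char lex_data[] = {
    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
};

/* Hypothetical accessor, only here so the table size can be checked. */
size_t lex_data_size(void)
{
    return sizeof(lex_data);
}
```

With this layout every generated file is valid C by itself, so compiling with a plain *.c glob would no longer trip over bare data fragments.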

As soon as I'm certain that the build scripts are working as they're intended I can start modifying the output structure to something we can (hopefully) agree is a workable solution.

LongBoolean commented 8 years ago

@forslund the files in question are not generated by the build scripts. My guess is that they are made by an external tool. I was able to get them to work by renaming them, e.g. to cmu_lex_num_bytes.txt, and then using #include "cmu_lex_num_bytes.txt" where they are needed.

zeehio commented 8 years ago

The pull request linked above should re-create the lexicon and the letter to sound rules.

forslund commented 8 years ago

@LongBoolean I'm pretty sure they are created by the make_cmulex scripts, using scripts from festvox and festival (at least the files in my example). Some extra processing to make them valid C code would not be hard.

@zeehio excellent!

zeehio commented 8 years ago

I have changed the title of the issue to better reflect the specific issue we are dealing with.

I plan to recreate the voice models that I can, as well as the language analysis models needed as the starting point for internationalization.

forslund commented 8 years ago

Sounds good. Give me a shout if you need help from someone that doesn't know the first thing about voice models. :)

Testing, code review. Subissues that aren't that hard :)

m-toman commented 8 years ago

Hi all, I just came across this thread because I am watching this repository. Some time ago I (rather) bruteforced German into flite+hts_engine. It was quite painful and messy, so I agree with your approach to change the method for this... anyway, I took some notes back then: https://sourceforge.net/p/at-flite/wiki/AddingNewLanguage/

Unfortunately I did that for flite+hts_engine, which was afaik based on flite 1.4 and there was no capability to load models from file. Still, perhaps you can make some use of my notes.

Good luck :), Markus

zeehio commented 8 years ago

Sorry for the delay in replying, @m-toman. I will for sure take a look at your code and notes and, if possible, merge them into mimic. Related to #5

MycroftAI / mimic1

Source code (recreate models from scratch) #16