Question regarding using own parser

MaheshVelankar commented 7 years ago

This is a question rather than an issue.

I wish to use this utility for the Marathi language, for which I have my own parser to convert Unicode (UTF-8) Devanagari text into phonemes.

How can I use it? Which part of the code should I look into in order to modify it to use my parser instead of whatever parser this utility is using?

I have just started looking into this utility, so please excuse me if this is too trivial.

Thanks in advance

zeehio commented 7 years ago

In order to use mimic for a new language you will need a language model (able to do phonetic transcriptions among other things) and a speech model.

The mimic-spanish module uses an external library named Saga for the phonetic transcriptions and syllabification and to do that it uses a custom lexical insertion function.

The speech synthesis in mimic is done through a series of modules (defined at src/synth/cst_synth.c) that take care of:

splitting the raw text into tokens
normalizing each token into a list of words (for instance, the token "2" is normalized to the word "two")
adding some phrasing breaks
converting each word into a list of syllables with phonemes
modeling intonation (marking words with emphasis...)
(...)
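
As a rough illustration of the stages listed above, here is a minimal Python sketch. It is not mimic code: every function name, the tiny number table, and the `word_to_syllables` callback are hypothetical. It only shows where a custom parser would plug in, namely at the step that turns each word into syllables and phonemes.

```python
import re

# Hypothetical lookup used by the toy normalizer; not taken from mimic.
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three"}
PUNCTUATION = ".,;:!?"


def tokenize(text):
    """Split raw text into whitespace-separated tokens (punctuation is kept)."""
    return [t for t in re.split(r"\s+", text.strip()) if t]


def normalize(token):
    """Expand one token into a list of words, e.g. "2" -> ["two"]."""
    bare = token.strip(PUNCTUATION).lower()
    if not bare:
        return []
    return [NUMBER_WORDS.get(bare, bare)]


def front_end(text, word_to_syllables):
    """Run the toy pipeline; word_to_syllables is the user-supplied parser."""
    words = []
    for token in tokenize(text):
        token_words = normalize(token)
        for i, word in enumerate(token_words):
            is_last = i == len(token_words) - 1
            words.append({
                "word": word,
                # Toy phrasing rule: break after a token ending in punctuation.
                "break_after": is_last and token[-1] in PUNCTUATION,
                "syllables": word_to_syllables(word),
            })
    return words


def toy_parser(word):
    """Placeholder parser: one syllable per word, one phone per letter."""
    return [list(word)]


result = front_end("I have 2 cats.", toy_parser)
for entry in result:
    print(entry)
```

Replacing `toy_parser` with a function that wraps your own parser is the only change this sketch would need; in mimic the equivalent hook is the lexical insertion step.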

Depending on the speech model you have or want to build (a clustergen voice [cmu_us_slt], an HTS voice [cmu_us_slt_hts], a diphone voice [kal16]...), you will require different features from the language analysis part. For instance, in diphone voices you get a model for each possible pair of phones, and you need to provide a model for the duration of these diphones. In clustergen and HTS voices you extract more contextual features from the language model (the current phoneme to synthesize, the previous and next phonemes, the total number of syllables in the current word, the number of syllables in this word before the current phoneme, the number of words in the phrase...). These features are needed to improve the speech quality, as they are used to build a decision tree that chooses the most similar speech sample in the database (clustergen) or the most similar trained speech model (HTS).
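
To make the contextual features more concrete, here is a small Python sketch that computes the ones mentioned above for every phoneme in a phrase. The input layout (words as lists of syllables, syllables as lists of phones), the feature names, the `"pau"` boundary placeholder and the example phones are all assumptions made for illustration; the real feature set used by clustergen and HTS voices is larger and is defined in the mimic sources.

```python
def phoneme_contexts(phrase):
    """Return one feature dict per phoneme.

    phrase: list of words; each word is a list of syllables;
    each syllable is a list of phone symbols.
    """
    flat = [p for word in phrase for syl in word for p in syl]
    features = []
    index = 0  # position of the current phone in the flat sequence
    for word in phrase:
        for syl_pos, syl in enumerate(word):
            for phone in syl:
                features.append({
                    "phone": phone,
                    # "pau" is a placeholder for "no neighbour" at the edges.
                    "prev": flat[index - 1] if index > 0 else "pau",
                    "next": flat[index + 1] if index + 1 < len(flat) else "pau",
                    "syls_in_word": len(word),
                    "syls_before": syl_pos,
                    "words_in_phrase": len(phrase),
                })
                index += 1
    return features


# Illustrative data only; the phone symbols are not a real phone set.
phrase = [
    [["k", "a"], ["m", "a"], ["l", "a"]],  # a three-syllable word
    [["v", "a", "n"]],                     # a one-syllable word
]
feats = phoneme_contexts(phrase)
print(feats[4])
```

A voice trained on such features expects them at synthesis time too, which is why the language front end and the speech model have to agree on them.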

Mimic (a fork of Flite, in turn derived from Festival) uses a Heterogeneous Relation Graph (HRG) formalism to define and extract those linguistic features, sharing as much as possible between languages. The HRG is defined in src/hrg and states that a text is divided into Utterances ("paragraphs"), where each utterance can be synthesized independently. Each utterance has a set of Relations that contain Items. For example, the utterance would have:

A "text" relation with a single item containing the raw text
A "Token" relation. Each item of the "Token" relation is a token that has the following features: the token name (e.g. "the" or "children" or "2" or "1st"), the token prepunctuation and the token postpunctuation (punctuation marks before or after the token)
A "Word" relation. It has a list of items, where each item is a word (e.g. "the", "children", "two", "first"). Each word is linked to the next and previous word and to its parent Token item.
The HRG model covers the syllabification and all the other linguistic parts described above through other relations. There is more info here: http://zeehio.github.io/speech-tools/estling.html but you will have to read mimic code, at least other language support modules, to see how this works.
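
To give a feel for the utterance / relation / item structure, here is a deliberately simplified Python model of it. The class names follow the description above, but the implementation is an assumption for illustration and differs from the C code in src/hrg: for instance, here the previous/next links live directly on the item, while in a real HRG an item can belong to several relations at once.

```python
class Item:
    """A node carrying a feature dictionary and links to its neighbours."""

    def __init__(self, name, **features):
        self.features = {"name": name, **features}
        self.prev = None
        self.next = None
        self.parent = None


class Relation:
    """An ordered list of items; appending links each item to the previous one."""

    def __init__(self, name):
        self.name = name
        self.items = []

    def append(self, item):
        if self.items:
            item.prev = self.items[-1]
            self.items[-1].next = item
        self.items.append(item)
        return item


class Utterance:
    """A chunk of text that can be synthesized independently."""

    def __init__(self, text):
        self.relations = {}
        self.relation("text").append(Item(text))

    def relation(self, name):
        """Return the named relation, creating it on first use."""
        return self.relations.setdefault(name, Relation(name))


utt = Utterance("the 2 children,")
tokens = utt.relation("Token")
words = utt.relation("Word")

# (token name, postpunctuation, words the token normalizes to)
for name, post, expansion in [("the", "", ["the"]),
                              ("2", "", ["two"]),
                              ("children", ",", ["children"])]:
    token = tokens.append(Item(name, prepunctuation="", postpunctuation=post))
    for w in expansion:
        word = words.append(Item(w))
        word.parent = token  # each word points back to its Token item

two = words.items[1]
print(two.features["name"], "comes from token", two.parent.features["name"])
```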

For Spanish support, I tried to stay close to the HRG structure in mimic so I could reuse all the language-independent functions that mimic offers to extract the HTS and clustergen features. To do that, I inserted the syllabification and phonetic transcription information from the Saga transcription tool into the SylStructure and Segment relations. Feel free to use it as an example.
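
The idea of filling SylStructure and Segment from an external tool can be sketched in Python as follows. This is not the mimic-spanish code: the `parse_word` signature, the dictionary layout and the lookup-table parser are hypothetical, and the phones are made up. What it tries to show is that the same segment objects appear both in the word/syllable tree and in the flat segment list.

```python
def insert_transcription(words, parse_word):
    """Build toy SylStructure and Segment relations from an external parser.

    parse_word (hypothetical signature) takes a word and returns a list of
    syllables, each syllable being a list of phone symbols.
    """
    syl_structure = []  # tree: word -> syllables -> segments
    segments = []       # flat list of every phone, in utterance order
    for word in words:
        word_node = {"name": word, "syllables": []}
        for phones in parse_word(word):
            syl_node = {"segments": []}
            for phone in phones:
                seg = {"name": phone}
                # The same object is referenced from both relations.
                syl_node["segments"].append(seg)
                segments.append(seg)
            word_node["syllables"].append(syl_node)
        syl_structure.append(word_node)
    return syl_structure, segments


def lookup_parser(word):
    """Stand-in for a real parser; the entry below is illustrative only."""
    table = {"kamala": [["k", "a"], ["m", "a"], ["l", "a"]]}
    return table[word]


tree, segs = insert_transcription(["kamala"], lookup_parser)
print([s["name"] for s in segs])
print(len(tree[0]["syllables"]), "syllables")
```

With the relations filled in this shape, the language-independent feature functions can walk from a segment up to its syllable and word without knowing which tool produced the transcription.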

Do you have a speech model already or some recordings? I know there is some indic support in mimic (inherited from Flite) but I don't know if the speech models (voices) work well or not.

MaheshVelankar commented 7 years ago

Thanks for the very prompt response and the detailed description (which was also very easy to understand).

I am able to build and use an htsvoice in a TTS system that uses Festival. I already have the parser that is used for training and synthesizing; it converts text into syllables and phonemes. I will study how the Spanish module uses its external parser and try to work along similar lines.

I do not work on this continuously, so please excuse me if my further questions are delayed.

Thanks again

zeehio commented 7 years ago

Just a couple more things:

There is a sort of mimic-indic language support at https://github.com/MycroftAI/mimic-indic/, but I have not tested it. mimic-indic is the support that was available in Flite, probably inherited from some indic support in Festival.
There are some indic voices at http://www.festvox.org/flite/packed/flite-2.0/voices/ that should be compatible with mimic.
If your parser is open source, feel free to point to the code if you want hints for the conversion to mimic. There are some tools to export festvox voices to flite (and therefore to mimic as well). Check out the flite manual as an initial reference.

I also work on this whenever I can, so we will ask/answer questions to each other whenever we can. No pressure 👍

Best

vinbo8 commented 7 years ago

I don't know how helpful this would be, but if you're looking for something that could handle tokenisation, Apertium has a FOSS morphological analyser for Marathi - https://sourceforge.net/p/apertium/svn/HEAD/tree/languages/apertium-mar/

MycroftAI / mimic1

Question regarding using own parser #139