Designing the Rocaloid Engine 3

Sleepwalking commented 11 years ago

As I wrote in README.md before, the next version will be totally rewritten again, like the evolution from Rocaloid1 to Rocaloid1.6.

Currently the RSC, CVS, and CDT formats have already reached version 2.x, which means their version numbers no longer match the synthesizer's. Also considering the significant change in the synthesis algorithm (TDPSM -> FECSOLA), I've decided to name the next generation "Rocaloid Engine 3" instead of 2, along with CVE 3.

Here I have to restate the definition and relations of "Rocaloid Engine":

Rocaloid Engine includes Cybervoice Engine (CVE) and CVS Generator, and provides I/O for RSC (Rocaloid SCript) and CVS (CyberVoice Script).
CVE is the synthesis engine of the Rocaloid Project.
CVS is the file format for storing phonetic information; it can be synthesized directly by CVE. CVS contains much more detailed information (e.g. the duration of each phoneme) than RSC.
RSC is the file format for the note editor; it can only be synthesized after being transformed into CVS by CVS Generator and CDT.
CVS Generator is a subprogram of RSCCommon (which includes CVS Gen and I/O for RSC and vsqx). It transforms RSC into CVS using CDT so that the RSC file can be synthesized.
CDT is the dictionary used by CVS Generator. It contains phonetic definitions, which are data derived from many phonetic experiments.

RSC will not be included in Rocaloid Engine anymore, because RSC is strongly related to the note editor, and dealing with editor settings and musical notations is not the business of Rocaloid Engine. RSC will be replaced by RVS(Rocaloid Vocal Script), which describes the general (but not in detail) information of notes and lyrics (but not phonemes). CVS Generator will be responsible for transforming RVS into CVS. The transformation from RSC (or .vsqx, .vsq, .ust, .nn, etc.) to RVS should be simple (does not require professional phonetics knowledge).

Altogether, the major components and formats in RE3(Rocaloid Engine 3) will be:

CVE 3
CGTOR 3 (Cvs GeneraTOR 3)
CVS 3
RVS 3
CDT 3
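
As a rough illustration of how these pieces relate, here is a minimal Python sketch of the RVS -> CGTOR -> CVS data flow. All names (`RvsNote`, `CvsPhoneme`, `generate_cvs`), the toy dictionary, and the even split of a note's duration across its phonemes are assumptions made for illustration only; the real CGTOR 3 and CDT 3 are not specified in this thread.

```python
from dataclasses import dataclass


@dataclass
class RvsNote:
    """One note as an editor might export it: lyric syllable, pitch, length."""
    syllable: str
    pitch_hz: float
    duration_s: float


@dataclass
class CvsPhoneme:
    """One phoneme with explicit timing, the extra detail CVS carries."""
    symbol: str
    pitch_hz: float
    duration_s: float


# Toy stand-in for CDT: syllable -> phoneme symbols (hypothetical entries).
TOY_CDT = {"zhe": ["zh", "e"], "shi": ["sh", "i"]}


def generate_cvs(notes, cdt):
    """Expand each note into phonemes, splitting its duration evenly (assumption)."""
    phonemes = []
    for note in notes:
        symbols = cdt[note.syllable]
        share = note.duration_s / len(symbols)
        for symbol in symbols:
            phonemes.append(CvsPhoneme(symbol, note.pitch_hz, share))
    return phonemes


notes = [RvsNote("zhe", 220.0, 0.5), RvsNote("shi", 247.0, 0.5)]
cvs = generate_cvs(notes, TOY_CDT)
print([p.symbol for p in cvs])
```

The point of the sketch is only the division of labor: the note-level input knows nothing about phonemes, and all phonetic knowledge lives in the dictionary passed to the generator.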

Additionally, CVS 3 and RVS 3 will be stored in binary instead of text. This is because formant data will be included in CVS and RVS, which would greatly increase the file size and slow down I/O if stored as text. (A CVS 3 text file containing one song would be approximately 10MB.)
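
To give a feel for the size difference, here is a small Python sketch that encodes the same numeric data as text and as packed binary. The frame layout (four formant frequencies per frame, stored as float32) is a made-up assumption, not the actual CVS 3 or RVS 3 layout.

```python
import struct

# Hypothetical frame: four formant frequencies in Hz (not the real CVS 3 layout).
frame = (812.25, 1204.5, 2630.75, 3475.125)
frames = [frame] * 1000

# Text encoding: one line per frame, values separated by spaces.
text_bytes = "\n".join(
    " ".join(f"{value:.3f}" for value in f) for f in frames
).encode("ascii")

# Binary encoding: four little-endian float32 values (16 bytes) per frame.
binary_bytes = b"".join(struct.pack("<4f", *f) for f in frames)

print(len(text_bytes), len(binary_bytes))
```

Besides being smaller, the binary form can be read back with a single `struct.unpack` per frame instead of parsing decimal strings.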

[Image: rocaloid3]

Sleepwalking commented 11 years ago

To fill up the whole phonetic system of a particular language (like Chinese), CDT 3 does multi-level mapping on CVDBs. It's possible to use only "a.cvdb" to reconstruct all vowels in a language! (though the synthesis quality might be horrible) This is achieved by FECSOLA and the mapping structure of CDT 3:

[Image: cdt3mappingstructureexample]

The same approach can also be applied to consonant-vowel diphones, for example "da". The consonant part "d" is skipped during modification, and only the vowel part "a" is transformed into other pronunciations.

Like its predecessor CDT 2.4, CDT 3 also has a syllable table, used by CGTOR 3:


...
        Entry   &uo
            Type    CAV
            TRatio  1
            TList
                T   &[  u
                T   p2  o
                T   p1  p2
            End
        End
        Entry   zun
            Type    CAVV
            TRatio  0.5
            TList
                T   z[  u
                T   p2  e
                T   p2  n
                T   p1  p2
                T   p1  p2
            End
        End
...
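
For anyone who wants to experiment with the table above, here is a minimal Python sketch of a reader for it. The grammar is inferred only from the excerpt (`Entry <name>`, `Type`, `TRatio`, a `TList` of `T` rows, each block closed by `End`), and the function name `parse_syllable_table` is hypothetical. It does not interpret what the fields mean and is not the parser CGTOR 3 uses.

```python
def parse_syllable_table(text):
    """Parse 'Entry ... End' blocks into a dict keyed by syllable name."""
    entries = {}
    current = None      # entry being filled, or None outside an Entry block
    in_tlist = False    # True between 'TList' and its closing 'End'
    for raw in text.splitlines():
        tokens = raw.split()
        if not tokens or tokens[0] == "...":
            continue  # skip blank lines and elision markers
        key = tokens[0]
        if key == "Entry":
            current = {"Type": None, "TRatio": None, "TList": []}
            entries[tokens[1]] = current
        elif current is None:
            continue  # ignore anything outside an Entry block
        elif key == "TList":
            in_tlist = True
        elif key == "End":
            if in_tlist:
                in_tlist = False   # this End closes the TList
            else:
                current = None     # this End closes the Entry
        elif in_tlist and key == "T":
            current["TList"].append(tuple(tokens[1:]))
        elif key == "Type":
            current["Type"] = tokens[1]
        elif key == "TRatio":
            current["TRatio"] = float(tokens[1])
    return entries


sample = """
        Entry   &uo
            Type    CAV
            TRatio  1
            TList
                T   &[  u
                T   p2  o
                T   p1  p2
            End
        End
"""
table = parse_syllable_table(sample)
print(table["&uo"])
```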

There is no restriction on the phonetic symbols you use for a sound db in RE3, which means you can use a custom phonetic symbol system! It's also possible to support multi-language synthesis, because RE3 uses syllables as input, not whole words from a specific language.

In Chinese we use one syllable for each word, which fits directly into CDT 3:

"这" - zhe
"是" - shi
"一" - yi
"个" - ge
"测" - ce
"试" - shi

The above words mean "this is a test". To synthesize English, the above sentence is supposed to be written as:

thi / s / eh / s / ah / teh / s / t

Corresponding pronunciation symbols used by HatsuneMiku(CHN) 1.6.1:

z3 / s# / e- i / s# / a e- / t3 / s# / t3

I've never tried this stuff. I guess it would be almost pure "Chinglish"; it would be better if we used IPA...

m13253 commented 11 years ago

BTW, is there any tool to help create a voice bank, such as what UTAU does?

I believe the future of Rocaloid is not only Hatsune Miku -- you will face copyright problems. So I suggest letting everyone build their own voicebank. (Forgive me if I am asking for something that already exists; I have not built the new Rocaloid 3, since I have been staying at school these past couple of weeks.)

I have an idea: is there any way to have one voice bank contain enough data for multiple languages? As we see, Luka can sing Japanese and English, but that requires two voice banks installed (Luka_JPN & Luka_ENG). Can we make a unified phonetic system that can describe not only Chinese but also English, Japanese, and others? And can we make just one voice bank for multiple languages? I know that is somewhat difficult, since different languages differ in properties such as VOT (voice onset time).

P.S. @Sleepwalking: I wonder if I can get your private e-mail address. I cannot find it on your GitHub profile page (did I miss something?). My mail is b13253 at gmail. I would like to find some books or courses on synthesizers and DSP; could you give me some suggestions over mail? Thanks in advance.

Sleepwalking commented 11 years ago

BTW, is there any tool to help create a voice bank, such as what UTAU does?

Of course, there has been one since the very beginning of Rocaloid. In Rocaloid1 it was called CVD. In Rocaloid Renaissance (1.6) it was called TDPSMStudio. In Rocaloid Renaissance (3) it will be called CVDBStudio.

My email addr is 2657202503@qq.com.

m13253 commented 11 years ago

Why not use an international standard such as X-SAMPA? "This is a test." [ D I s I s @ t_h e s t ] (Note that [i] and [I] are different.)

For Chinese, it's [ tsM V s M i k M V ts_h M V sM ] (however, for Luo Tianyi there are bugs, so the result is [ ts 7 si i k 7 ts_h 7 s` M ])

Sleepwalking commented 11 years ago

Yes, X-SAMPA is OK. I'm just worried that there are so many diphones in X-SAMPA that writing the dictionary will be a laborious job.

Sleepwalking commented 11 years ago

This is the flow chart of CVE 3. Let's hope it can be displayed properly here...

[Image: cve3]

m13253 commented 11 years ago

Actually, due to some issues, VOCALOID's Chinese pronunciation is not always correct, such as "儿" [ @], which should be [ A ] or [ A r\ ] (according to personal preference). And you could (perhaps) make some triphones as VOCALOID does? For example, for "liu" [ l i@U ], VOCALOID simply made the [ i@U ] triphone. Another problem for Chinese: I believe that "an / ang" is pronounced as [ a~ ] / [ A~ ], but Wikipedia does not agree with me ([ a n ] / [ A N ]). So I will do further research. If the former is better, then you can save some diphones. Do you remember my PinYin2XSampa convert utility? I found a bug on "xuan", which ought to yield "[ s\ ya~ ]" but yields "[ s\ ua~ ]" instead. I will fix it ASAP.
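
For context on the "xuan" case: in Hanyu Pinyin spelling, a "u" written directly after j, q, x, or y stands for "ü" (X-SAMPA [y]), so a converter that maps letters one by one will emit [u] there. Below is a minimal Python sketch of a pre-processing step that makes this explicit. The function name `restore_umlaut` is hypothetical, and this is only one plausible cause of the bug; the actual PinYin2XSampa code is not shown in this thread.

```python
def restore_umlaut(syllable):
    """Rewrite the 'u' that directly follows j, q, x or y as 'ü'."""
    if len(syllable) >= 2 and syllable[0] in "jqxy" and syllable[1] == "u":
        return syllable[0] + "ü" + syllable[2:]
    return syllable


for s in ("xuan", "jiu", "you", "lu", "yue"):
    print(s, "->", restore_umlaut(s))
```

Only a "u" in the second position is touched, so syllables such as "jiu" and "you", where the "u" belongs to a different final, are left alone.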

Sleepwalking commented 11 years ago

Amendment: The version of CVDB should be 1.x, not 3.x. CVDB Data Structure: [Image]

m13253 commented 11 years ago

@Sleepwalking wrote:

Amendment: The version of CVDB should be 1.x, not 3.x. CVDB Data Structure: [Image]

Why must you use a structured binary format? It is hard to extend in the future. For example, you fixed the wave data to 44100Hz s16le samples. That is far from professional audio processing. I suggest using the BSON format or an SQLite3 database. That is not only easier to develop further, but also easy to query.
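
To make the SQLite3 suggestion concrete, here is a minimal Python sketch using the standard `sqlite3` module. The table name, column names, and the dummy samples are assumptions for illustration; nothing here reflects the real CVDB layout.

```python
import sqlite3
import struct

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples ("
    " symbol TEXT PRIMARY KEY,"
    " sample_rate INTEGER NOT NULL,"
    " pcm BLOB NOT NULL)"
)

# Four dummy samples packed as signed 16-bit little-endian (s16le).
pcm = struct.pack("<4h", 0, 1000, -1000, 0)
conn.execute("INSERT INTO samples VALUES (?, ?, ?)", ("a", 44100, pcm))

# Look up one phoneme by symbol and decode its samples again.
row = conn.execute(
    "SELECT sample_rate, pcm FROM samples WHERE symbol = ?", ("a",)
).fetchone()
sample_rate, blob = row
samples = struct.unpack(f"<{len(blob) // 2}h", blob)
print(sample_rate, samples)
```

Because the sample rate is stored per row, a schema like this would not hard-code 44100Hz, which is the extensibility concern raised above.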

Sleepwalking commented 11 years ago

@m13253

Vocaloid dbs are actually sampled at 44100Hz, 16-bit, mono.
FFT is only efficient when its size is an integer power of 2. If we change the sample rate, it would be hard to apply FFT (currently 1024 points).
The db cannot be finished in a day; I mean it's a long process. Using a structured database would make debugging and testing super hard. However, later I will put all CVDB files together in one file (when RE3 is finished).
The structure of the db doesn't matter so much, because there is a DB Mapping Layer in CDT3 which is designed to solve the problem you mentioned (other db forms).
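
As a side calculation on the FFT point, the Python sketch below works out what a 1024-point frame means at a few sample rates. It is plain arithmetic, not CVEDSP code; the helper name `fft_frame_info` and the alternative rates 48000 and 96000 are chosen here only for comparison.

```python
FFT_SIZE = 1024  # points, as stated in the thread


def fft_frame_info(sample_rate_hz, fft_size=FFT_SIZE):
    """Return (bin width in Hz, frame length in ms) for one FFT frame."""
    bin_width_hz = sample_rate_hz / fft_size
    frame_ms = 1000.0 * fft_size / sample_rate_hz
    return bin_width_hz, frame_ms


for rate in (44100, 48000, 96000):
    width, length = fft_frame_info(rate)
    print(f"{rate} Hz: {width:.2f} Hz per bin, {length:.2f} ms per frame")
```

At a higher sample rate the same 1024 points cover a shorter stretch of time with coarser frequency bins, so analysis settings tuned for 44100Hz would likely need retuning.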

Sleepwalking commented 11 years ago

CVS3 Structure Design

[Image: cvs3structure]

m13253 commented 11 years ago

Vocaloid dbs are actually sampled at 44100Hz, 16-bit, mono. FFT is only efficient when its size is an integer power of 2. If we change the sample rate, it would be hard to apply FFT (currently 1024 points). The db cannot be finished in a day; I mean it's a long process. Using a structured database would make debugging and testing super hard. However, later I will put all CVDB files together in one file (when RE3 is finished). The structure of the db doesn't matter so much, because there is a DB Mapping Layer in CDT3 which is designed to solve the problem you mentioned (other db forms).

However, since you will have to resample the original wave during the synthesis process, the higher the sample rate is, the better the result is. Why do you have to implement the FFT algorithm yourself? You can use the libfft library to simplify your work. And I bet you have not used an SQLite3 database. It is convenient and highly structured.

Sleepwalking commented 11 years ago

@m13253 I enjoy the process of Building from Scratch. And CVEDSP is faster than libfftw.

Sleepwalking commented 11 years ago

RVS3 Structure: much simpler than CVS3. [Image: rvs3structure]

Sleepwalking commented 11 years ago

The whole structure of CDT3. [Image: cdt3structure]

Sleepwalking / Rocaloid-old

Designing the Rocaloid Engine 3 #14