I need someone to write some Qt applications to test the new algorithm.

Sleepwalking commented 11 years ago

Well, it's an algorithm for formant modulation, which can be used to change the pronunciation of waves in the sound db, and I think it will be extremely useful for CVE2. I call it FECSOLA (Formant Envelope Coefficient Shift and OverLap Add). Briefly, it works by modifying the spectral envelope with OLA.

For example, you have the wave of "a", and you know its formant frequencies. Just put it into FECSOLA and tell it the new formant frequencies, and the modified wave comes out (which might be transformed into "i" / "o" / "e").
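The FECSOLA code itself is not posted in this thread, so the following is only a rough sketch of the general idea of moving formant peaks in a spectral envelope, not the actual algorithm. It assumes the envelope is sampled at evenly spaced points from 0 Hz to `top_hz`, that the formant lists are ascending and lie strictly between 0 and `top_hz`, and that a piecewise-linear frequency warp is good enough. The names `warp_to_source` and `shift_envelope` are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: map a destination frequency back to the source
   frequency it should read from. Anchor points are 0 Hz, each formant,
   and top_hz; the mapping is linear between neighbouring anchors. */
static float warp_to_source(float f_hz, const float *src_formants,
                            const float *dst_formants, size_t n, float top_hz)
{
    float d0 = 0.0f, s0 = 0.0f;
    for (size_t i = 0; i <= n; i++) {
        float d1 = (i < n) ? dst_formants[i] : top_hz;
        float s1 = (i < n) ? src_formants[i] : top_hz;
        if (f_hz <= d1) {
            if (d1 <= d0) return s1; /* degenerate segment, avoid divide by zero */
            return s0 + (f_hz - d0) * (s1 - s0) / (d1 - d0);
        }
        d0 = d1;
        s0 = s1;
    }
    return f_hz; /* above top_hz: leave the frequency unchanged */
}

/* Resample src_env into dst_env so that the peaks at src_formants land on
   dst_formants. bins (>= 2) is the number of envelope points. */
void shift_envelope(const float *src_env, float *dst_env, size_t bins,
                    const float *src_formants, const float *dst_formants,
                    size_t n, float top_hz)
{
    float hz_per_bin = top_hz / (float)(bins - 1);
    for (size_t k = 0; k < bins; k++) {
        float src_hz = warp_to_source((float)k * hz_per_bin, src_formants,
                                      dst_formants, n, top_hz);
        float pos = src_hz / hz_per_bin;
        size_t lo = (size_t)pos;
        if (lo >= bins - 1) {           /* clamp at the last point */
            dst_env[k] = src_env[bins - 1];
            continue;
        }
        float frac = pos - (float)lo;   /* linear interpolation between points */
        dst_env[k] = src_env[lo] * (1.0f - frac) + src_env[lo + 1] * frac;
    }
}
```

With a single formant moved from 1000 Hz to 2000 Hz, the envelope peak that sat at 1000 Hz in the source shows up at 2000 Hz in the destination. The overlap-add resynthesis step is not covered here.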

Obviously this algorithm can be used for correcting Miku's poor Chinese pronunciation.

I'm not going to use FECSOLA in building the new db, because it takes much more effort (there is lots of work to do on the new db) and increases the size of the db. Instead I'm going to embed it into CVE2 and do the modification in real time (driven by some given parameters).

So the problem is we have to figure out:

Which mispronounced symbols can be corrected.
The best parameters to correct these symbols.

Theoretical approaches such as observing and analyzing the spectra will not work, since we want the best output quality. So the only way is to run those symbols and formant parameters through plenty of real tests and see what works...

The tester will be really simple: nothing more than a few sliders (controlling F0, F1, F2, F3), picture boxes (to show the spectrum before and after modification), and several buttons to load and play the .wav files.

I have learned neither Qt nor C++... So I would be glad if someone could help me make this application. The algorithm has a C implementation, which is easy to port to C++.

For details and the code of FECSOLA, I'll post them below if someone replies to this post.

Sleepwalking commented 11 years ago

@m13253 @chaserhkj

m13253 commented 11 years ago

I'm glad to help you. However, will Gtk do? I have only learned Gtk. :-P

I am preparing for an exam right now. I will devote myself to your project immediately after it.

Sleepwalking commented 11 years ago

Gtk is OK. But you have to find another lib for audio. If your exam is in 5 days, we can wait...

Actually I'm going to take TOEFL on 9/14... Struggling with Listening & Speaking...

m13253 commented 11 years ago

My exam is in two weeks, sorry.

Can you write a small command-line tool which can test your new algorithm?

Sleepwalking commented 11 years ago

Oops... I've already made one, but it's very inconvenient to test hundreds of waves hundreds of times... And that's why I need a GUI.

m13253 commented 11 years ago

Hundreds? Maybe what you want is a small bash script.

    for i in *.wav; do
        blah blah blah blah
    done

m13253 commented 11 years ago

I have read your request, and I know that I can do it (but only after my exam :-( )

We'll just wait and see whether there is anyone else who can help.

Sleepwalking commented 11 years ago

No, because you have to adjust the parameters as you try to optimize the pronunciation.

m13253 commented 11 years ago

I get the idea: some sliders for formant adjustment and a button to play the result. As I have said, I know that I can do it (but only after my exam :-( ). We'll just wait and see whether there is anyone else who can help.

digited commented 11 years ago

How about this?

[Screenshot from 2013-08-20 14:29:42]

https://github.com/digited/qtau/tree/master/tools/rocatool

It can play .wavs if you drag and drop files into left or right (before/after) widgets and press "play".

What "spectrum" should be drawn? Frequency spectrum or waveform as audio plays?

torinkwok commented 11 years ago

Use .wav format?

You can use Qt 5 and QtMultimedia lib.

torinkwok commented 11 years ago

But the QtMultimedia lib only supports the wav format. Qt 4.x contains a lib called Phonon, but it was dropped in Qt 5.x.

torinkwok commented 11 years ago

Well, I'm "a code monkey" (一个码农).

digited commented 11 years ago

Phonon has its problems, especially on Windows. I've used it before for VocaSeed (code.google.com/p/vocatube-webseed-client/) in Qt4 and decided to drop it in favor of ffmpeg.

QAudioOutput from QtMultimedia plays only raw PCM, and I plan to manually decode formats in QTau to ensure that required formats will be supported no matter what OS and what codecs are installed. Currently QTau has only .wav support; .flac and .ogg will be added a bit later.

This RocaTool uses Qt5, QtMultimedia and code from QTau, so only .wav for now. Do you need support of other audio formats now?

Sleepwalking commented 11 years ago

Cool. That's almost the same as my design: http://imagebin.org/268178 where S1, S2, S3 mean the strength (magnitude) of the formants.

And there's no need for a stop button since all waves are less than 3 seconds. Only .wav support is OK.

This tester should operate on only one wave, which is loaded by dragging it in. The Play button only plays the modified sound. The spectrum boxes show the magnitude frequency spectra before and after modification. Instead of rendering in real time, they simply show the spectra taken in the middle of the wave, so you can see what the modified spectrum looks like while you are dragging the sliders.

Just write a framework without FECSOLA and reserve two arrays for me to hold the spectra. It would be better if the spectrum boxes could also show spectral envelopes. Your program should offer these empty functions:

    LoadWav(char* path)
    UpdateSpectrum1(float* DestArray)
    UpdateSpectrum2(float* DestArray, parameters)
    Synthesis(float* DestWave, parameters)

Here is the structure for storing parameters:


    typedef struct FECSOLAState
    {
        float F0; //Frequency of Formant 1
        float F1; //Frequency of Formant 2
        float F2; //Frequency of Formant 3
        float F3; //Frequency of Formant 4
        float L0; //Width of Formant 1
        float L1; //Width of Formant 2
        float L2; //Width of Formant 3
        float L3; //Width of Formant 4
        float S0; //Strength of Formant 1
        float S1; //Strength of Formant 2
        float S2; //Strength of Formant 3
        float S3; //Strength of Formant 4
    } FECSOLAState;

L0, L1, L2, L3 are not used in the algorithm yet; you can set them to 600 (they are reserved for future expansion).
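To make the request concrete, here is a minimal sketch of how the framework side might look in C: the parameter struct from above, a default initializer, and the four empty functions. Several things here are assumptions rather than anything specified in this thread: the `parameters` argument is taken to be a `const FECSOLAState*`, `LoadWav` is given an `int` status return, the helper name `FECSOLAState_Default` is made up, and a strength of 1.0 is assumed to mean "unchanged". Only the widths being 600 comes from the post.

```c
#include <stddef.h>

typedef struct FECSOLAState
{
    float F0, F1, F2, F3; /* Frequencies of formants 1 to 4 */
    float L0, L1, L2, L3; /* Widths of formants 1 to 4 (unused for now) */
    float S0, S1, S2, S3; /* Strengths of formants 1 to 4 */
} FECSOLAState;

/* Hypothetical helper: widths preset to 600 as suggested above.
   Strength 1.0 as the neutral value is an assumption. */
FECSOLAState FECSOLAState_Default(void)
{
    FECSOLAState s = {0};
    s.L0 = s.L1 = s.L2 = s.L3 = 600.0f;
    s.S0 = s.S1 = s.S2 = s.S3 = 1.0f;
    return s;
}

/* Empty placeholders, to be filled in with the FECSOLA code later.
   Return type and parameter types are assumptions. */
int LoadWav(const char *path)
{
    (void)path;
    return 0; /* 0 = success in this sketch */
}

void UpdateSpectrum1(float *DestArray)
{
    (void)DestArray; /* would write the original spectrum */
}

void UpdateSpectrum2(float *DestArray, const FECSOLAState *Param)
{
    (void)DestArray; /* would write the modified spectrum */
    (void)Param;
}

void Synthesis(float *DestWave, const FECSOLAState *Param)
{
    (void)DestWave; /* would write the modified wave */
    (void)Param;
}
```

The GUI would then keep one `FECSOLAState`, update its fields from the sliders, and pass it to `UpdateSpectrum2` and `Synthesis` whenever a slider moves.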

digited commented 11 years ago

> http://imagebin.org/268178

I can make it look like this. Can you please define minimum/maximum limits for sliders? Or should I make get/set functions to configure them at runtime?

> spectrum box can also show spectral envelopes

I can draw anything in the boxes, but the problem is that I have no idea how to calculate the data for drawing. How do I get the data to draw that spectral envelope?

Sleepwalking commented 11 years ago

> Can you please define minimum/maximum limits for sliders?

0Hz - 6000Hz for F1, F2, F3
0 - 3 for S1, S2, S3

> I don't have any idea how to calculate data for drawing.

Well, you can just add another array and I'll put the envelopes in it. All you need to do is overlay the envelope on the spectrum and show them both in the picture box.
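The slider limits given above (0 to 6000 Hz for formant frequencies, 0 to 3 for strengths) could be wired up roughly as follows. This is only a sketch assuming the GUI slider reports an integer position from 0 to `steps`; the function names are made up and are not part of RocaTool.

```c
/* Map an integer slider position in [0, steps] linearly onto [min_v, max_v].
   Out-of-range positions are clamped. steps must be greater than 0. */
float slider_to_value(int pos, int steps, float min_v, float max_v)
{
    if (pos < 0) pos = 0;
    if (pos > steps) pos = steps;
    return min_v + (max_v - min_v) * (float)pos / (float)steps;
}

/* Formant frequency sliders: 0 Hz to 6000 Hz. */
float slider_to_formant_hz(int pos, int steps)
{
    return slider_to_value(pos, steps, 0.0f, 6000.0f);
}

/* Formant strength sliders: 0 to 3. */
float slider_to_strength(int pos, int steps)
{
    return slider_to_value(pos, steps, 0.0f, 3.0f);
}
```

For example, a formant slider with 1000 steps sitting at position 500 gives 3000 Hz.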

Sleepwalking commented 11 years ago

By the way, our IRC development channel is freenode/#RDG. You're welcome to join.

digited commented 11 years ago

Moved to #Rocaloid since rgwan has muted #RDG.

upd tool looks like this atm...

[Screenshot from 2013-08-20 22:01:29]

m13253 commented 11 years ago

On 2013-8-20, 18:36, "digited" notifications@github.com wrote:

How about this?

https://github.com/digited/qtau/tree/master/tools/rocatool

It can play .wavs if you drag and drop files into left or right (before/after) widgets and press "play".

Should the slider have a maximum of 8kHz or 10kHz? I think 6kHz is not enough to cover human formants.

What "spectrum" should be drawn? Frequency spectrum or waveform as audio plays?

I did not quite catch your idea above.

m13253 commented 11 years ago

Sorry that my e-mail client is not formatting properly. :-)

Now resend as below.

Should the slider have a maximum of 8kHz or 10kHz? I think 6kHz is not enough to cover human formants.

> What "spectrum" should be drawn? Frequency spectrum or waveform as audio plays?

I did not quite catch your idea above.

Sleepwalking commented 11 years ago

6kHz is enough. Generally only F1, F2, and F3 affect pronunciation, while F4 and F5 are more related to the traits of the speaker. F3 never goes above 5kHz: though Miku's F3 is always above average, it is still no more than 4500Hz. In most vocal analysis the frequency range considered is 0 - 5500 Hz.

For frequencies higher than 6kHz, FECSOLA simply copies their envelope to the dest spectrum envelope buffer.
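As an illustration of that pass-through behaviour (not the actual FECSOLA code), the copy could look something like this. It assumes the envelope buffer has `bins` evenly spaced points covering 0 Hz up to the Nyquist frequency, and the function name `copy_high_band` is made up.

```c
#include <stddef.h>

/* Copy every envelope point above cutoff_hz from the source to the
   destination unchanged. Points at or below the cutoff are left alone so
   the formant-modified values already in dst_env survive.
   Assumes bins >= 2 and points evenly spaced from 0 Hz to sample_rate / 2. */
void copy_high_band(const float *src_env, float *dst_env, size_t bins,
                    float sample_rate, float cutoff_hz)
{
    float hz_per_bin = (sample_rate * 0.5f) / (float)(bins - 1);
    for (size_t k = 0; k < bins; k++) {
        if ((float)k * hz_per_bin > cutoff_hz)
            dst_env[k] = src_env[k];
    }
}
```

Called with `cutoff_hz` set to 6000, everything the sliders cannot reach stays identical to the original envelope.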

digited commented 11 years ago

ok RocaTool conforms to spec now. Needs testing, also needs to get the resulting wave from CVE somehow to play it.

upd wow, it actually works...

digited commented 11 years ago

Really works!

[Screenshot from 2013-08-26 15:34:05]

http://www.bilibili.tv/video/av733216/

Sleepwalking commented 11 years ago

TDPSMStudio has to be ported from .NET to Qt. https://github.com/Sleepwalking/Rocaloid/tree/Rocaloid-1.6.0-Core-ver.-(VB.Net)/RocaloidDevelopSuit/TDPSMStudio

I don't have .NET, so here are some screenshots of TDPSMStudio taken several months ago: psms1, 05wa, analyzersettings, weditor, cvqc, preprocess

We're going to rename it CVDBStudio 3. Some adaptations have to be made. Reply below or open another issue, and I'll go into detail.

Sleepwalking commented 10 years ago

@digited I've finished all preparations for CVE3 except CVDBStudio, would you please start now?

Sleepwalking / Rocaloid-old

I need someone to write some Qt applications to test the new algorithm. #13
