tuanad121 closed this issue 5 years ago
It is good to figure out what the differences are. As a non-academic user, I don't have access to STRAIGHT. It would be good to improve the DIO algorithm for F0 extraction or to be able to compute F0 with other algorithms. But it is encouraging to read that you think WORLD does a better job on the synthesis front.
What kind of data have you based your conclusions on? How is the F0 extraction worse? Are the values wrong, or are there more unvoiced frames? Have you compared with any other third-party algorithms (from snack or praat, or using autocorrelation)? Have you played with some of the WORLD parameters to tailor it to the speaker? I wonder if F0 extraction would be better if the actual F0 range of the speaker were used.
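To make the F0-range question concrete, here is a minimal sketch of a plain autocorrelation pitch estimator whose lag search is limited to a speaker-specific range. It is not DIO and not any WORLD code; `make_sine`, `estimate_f0_autocorr`, `f0_floor` and `f0_ceil` are hypothetical names chosen for illustration, and the test signal is a synthetic sine rather than real speech.

```python
import math


def make_sine(freq_hz, sample_rate, num_samples):
    """Synthetic test frame: a pure sine wave (an assumption, not real speech)."""
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
            for n in range(num_samples)]


def estimate_f0_autocorr(frame, sample_rate, f0_floor, f0_ceil):
    """Estimate F0 (Hz) for one frame by normalised autocorrelation.

    Only lags corresponding to [f0_floor, f0_ceil] are searched, so the
    result can never fall outside the range the caller believes in.
    Returns 0.0 when no positive correlation peak is found.
    """
    min_lag = max(1, int(sample_rate / f0_ceil))
    max_lag = min(len(frame) - 1, int(sample_rate / f0_floor))
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        n = len(frame) - lag
        cross = sum(frame[i] * frame[i + lag] for i in range(n))
        energy_a = sum(frame[i] ** 2 for i in range(n))
        energy_b = sum(frame[i + lag] ** 2 for i in range(n))
        if energy_a == 0.0 or energy_b == 0.0:
            continue  # silent segment, correlation undefined
        score = cross / math.sqrt(energy_a * energy_b)
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


# A 40 ms frame of a 200 Hz tone at 16 kHz.
frame = make_sine(200.0, 16000, 640)

# Range that brackets the true pitch: only one period fits in the search.
f0_matched = estimate_f0_autocorr(frame, 16000, f0_floor=120.0, f0_ceil=400.0)

# Range that excludes the true pitch: the estimator locks onto a subharmonic.
f0_mismatched = estimate_f0_autocorr(frame, 16000, f0_floor=50.0, f0_ceil=150.0)
```

With the matched range the search contains exactly one period of the tone. With the mismatched range the estimator can only return a subharmonic. Real extractors are far more robust than this sketch, but the same reasoning is why narrowing the floor and ceiling to the speaker may reduce octave errors.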
For F0 accuracy/precision, it would be great to plug STRAIGHT's f0 (and REAPER) in https://github.com/mmorise/tusk
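For anyone who wants to separate "wrong values" from "more unvoiced frames" when comparing two extractors, a rough sketch of two commonly used measures follows. It assumes both F0 tracks are frame-aligned lists in Hz with 0 marking unvoiced frames, and it uses a 20% deviation threshold for gross pitch errors, a common convention rather than anything taken from this thread. `f0_error_rates` is a hypothetical helper, not part of tusk, WORLD or Merlin.

```python
def f0_error_rates(reference, estimate, tolerance=0.2):
    """Return (gross_pitch_error, voicing_decision_error) for two F0 tracks.

    reference, estimate: equal-length sequences of F0 values in Hz,
    where a value <= 0 marks an unvoiced frame.
    gross_pitch_error: among frames voiced in both tracks, the fraction
        whose estimate deviates from the reference by more than `tolerance`.
    voicing_decision_error: fraction of all frames where the two tracks
        disagree on voiced versus unvoiced.
    """
    if len(reference) != len(estimate):
        raise ValueError("F0 tracks must have the same number of frames")
    if not reference:
        raise ValueError("F0 tracks must not be empty")

    both_voiced = 0
    gross_errors = 0
    voicing_errors = 0
    for ref, est in zip(reference, estimate):
        ref_voiced = ref > 0
        est_voiced = est > 0
        if ref_voiced != est_voiced:
            voicing_errors += 1
        elif ref_voiced:
            both_voiced += 1
            if abs(est - ref) > tolerance * ref:
                gross_errors += 1

    gpe = gross_errors / both_voiced if both_voiced else 0.0
    vde = voicing_errors / len(reference)
    return gpe, vde


# Toy tracks (made-up numbers): one octave error and one missed voiced frame.
reference_f0 = [0.0, 100.0, 100.0, 100.0, 0.0]
estimated_f0 = [0.0, 100.0, 200.0, 0.0, 0.0]
gpe, vde = f0_error_rates(reference_f0, estimated_f0)
```

A high gross pitch error with a low voicing decision error would point at octave jumps, while the opposite pattern would point at the voiced/unvoiced decision.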
Does anybody with access to both STRAIGHT and WORLD have some side-by-side comparison waveforms? We have been able to build some of our own voices using WORLD but we don't have STRAIGHT and a commercial license is very expensive, so I would like to hear a comparison of the same content to hear if it is better.
Any other ideas to get the buzziness out of the WORLD vocoded speech? It seems to mostly affect fricatives and /h/. Would tweaking the BAP / VUV analysis or the F0 extraction help?
It would be great to compare the samples from both vocoders. I guess @ronanki must have some samples. However, in my experience the differences you hear may not be generalisable, since they depend a lot on the voice. I heard a rumor that in AB tests WORLD performed better than STRAIGHT. There are other types of vocoders which use glottal inverse filtering to extract the parameters from speech signals: https://users.aalto.fi/~bollepb1/papers/interspeech_2016.pdf. I modified Merlin to support the GlottHMM vocoder in #87.
Don't have samples available but I briefly compared them for one or two voices and decided that the license is not worth it. They were remarkably similar.
I wonder how the https://research.google.com/pubs/pub43336.html would fare.
Oh and I didn't test it yet but I suspect WORLD is much faster.
I don't have any specific samples but my experience is that the two are generally on par.
Some voices seem a bit better with STRAIGHT, some a bit better with WORLD.
In this paper: http://mirlab.org/conference_papers/International_Conference/ICASSP%202016/pdfs/0005535.pdf
Shinji and Junichi found WORLD to be rated slightly higher than STRAIGHT.
Thanks for the feedback. I can imagine that some voices are better in some methods than others. So far I have tried two in-house voices with the WORLD vocoder, one is a male voice and one is female. While the overall quality is similar, I hear more buzziness in the female voice. I didn't think it was worth it to purchase STRAIGHT but we would need to hear it side-by-side for a variety of voices to really be sure. I am glad to read that WORLD is considered better or equal to STRAIGHT.
I saw the earlier post about GlottHMM. I didn't realize it also contained the vocoder code, so I will take a look at it and see if I can build a Merlin voice with it.
Is there a way to try out Vocaine?
Has anyone tried AHOcoder (http://aholab.ehu.es/ahocoder/info.html)? I have the binaries and can generate parameter files with it. I needed to set the ccorder to 59 to get the same dimensions as WORLD. I haven't implemented the synthesizer part yet, but I can synthesize from the command line with the parameter files generated by Merlin.
This is a bit off topic, but please let me ask it here.
The HTS voice building toolkit has a recipe for using the STRAIGHT vocoder if you have one. I did not find any recipe there for using the WORLD vocoder in place of STRAIGHT.
I do not have a license to use STRAIGHT. I have WORLD, as installed during the Merlin install process, and wish to give it a try in building an HTS voice.
Does anybody here have any knowledge or experience of doing this? Can you please point me to such info?
Thanks in advance.
Both the STRAIGHT and WORLD vocoders are used in Merlin. I just wonder what the critical differences between them are.
I have read the papers, and here's what I found. On analysis, WORLD uses the DIO algorithm, which does not seem superior to F0 extraction in STRAIGHT. WORLD uses CheapTrick to extract spectra, which seems similar to STRAIGHT. Am I right? It seems like the big difference is in how aperiodicity is defined and extracted.
On synthesis, WORLD uses the excitation signal and spectra to calculate the vocal cord vibration. This differs from STRAIGHT, and WORLD's result is better than STRAIGHT's.
What do you think about their differences? Thanks for spending your time on my issue.