CSTR-Edinburgh / merlin

This is now the official location of the Merlin project.
http://www.cstr.ed.ac.uk/projects/merlin/
Apache License 2.0

How does Merlin compare to Festival? #440

mrgloom closed this issue 5 years ago

mrgloom commented 5 years ago

How does Merlin compare to Festival? Is Festival part of Merlin?

simonkingedinburgh commented 5 years ago

On 17 Mar 2019, at 19:47, mrgloom wrote:

How does Merlin compare to Festival?

Festival is a complete text-to-speech toolkit. We originally wrote it for building concatenative systems (diphone, then unit selection, and most recently hybrid). It contains a conventional front-end for English.

Many people use only the front-end from Festival, in combination with HTS or Merlin for doing regression, and a waveform generator such as WORLD.
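To make this division of labour concrete, here is a minimal Python sketch of the three-stage pipeline, assuming each stage can be treated as a plain function. The names `synthesise`, `toy_front_end`, `toy_acoustic_model` and `toy_vocoder` are hypothetical, and the toy stages only stand in for Festival, Merlin (or HTS) and WORLD; the real tools exchange label and feature files, not Python lists.

```python
from typing import Callable, List

# Hypothetical stand-in types; the real tools pass files between stages.
LinguisticFeatures = List[List[float]]  # one feature vector per frame
AcousticParams = List[List[float]]      # e.g. spectral envelope, F0 per frame
Waveform = List[float]


def synthesise(
    text: str,
    front_end: Callable[[str], LinguisticFeatures],
    acoustic_model: Callable[[LinguisticFeatures], AcousticParams],
    vocoder: Callable[[AcousticParams], Waveform],
) -> Waveform:
    """Chain the three stages; each one can be swapped independently."""
    features = front_end(text)         # e.g. Festival: text -> linguistic context
    params = acoustic_model(features)  # e.g. Merlin or HTS: regression
    return vocoder(params)             # e.g. WORLD: parameters -> samples


# Toy stand-ins so the sketch runs end to end (purely illustrative).
def toy_front_end(text: str) -> LinguisticFeatures:
    # One "frame" per character: [is_vowel, position in text]
    return [[1.0 if ch in "aeiou" else 0.0, float(i)]
            for i, ch in enumerate(text.lower())]


def toy_acoustic_model(features: LinguisticFeatures) -> AcousticParams:
    # A fixed linear map standing in for a trained neural network.
    return [[0.5 * f[0] + 0.1, 100.0 + 20.0 * f[0]] for f in features]


def toy_vocoder(params: AcousticParams) -> Waveform:
    # One "sample" per frame: energy times F0 (not real synthesis).
    return [p[0] * p[1] for p in params]
```

For example, `synthesise("hello", toy_front_end, toy_acoustic_model, toy_vocoder)` returns one value per character. Passing each stage in as a callable mirrors the point above: the front-end, the regression model and the waveform generator are separate tools, so any one of them can be replaced without touching the others.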

http://www.speech.zone/courses/one-off/merlin-interspeech2017/

Is Festival part of Merlin?

No.

Simon

-- Prof. Simon King, Director of the Centre for Speech Technology Research, Professor of Speech Processing, University of Edinburgh, UK, www.cstr.ed.ac.uk

mrgloom commented 5 years ago

Is Festival part of Merlin?

As far as I can see here, the Festival binaries are compiled too. Or is Festival just optional and not necessary for TTS in Merlin?

I wonder whether any project exists that does concatenative TTS with a natural, human-like voice. Based on these samples, concatenative TTS is on par with the WaveNet approach; however, I can't find any evidence of this in open projects.

Do Festival or Merlin have any TTS sound samples for comparison? I found some samples here, but they are about 20 years old and the voice sounds robotic. I also tried Festival's text2wave, but found that its voice is too robotic as well.

I have also found that VoiceLoop depends on phonemizer, which has Festival as one of its backends.

ZackHodari commented 5 years ago

Festival is not part of Merlin; it is a separate tool. It just happens that Merlin uses Festival, and so Festival is installed within Merlin.

Merlin just uses Festival as a TTS front-end: a tool that converts a sentence represented as characters into a sequence of phones accompanied by a lot of other information, e.g. part of speech, syllable structure, stress and intonation. Note that Festival is more than just a TTS front-end; it can also perform concatenative synthesis.
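As a rough illustration of what "a sequence of phones accompanied by a lot of other information" can look like, the sketch below attaches a few linguistic features to each phone and writes one context string per phone. The `Phone` fields, the `sil` boundary symbol and everything after the triphone part of each string are simplified assumptions, not Festival's actual output; only the `prev-cur+next` triphone notation follows the convention used in HTK/HTS-style labels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phone:
    # Hypothetical, heavily simplified per-phone record; Festival's real
    # utterance structure is much richer (words, syllables, intonation, ...).
    symbol: str
    pos: str        # part of speech of the containing word
    syllable: int   # index of the syllable within the word
    stressed: bool  # lexical stress of the containing syllable


def context_labels(phones: List[Phone]) -> List[str]:
    """Build one context string per phone: prev-cur+next plus a few features.

    Utterance edges are padded with an assumed "sil" (silence) symbol.
    """
    labels = []
    for i, ph in enumerate(phones):
        prev = phones[i - 1].symbol if i > 0 else "sil"
        nxt = phones[i + 1].symbol if i + 1 < len(phones) else "sil"
        labels.append(
            f"{prev}-{ph.symbol}+{nxt}"
            f"/pos:{ph.pos}/syl:{ph.syllable}/stress:{int(ph.stressed)}"
        )
    return labels
```

For the noun "cat" as the phones `k`, `ae`, `t`, this yields `sil-k+ae/pos:nn/syl:0/stress:1`, `k-ae+t/pos:nn/syl:0/stress:1` and `ae-t+sil/pos:nn/syl:0/stress:1`. Real full-context labels carry far more fields, and a regression model such as Merlin's needs them converted into numeric vectors before training.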

If you have an alternative TTS front-end that you want to use, then you will not need to install Festival.

The concatenative speech samples compared with WaveNet will come from Google's production-quality concatenative TTS system, which will have undergone a lot of engineering and fine-tuning to improve its quality. Additionally, they will be using hybrid unit selection, a method in which the units are chosen based on models similar to those you might train in Merlin (statistical parametric speech synthesis).
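The core of hybrid unit selection can be sketched as a dynamic-programming (Viterbi) search: a statistical model predicts a target acoustic vector for each position, and the search picks the database units that minimise the sum of target costs (distance to the prediction) and join costs (mismatch at each concatenation point). This is a minimal sketch of the general technique, not Google's system or Festival's Multisyn; the `Unit` fields, the Euclidean distances and `join_weight` are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Unit:
    # A candidate unit from the recorded database (hypothetical fields).
    name: str
    features: Sequence[float]  # compared with the model's predicted target
    start: Sequence[float]     # acoustics at the unit's left edge
    end: Sequence[float]       # acoustics at the unit's right edge


def dist(a: Sequence[float], b: Sequence[float]) -> float:
    # Euclidean distance, used for both target and join costs here.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_units(
    targets: List[Sequence[float]],
    candidates: List[List[Unit]],
    join_weight: float = 1.0,
) -> List[Unit]:
    """Viterbi search for the unit sequence minimising target + join cost.

    targets[i] is the model-predicted acoustic vector for position i (the
    "hybrid" part); candidates[i] is a non-empty list of units for it.
    """
    n = len(targets)
    # cost[i][j]: cheapest total cost of any path ending in candidates[i][j]
    cost = [[dist(u.features, targets[0]) for u in candidates[0]]]
    back: List[List[int]] = [[-1] * len(candidates[0])]
    for i in range(1, n):
        row, ptr = [], []
        for u in candidates[i]:
            # Best predecessor = its path cost plus the cost of joining to u.
            totals = [
                cost[i - 1][k] + join_weight * dist(prev.end, u.start)
                for k, prev in enumerate(candidates[i - 1])
            ]
            best_k = min(range(len(totals)), key=totals.__getitem__)
            row.append(totals[best_k] + dist(u.features, targets[i]))
            ptr.append(best_k)
        cost.append(row)
        back.append(ptr)
    # Trace back from the cheapest unit at the final position.
    j = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    return path[::-1]
```

The point of searching over whole sequences rather than choosing units one at a time is that a unit which matches its target slightly less well can still win if it joins much more smoothly with its neighbours; a greedy choice on target cost alone would miss that trade-off.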

If you want to train a hybrid voice, it will involve a bit more work: https://cstr-edinburgh.github.io/Multisyn_unit_selection/

As for speech samples, the Blizzard Challenge for speech synthesis releases the participants' submissions; these can be accessed here: http://www.cstr.ed.ac.uk/projects/blizzard/data.html The relevant papers are all summarised here: http://festvox.org/blizzard/