DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!
Apache License 2.0

build_path_to_transcript_dict_ljspeech doesn't match official ljspeech dataset #136

Closed by thoraxe 1 year ago

thoraxe commented 1 year ago

When you download the official LJSpeech dataset from https://keithito.com/LJ-Speech-Dataset/, you do not get any per-utterance text files; the transcripts come in a single metadata.csv next to a wavs/ folder.

import os

def build_path_to_transcript_dict_ljspeech():
    # Expects one .txt transcript per utterance in txt/ and the matching audio in wav/.
    path_to_transcript = dict()
    for transcript_file in os.listdir("/mount/resources/speech/corpora/LJSpeech/16kHz/txt"):
        with open("/mount/resources/speech/corpora/LJSpeech/16kHz/txt/" + transcript_file, 'r', encoding='utf8') as tf:
            transcript = tf.read()
        # splitext drops only the extension; rstrip(".txt") would also strip trailing 't', 'x' or '.' from the stem.
        wav_path = "/mount/resources/speech/corpora/LJSpeech/16kHz/wav/" + os.path.splitext(transcript_file)[0] + ".wav"
        path_to_transcript[wav_path] = transcript
    return limit_to_n(path_to_transcript)

This dict builder expects a folder structure that does not exist in the original dataset as you can download it today.

Flux9665 commented 1 year ago

Correct, the method uses an internal version of the dataset that was preprocessed at our institute some time ago for unit selection synthesis.

The path-to-transcript dicts are the interface between the toolkit and the data, and since everyone likes to store their data in different ways, they are not generally applicable. The idea is that if you want to train on some data, the path-to-transcript dict is the one thing you have to set up yourself. You can use the path-to-transcript dict of the Thorsten dataset as a template; I believe that dataset is formatted the same way as LJSpeech when it is downloaded and not changed further. Only the delimiter used in the transcription file might be different.
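For the unmodified download, a builder along the following lines might work. This is only a minimal sketch, not the toolkit's own function: the name build_path_to_transcript_dict_ljspeech_official and the root parameter are hypothetical, and it assumes the layout of the official archive, i.e. a pipe-delimited metadata.csv (file id, raw transcript, normalized transcript) next to a wavs/ folder. The limit_to_n call from the original function is left out because its definition is not shown here.

```python
import os


def build_path_to_transcript_dict_ljspeech_official(root):
    """Map each wav path to its transcript, reading <root>/metadata.csv.

    Hypothetical helper; `root` is the folder that holds metadata.csv and wavs/.
    """
    path_to_transcript = dict()
    with open(os.path.join(root, "metadata.csv"), "r", encoding="utf8") as f:
        for line in f:
            # Split by hand instead of using the csv module, because the
            # transcripts may contain quote characters.
            fields = line.rstrip("\n").split("|")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            file_id = fields[0]
            # Prefer the normalized transcript (third column) when it is present.
            if len(fields) > 2 and fields[2].strip():
                transcript = fields[2]
            else:
                transcript = fields[1]
            wav_path = os.path.join(root, "wavs", file_id + ".wav")
            path_to_transcript[wav_path] = transcript.strip()
    return path_to_transcript
```

If a transcription file uses a different delimiter, changing the argument to split should be the only adjustment needed.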