Closed wjddd closed 1 month ago
Hi, thank you so much for your great work. Is it possible to generate speech without reference audio after training a new SSR-Speech model?
Besides, I want to confirm whether the dataset format shown below is correct.
Hi, thanks for your interest in our work!
What do you mean by generating speech without reference audio? For speech editing, we need the reference audio because it is what we want to edit. For TTS, we also need a reference to clone its timbre. Could you please clarify what you need?
For the data preparation, please prepare files like this:
{
    "segment_id": "audio1",
    "wav": "/data/audio1.wav",
    "trans": "I like SSR-Speech."
}
{
    "segment_id": "audio2",
    "wav": "/data/audio2.wav",
    "trans": "SSR-Speech can do both zero-shot speech editing and text-to-speech!"
}
...
And the dataset folder can be organized like this:
json_folder
├── train.json
├── test.json
└── val.json
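In case it helps, here is a minimal sketch (not from the repo) for building such manifests. It assumes each wav file has a matching .txt transcript next to it and that the loader accepts a JSON array; adjust the pairing logic and output format to whatever your preprocessing script actually expects:

```python
import json
from pathlib import Path

def build_manifest(wav_dir, out_json):
    """Collect wav files with matching .txt transcripts into one manifest.

    Assumes /data/xxx.wav has its transcript in /data/xxx.txt;
    adapt the pairing to however your transcripts are stored.
    """
    entries = []
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if not txt.exists():
            continue  # skip wavs without a transcript
        entries.append({
            "segment_id": wav.stem,
            "wav": str(wav.resolve()),
            "trans": txt.read_text(encoding="utf-8").strip(),
        })
    with open(out_json, "w", encoding="utf-8") as f:
        json.dump(entries, f, ensure_ascii=False, indent=4)

# Split your data first, then e.g.:
# build_manifest("/data/train", "json_folder/train.json")
# build_manifest("/data/val", "json_folder/val.json")
# build_manifest("/data/test", "json_folder/test.json")
```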
We don't rely on speaker IDs; all you need to do is prepare such JSON files. Please update to the newest version of this repo.
Hi, thanks for your reply! I ran the dataset preprocessing and training scripts successfully, but ran into a problem at inference with the newly trained SSR-Speech model. Are there any specific requirements for the dataset? I used 80 wav files with an average length of about 10 seconds each.
Hi, what kind of error did you face? 10-second audio is fine for the model.
I have trained my model for 5000 steps, but it can't generate speech. I set --text_vocab_size and --text_pad_token to 200 since it's a Mandarin model, and used the default values for the other parameters in the e830M.sh script. Do you have any suggestions on how to improve this?
What do you mean by "it can't generate speech"? What kind of output do you get?
From my experiments, 5000 steps are not enough to train a model. You need at least 5000 hours of data to train the Mandarin model, and 'val_top10acc_cb1' should get above 0.58 to reach ideal results. I used 25k hours of Mandarin data and trained for over 50k steps.
Could you share your training scale and the loss, if possible?
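One thing worth checking before training longer: whether your transcript vocabulary actually fits under --text_vocab_size. Here is a quick sanity-check sketch (not part of the repo); it assumes the manifests above are JSON arrays and uses character-level tokenization as a stand-in, so swap in the tokenizer your pipeline really applies:

```python
import json

# Hypothetical paths and values; point these at your own manifests and flags.
manifests = ["json_folder/train.json", "json_folder/val.json", "json_folder/test.json"]
text_vocab_size = 200  # the value you pass via --text_vocab_size

tokens = set()
for path in manifests:
    with open(path, encoding="utf-8") as f:
        for entry in json.load(f):  # assumes a JSON array of entries
            # Character-level stand-in; replace with the tokenizer your
            # training pipeline actually applies to "trans".
            tokens.update(entry["trans"])

print(f"{len(tokens)} distinct text tokens across manifests")
if len(tokens) >= text_vocab_size:
    print("Warning: vocabulary does not fit in --text_vocab_size; "
          "out-of-range token ids can silently break training and inference.")
```

If your preprocessing feeds raw Mandarin characters to the model, a vocabulary of 200 is almost certainly too small; check which unit (character, pinyin, phoneme) the data preparation script emits before picking the vocab size.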