descriptinc / melgan-neurips

GAN-based Mel-Spectrogram Inversion Network for Text-to-Speech Synthesis
MIT License

about models/*.pt #3

Closed MorganCZY closed 5 years ago

MorganCZY commented 5 years ago

Could you please explain what kinds of datasets were used to train the "linda_johnson.pt" and "multi_speaker.pt" models? Does "multi_speaker.pt" correspond to the multispeaker model in the paper? And what is "linda_johnson.pt" for?

wezteoh commented 5 years ago

linda_johnson.pt is trained on the open-source LJ Speech dataset: https://keithito.com/LJ-Speech-Dataset/

Yes, multi_speaker.pt is trained on our internal multispeaker dataset and is expected to generalize to new speakers.

MorganCZY commented 5 years ago

@wezteoh I tested the multi_speaker.pt model on new speakers' voices and it did well. Has your team explored the minimum amount of training data needed per speaker, and the minimum number of speakers, to achieve this kind of generalization? By the way, are the speakers in your internal multispeaker dataset all English speakers?

wezteoh commented 5 years ago

All English speakers, and we used about 10 hours of audio per speaker.

MorganCZY commented 5 years ago

Thanks! Besides, when I run this repo with LJSpeech, I find training is much too slow on a 2080 Ti GPU, far from the reference time on the demo webpage. I stripped out the mel-computation module and fed pre-processed mels directly into the training process via the dataloader, but it didn't accelerate training. Could you give me any speed-up advice?

MorganCZY commented 5 years ago

I rewrote AudioDataset to feed pre-processed audio data directly rather than processing wavs inside the training loop. The training speed is now close to the reported one.
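MorganCZY's rewritten AudioDataset was never posted in this thread, but the idea is standard: precompute features once, then have the dataset only load arrays from disk. A minimal sketch of that pattern (the class name, `.npy` file layout, and directory convention here are assumptions for illustration, not the repo's actual code):

```python
import os
import numpy as np

class PrecomputedMelDataset:
    """Serve mel spectrograms that were precomputed offline to .npy files,
    so no STFT/mel work happens inside the training loop."""

    def __init__(self, mel_dir):
        # One .npy file per utterance; sorted for a deterministic order.
        self.paths = sorted(
            os.path.join(mel_dir, f)
            for f in os.listdir(mel_dir)
            if f.endswith(".npy")
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Loading a precomputed array is far cheaper than recomputing
        # the mel spectrogram from the raw wav on every epoch.
        return np.load(self.paths[idx])
```

In a PyTorch setup this class would subclass `torch.utils.data.Dataset` and be wrapped in a `DataLoader` with several workers; the key point is simply that feature extraction moves out of `__getitem__`.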

ghost commented 5 years ago

@MorganCZY so create a pull request

casper-hansen commented 4 years ago

@MorganCZY can you share your code?