X-LANCE / UniCATS-CTX-txt2vec

[AAAI 2024] CTX-txt2vec, the acoustic model in UniCATS
https://cpdu.github.io/unicats

Example usage #1

Closed. danablend closed this issue 1 year ago.

danablend commented 1 year ago

Thank you very much for the repository - do you have any usage examples for the different tasks such as continuation & editing? :-)

cantabile-kwok commented 1 year ago

Yes, we are going to release inference scripts (including continuation, etc.), as this repo is still being updated. I have just been a little busy these days, and unlike the normal TTS inference process, continuation (as well as editing) requires constructing contexts, which needs more careful coding. For now you can look at continuation.py for a rough guide, but we will make official examples public later. Sorry for the delay!
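
Conceptually, the context construction for continuation looks something like the sketch below. This is only a rough, hypothetical illustration, not the actual API in continuation.py: the function and tensor names are made up. The idea is that the prompt's phonemes are concatenated with the target phonemes, while the prompt's vq-wav2vec indices are kept as the known context that the model should extend.

```python
import torch

# Hypothetical sketch of context construction for speech continuation.
# The names below (build_continuation_batch, tensor layout) are assumptions
# for illustration only, not the repository's real interface.

def build_continuation_batch(prompt_phones, prompt_vq, target_phones):
    """Concatenate prompt and target phonemes; keep prompt VQ tokens as context.

    prompt_phones: LongTensor [P]   phoneme IDs of the acoustic prompt
    prompt_vq:     LongTensor [T_p] vq-wav2vec indices of the prompt audio
    target_phones: LongTensor [Q]   phoneme IDs of the text to continue with
    """
    phones = torch.cat([prompt_phones, target_phones])           # full phoneme input
    context = prompt_vq                                          # known VQ tokens from the prompt
    context_mask = torch.zeros(len(context), dtype=torch.bool)   # positions the model must keep fixed
    return phones.unsqueeze(0), context.unsqueeze(0), context_mask.unsqueeze(0)
```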

danablend commented 1 year ago

Thank you so much for the reply - the code makes sense, and I got it to run!

I wanted to ask: do you still have the trained model files used to generate the samples for the demo clips? This would be super useful to have, as training takes a long time.

cantabile-kwok commented 1 year ago

I am afraid that we currently don't have plans to release the pretrained checkpoint... But for the vocoder "CTX-vec2wav", a checkpoint trained on LibriTTS will be made publicly available soon, so if you need it, you can stay tuned on the corresponding repo (the vocoder might be useful for future work on both vocoding and TTS). Thank you for your understanding!

danablend commented 1 year ago

Thank you for the reply - completely understand! The CTX-vec2wav model checkpoint would be really useful!

I'm wondering how you preprocess your files for training, i.e. how you extract the data manifest and the VQ indices (vqidx)? I am unable to align the shapes of cond_embed and the feature durations; there is always a shape mismatch of 1 to 3 frames.

Thanks so much again for your work :-)

cantabile-kwok commented 1 year ago

When you say "unable to align shapes", do you mean for your own dataset?

Anyway, the manifests for LibriTTS were established long ago and I can't remember the specific code we used for them. Generally, we followed the Kaldi recipes for data organization. That is, for a new dataset, we would manually construct wav.scp (which involves how you define the "utterance ID" entry) and the other necessary files such as utt2spk and text, where text contains the sentences in words. We then used a Kaldi recipe to train an HMM-GMM ASR system and force-align the speech and text, which gave us the phoneme sequence for each utterance and the corresponding durations. As this process was also tedious, we chose to directly provide the generated files. For forced alignment, the MFA tool is also a popular choice besides Kaldi.
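
For reference, the Kaldi-style data files have a simple "utterance ID + value" format per line. The snippet below just writes a toy example of wav.scp, utt2spk, and text; the utterance IDs, speaker IDs, paths, and transcripts are made up for illustration.

```python
from pathlib import Path

# Toy example of Kaldi-style data directory files (wav.scp, utt2spk, text).
# All IDs and paths here are hypothetical.
data_dir = Path("data/my_dataset")
data_dir.mkdir(parents=True, exist_ok=True)

entries = [
    # (utt_id, spk_id, wav_path, transcript)
    ("spk1_utt0001", "spk1", "/corpus/spk1/utt0001.wav", "HELLO WORLD"),
    ("spk1_utt0002", "spk1", "/corpus/spk1/utt0002.wav", "THIS IS A TEST"),
]

with open(data_dir / "wav.scp", "w") as f_wav, \
     open(data_dir / "utt2spk", "w") as f_u2s, \
     open(data_dir / "text", "w") as f_txt:
    for utt, spk, wav, words in entries:
        f_wav.write(f"{utt} {wav}\n")    # <utt-id> <path-to-audio>
        f_u2s.write(f"{utt} {spk}\n")    # <utt-id> <speaker-id>
        f_txt.write(f"{utt} {words}\n")  # <utt-id> <word transcript>
```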

On the other hand, the VQ indexes were extracted using fairseq's vq-wav2vec model, which has fairly clear instructions. As for the shape error of 1 to 3 frames, we have actually experienced the same kind of issue. We think it is mainly caused by the different framing strategies that Kaldi and vq-wav2vec use (e.g. discarding or keeping the last frame). If the length difference is never more than 3, I think you can safely truncate the longer sequence to match the shorter one, as 3 frames is generally not a big deal. When training the other models, we sometimes perform this truncation to keep the features aligned.
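
Concretely, the truncation just cuts both sequences to the shorter length. A minimal sketch is below; the function and array names are illustrative, not code from this repo.

```python
import numpy as np

# Sketch of the length matching described above: when the Kaldi features and
# the vq-wav2vec indices differ by a few frames, truncate both to the shorter
# length. Names are illustrative.

def align_lengths(feats: np.ndarray, vq_idx: np.ndarray, max_diff: int = 3):
    """Truncate feats [T1, D] and vq_idx [T2] to a common length.

    Raises if the mismatch exceeds `max_diff`, since a large difference usually
    indicates a real preprocessing problem rather than a framing difference.
    """
    diff = abs(len(feats) - len(vq_idx))
    if diff > max_diff:
        raise ValueError(f"Length mismatch of {diff} frames is too large")
    t = min(len(feats), len(vq_idx))
    return feats[:t], vq_idx[:t]
```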

Feel free to post here if anything is still confusing!

danablend commented 1 year ago

Thank you, this is exactly the problem I was having, and I got the truncation of the longer sequence to work. Thank you for the details; I will close this now as it is working well! :-)