SJTMusicTeam / Muskits

An open-source music processing toolkit
Apache License 2.0

How to train and infer without using forced alignment? #81

Cardroid opened this issue 2 years ago (status: Open)

Cardroid commented 2 years ago

First of all, I would like to express my gratitude for creating a wonderful project.👍

I saw that there are various tokenizer implementations under the text folder. However, I couldn't find a recipe using these options.

My dataset does not include phoneme labels. I could create them, but it would be nice to be able to train without doing so.

If possible, could you tell me how to train models and run inference without phoneme labels?

ftshijt commented 2 years ago

Hi, many thanks for your interest in our project!

We have not intensively tested the text modules yet. Also, given the limited data available for SVS (singing voice synthesis), we haven't found a working solution that removes the dependency on phoneme information entirely.

One potential hack would be to distribute the phoneme durations equally within each word's duration and let the seq2seq model learn the actual durations during training. This still requires word-level alignment, though. We have tried it on the CSD corpus and it works well with Korean syllables. I'm preparing the PR now and will post an update here shortly.
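To make the idea concrete, here is a minimal Python sketch of that equal-distribution hack, not the actual Muskits implementation; the `WordSegment` type and function names are hypothetical.

```python
# Sketch: split each word's aligned time span evenly across its phonemes.
# The resulting coarse alignment is only an initialization; the seq2seq
# model is expected to learn the real durations during training.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordSegment:
    word: str
    start: float          # word onset in seconds (from word-level alignment)
    end: float            # word offset in seconds
    phonemes: List[str]   # phoneme sequence from a G2P module (e.g. KoG2P)


def distribute_durations(
    segments: List[WordSegment],
) -> List[Tuple[str, float, float]]:
    """Assign each phoneme an equal share of its word's duration."""
    result = []
    for seg in segments:
        step = (seg.end - seg.start) / len(seg.phonemes)
        for i, ph in enumerate(seg.phonemes):
            ph_start = seg.start + i * step
            result.append((ph, ph_start, ph_start + step))
    return result


# Example: a one-second word mapped to four phonemes, each getting 0.25 s.
print(distribute_durations(
    [WordSegment("노래", 0.0, 1.0, ["n", "o", "r", "ae"])]
))
```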

Cardroid commented 2 years ago

Thank you👍 I'm looking forward to it!

Cardroid commented 2 years ago

I tried MFA (Montreal Forced Aligner), drawing on my experience building TTS systems. Perhaps because the pitch in singing is not constant, the alignment is poor.

The dataset was CSD, with KoG2P as the G2P module.

(attached image: MFA alignment result)

I'm eager to see your solution soon. 😅