Plachtaa / VALL-E-X

An open-source implementation of Microsoft's VALL-E X zero-shot TTS model. A demo is available at https://plachtaa.github.io
MIT License

About the language embedding #133

Open shanhaidexiamo opened 8 months ago

shanhaidexiamo commented 8 months ago

Hi, I have a question I'd like to ask. In the forward (training) stage, the paper describes adding the language embedding to the acoustic tokens, but at inference time the language embedding can only be added to the text tokens. How did you resolve this? I see your forward code hasn't been released. My workaround was to add it to the text in the forward stage as well, but the inference results are very poor.

Hope you can shed some light on this. Thanks a lot.
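
For illustration, a minimal sketch of the "add the language embedding to the text-token embeddings" approach discussed in this thread. The module and argument names below are hypothetical and are not the repo's actual forward code:

```python
import torch
import torch.nn as nn

class TextEncoderWithLang(nn.Module):
    """Hypothetical example: add a language embedding to the text-token embeddings."""

    def __init__(self, vocab_size: int, num_langs: int, dim: int):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, dim)
        self.lang_emb = nn.Embedding(num_langs, dim)

    def forward(self, text_tokens: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, T) phoneme/BPE ids, lang_id: (B,) language ids
        x = self.text_emb(text_tokens)                 # (B, T, D)
        x = x + self.lang_emb(lang_id).unsqueeze(1)    # broadcast over the time axis
        return x
```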

mononokehime14 commented 8 months ago

+1, same question! I'm also running training. Could we discuss how enroll_x_lens is computed when adding the language embedding in forward? @shanhaidexiamo, I just used x.shape[-1], but I'm not sure whether that's correct 😓
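
Not an authoritative answer, but one plausible reading is that enroll_x_lens should be the length of the enrolled (prompt) text tokens only, rather than the full concatenated text. A minimal sketch under that assumption, with made-up tensor shapes:

```python
import torch

# Hypothetical shapes: one utterance, 42 prompt phonemes, 87 target phonemes.
prompt_text_tokens = torch.randint(0, 512, (1, 42))
target_text_tokens = torch.randint(0, 512, (1, 87))

x = torch.cat([prompt_text_tokens, target_text_tokens], dim=-1)   # (1, 129)
x_lens = torch.tensor([x.shape[-1]])                               # total text length: 129
enroll_x_lens = torch.tensor([prompt_text_tokens.shape[-1]])       # prompt-only length: 42
```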

XinleiNIU commented 7 months ago

Hi,

Have you solved this problem? I'm also confused about the way to add the language embedding.

mononokehime14 commented 7 months ago

Kind of solved. I added the language embedding in training the same way as in inference. If you are not getting the correct training stats, it is probably because of the TextTokenizer: Plachtaa uses PhonemeBpeTokenizer, while lifeiteng uses a different one. In the dataset preparation stage, remember to add the language ID at both ends of the text prompt.
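
For illustration, a rough sketch of the dataset-preparation step described above: wrapping the text prompt with language-ID markers at both ends. The marker strings and function below are hypothetical placeholders, not the repo's actual tokenizer interface:

```python
# Hypothetical language-ID markers; the actual IDs depend on the tokenizer's vocabulary.
LANG_TOKENS = {"en": "[EN]", "zh": "[ZH]", "ja": "[JA]"}

def wrap_with_lang_id(text: str, lang: str) -> str:
    """Wrap a text prompt with the language ID at both ends, as described above."""
    tag = LANG_TOKENS[lang]
    return f"{tag}{text}{tag}"

# wrap_with_lang_id("hello world", "en") -> "[EN]hello world[EN]"
```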

XinleiNIU commented 7 months ago

> Kind of solved. I added the language embedding in training the same way as in inference. If you are not getting the correct training stats, it is probably because of the TextTokenizer: Plachtaa uses PhonemeBpeTokenizer, while lifeiteng uses a different one. In the dataset preparation stage, remember to add the language ID at both ends of the text prompt.

Thank you so much for sharing this!

AlexSteveChungAlvarez commented 2 months ago

Why are language embeddings being added to the phoneme embeddings instead of to the acoustic embeddings? As the paper says "Concretely, we embed language IDs into dense vectors and add them to the embeddings of acoustic tokens." in the last line of section 3.3. @Plachtaa

Plachtaa commented 2 months ago

> Why are language embeddings being added to the phoneme embeddings instead of to the acoustic embeddings? As the paper says "Concretely, we embed language IDs into dense vectors and add them to the embeddings of acoustic tokens." in the last line of section 3.3. @Plachtaa

Adding the language embedding to the acoustic tokens doesn't make sense at all; I tend to believe this is a typo in the paper.
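For comparison, a minimal sketch of what the paper's literal wording (language embedding added to the acoustic/codec token embeddings) would look like. All names below are hypothetical and this is not the repo's code:

```python
import torch
import torch.nn as nn

class AcousticEmbeddingWithLang(nn.Module):
    """Hypothetical example of the paper's literal wording: add the language
    embedding to the acoustic (codec) token embeddings."""

    def __init__(self, codebook_size: int, num_langs: int, dim: int):
        super().__init__()
        self.codec_emb = nn.Embedding(codebook_size, dim)
        self.lang_emb = nn.Embedding(num_langs, dim)

    def forward(self, codec_tokens: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # codec_tokens: (B, T) ids from one codec quantizer layer, lang_id: (B,)
        y = self.codec_emb(codec_tokens)               # (B, T, D)
        y = y + self.lang_emb(lang_id).unsqueeze(1)    # broadcast over the time axis
        return y
```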

AlexSteveChungAlvarez commented 2 months ago

> Adding the language embedding to the acoustic tokens doesn't make sense at all; I tend to believe this is a typo in the paper.

You mean the authors wrote it wrong? Can you explain why it doesn't make sense at all? I'm new to codes/codecs and embeddings, and I'm just learning from reading papers and watching videos. That's why I wanted to see how you implemented it, but when I couldn't find what the paper describes, I got confused.