ZhangXInFD / soundstorm-speechtokenizer

Implementation of SoundStorm built upon SpeechTokenizer.
MIT License
91 stars · 12 forks

Query regarding SoundStorm USLM implementation #1

Open rishikksh20 opened 10 months ago

rishikksh20 commented 10 months ago

@ZhangXInFD Did you simply replace the 'NAR' of USLM with a trained SoundStorm for the zero-shot TTS task? Although the quality of SoundStorm is much better, have you noticed any speed advantages when using SoundStorm compared to the original USLM?

rishikksh20 commented 10 months ago

By the way, thanks for the training code implementation.

ZhangXInFD commented 10 months ago

Thanks for your attention!
For the first question, yes, we just simply replace the 'NAR' of USLM for zero-shot TTS task. Compared to VALL-E, the stage2 of USLM can be viewed as a semantic -> acoustic process. Therefore, we can apply advanced semantic -> acoustic techniques like SoundStorm in the stage2 to enhance the audio generation quality. This is one of the advantages of SpeechTokenizer over SoundStream and Encodec. On the other hand, compared to the genuine semantic (like HuBERT, W2V-BERT) -> acoustic (like SoundStream, Encodec) process, Benefiting from information decoupling, SoundStorm requires fewer iterations when applied to SpeechTokenizer. In fact, in our experiments, a single iteration yielded quite satisfactory generation quality. For the second question, we have not evaluated time costs of 'NAR' and SoundStorm. But in theory, since SoundStorm also needs to generate tokens layer by layer in inference, its time complexity should be on the same order of magnitude as NAR. Moreover, if SoundStorm iterates multiple times when decoding the first layer (i.e., RVQ-2), then theoretically it would take more time than 'NAR'. In our experiments, SoundStorm only iterates 1 time when decoding RVQ-2. If SoundStorm were to generate all tokens at once, its time efficiency might be higher than NAR's. However, we have not yet evaluated the audio quality produced in this manner. Once the model is fully trained in the future, we might conduct related experiments. In fact, as of now, our SoundStorm hasn't been trained to its full potential, but the results are already quite promising. The biggest advantage of SoundStorm over 'NAR' lies in the quality of the audio generation.
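The iteration trade-off discussed above can be sketched as MaskGIT-style confidence-based decoding of one RVQ layer, the scheme SoundStorm builds on. This is a hedged, minimal illustration, not the repository's actual code: `predict` stands in for the conformer and returns a (token, confidence) pair per position, and with `n_iters=1` every position is committed in a single parallel pass, which is the cheap setting described in the reply.

```python
# Minimal sketch of MaskGIT-style iterative decoding for a single RVQ layer.
# All names (decode_layer, predict, MASK) are hypothetical, for illustration.
import math

MASK = -1  # sentinel for a not-yet-decoded position


def decode_layer(predict, length, n_iters):
    """Fill `length` positions over `n_iters` confidence-ranked iterations."""
    tokens = [MASK] * length
    for step in range(n_iters):
        # Predict every still-masked position in parallel.
        proposals = {i: predict(i, tokens)
                     for i, t in enumerate(tokens) if t == MASK}
        if step == n_iters - 1:
            # Final iteration: commit everything that remains masked.
            for i, (tok, _) in proposals.items():
                tokens[i] = tok
            break
        # Keep only the most confident fraction (cosine schedule);
        # the rest stay masked and are re-predicted next iteration.
        frac = 1 - math.cos(math.pi / 2 * (step + 1) / n_iters)
        keep = math.ceil(len(proposals) * frac)
        ranked = sorted(proposals.items(), key=lambda kv: -kv[1][1])
        for i, (tok, _) in ranked[:keep]:
            tokens[i] = tok
    return tokens
```

With `n_iters=1` the loop reduces to one forward pass per layer, so the per-layer cost matches a NAR step; extra iterations on RVQ-2 multiply that cost, which is the point made above.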

rishikksh20 commented 10 months ago

Yes, SoundStorm yields better quality due to its use of a Conformer; I don't expect any speed advantage either. When I get the time and resources, I will train SpeechTokenizer and USLM (SoundStorm) on the large LibriLight, MLS, and GigaSpeech datasets; I think that will yield production-level quality. Meanwhile, please share the SpeechTokenizer training code if possible, and please do share fully trained samples here.

ZhangXInFD commented 10 months ago

We will soon release a SpeechTokenizer trained on a larger dataset, but the open-sourcing of the training code might face some delays. The semantic distillation process during training required modifications to the relevant model code within fairseq, and organizing this code and deciding on the most suitable way to release it may take a significant amount of time. Given our other ongoing projects, we cannot currently estimate a timeline for the release of the training code. Some samples of voice conversion and unprompted generation are provided here.

lifeiteng commented 9 months ago

When will this model weight be released?

0417keito commented 8 months ago

When you replaced VALL-E's NAR with SoundStorm, did you adopt SoundStorm's mask strategy, or did you keep the original masking strategy unchanged?

ZhangXInFD commented 8 months ago

@0417keito We adopt SoundStorm's mask strategy.
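For readers unfamiliar with that strategy, the following is a hedged sketch of a SoundStorm-style training mask over RVQ levels, not the repository's implementation: pick one level to learn, mask a cosine-scheduled fraction of its positions, leave coarser levels fully visible, and hide all finer levels. The function name and layout are illustrative.

```python
# Hedged sketch of SoundStorm-style masking for training (illustrative names).
import math
import random


def soundstorm_mask(n_levels, length, rng=random):
    """Return (level, mask) where mask[q][t] is True if token (q, t) is masked."""
    level = rng.randrange(n_levels)              # RVQ level to learn this step
    p = math.cos(math.pi / 2 * rng.random())     # masking ratio, cosine schedule
    mask = [[False] * length for _ in range(n_levels)]
    for t in range(length):
        if rng.random() < p:
            mask[level][t] = True                # prediction targets at this level
    for q in range(level + 1, n_levels):
        mask[q] = [True] * length                # finer levels are fully hidden
    return level, mask
```

The key property is the conditioning structure: levels below the sampled one act as context, while everything finer is withheld, so each training step teaches the model one semantic-to-acoustic refinement stage.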