How to deal with the integer values of RVQ

ZhangXInFD / SpeechTokenizer

This is the code for the SpeechTokenizer presented in "SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models". Samples are presented on

Apache License 2.0

Hi author, I've been experimenting with encoding audio using your fantastic method, and I noticed that the RVQ (Residual Vector Quantization) values I obtain are integers, like the following: [image: values]

I'm curious if this is expected behavior. Additionally, I'm interested in using these encoded features for downstream tasks, but I'm unsure about how to adjust these integer values for training purposes. Would it be appropriate to apply normalization techniques such as min-max scaling or Z-Score normalization? The distribution of these encoded feature values is unknown to me, so I'm seeking guidance on how to handle them effectively for training.

Any advice or suggestions on how to deal with these encoded feature values would be greatly appreciated.

Thank you!
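
For context on the question above: RVQ output is normally a set of codebook indices, one integer per quantizer layer per frame, so the numbers act as category labels rather than magnitudes. Under that assumption, min-max or Z-score scaling would impose an ordering the indices do not have, and a common alternative is to look each index up in an embedding table and sum the per-layer vectors. The sketch below shows that idea with NumPy only. The sizes (`N_Q`, `CODEBOOK_SIZE`, `DIM`), the random tables, and the helper `embed_codes` are all hypothetical and chosen for illustration; they are not taken from the SpeechTokenizer code.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
N_Q = 8               # number of residual quantizer layers
CODEBOOK_SIZE = 1024  # number of entries per codebook
DIM = 16              # embedding dimension per code

rng = np.random.default_rng(0)

# One embedding table per RVQ layer. Random here; in a real model these
# would be trainable parameters (or the tokenizer's own codebooks).
embedding_tables = rng.normal(size=(N_Q, CODEBOOK_SIZE, DIM)).astype(np.float32)


def embed_codes(codes: np.ndarray, tables: np.ndarray) -> np.ndarray:
    """Map integer codes of shape (n_q, T) to features of shape (T, dim).

    Each code is used as a row index into its layer's table, and the
    per-layer vectors are summed, mirroring how residual layers add up.
    """
    n_q, num_frames = codes.shape
    out = np.zeros((num_frames, tables.shape[-1]), dtype=tables.dtype)
    for layer in range(n_q):
        out += tables[layer, codes[layer]]  # row lookup, not arithmetic on the index
    return out


# Stand-in for encoder output: integer indices in [0, CODEBOOK_SIZE).
codes = rng.integers(0, CODEBOOK_SIZE, size=(N_Q, 50))
features = embed_codes(codes, embedding_tables)
print(codes.dtype, features.shape)
```

The resulting `features` array is continuous and can be fed to a downstream model, with the tables learned jointly with that model if desired.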

How to deal with the integer values of RVQ #5