HaiFengZeng opened this issue 2 weeks ago
Hey @HaiFengZeng, couple of questions: (1) Which codec are you using? (2) What dataset or type of data are you training on?
The compression rate achieved with codec-bpe largely depends on these two factors. If the codec you are using already runs at a low bitrate, then you will get smaller gains from further compression (e.g., 1.3-1.5x instead of 2x). Also, if your dataset contains more diverse audio, such as thousands of different speakers and recording conditions, then you will need a much larger `vocab_size` to approach 2x compression.
I am planning to release some models with corresponding tokenizers within the next month or so. For now, I suggest increasing your `vocab_size` and also restricting the number of codec 4-grams (acoustic units) that can be loaded into the same token with `max_token_codebook_ngrams=1`. This causes the tokenizer to avoid merging longer code sequences, leading to wider coverage of your coarse codebooks within your `vocab_size` budget.
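For reference, something along these lines is what I have in mind (a minimal sketch only: the `train_codec_bpe_tokenizer` entry point and the other argument names are placeholders I'm using for illustration, so check the README for the exact API; `vocab_size` and `max_token_codebook_ngrams=1` are the two settings discussed above):

```python
# Hypothetical sketch of retraining the tokenizer with a larger vocab budget
# and single-ngram tokens. Only vocab_size and max_token_codebook_ngrams are
# the settings discussed above; the function and remaining argument names are
# placeholders, not necessarily the library's exact API.
from codec_bpe import train_codec_bpe_tokenizer  # placeholder name

tokenizer = train_codec_bpe_tokenizer(
    codes_path="path/to/encoded_audio",   # directory of extracted codec frames
    num_codebooks=4,
    codebook_size=1024,
    vocab_size=65536,                     # larger budget than 30k
    max_token_codebook_ngrams=1,          # one acoustic unit (4-gram) per token
)
tokenizer.save("codec_bpe_tokenizer.json")
```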
Some of the compression rates I've observed in testing:
Thanks, I'm using FunCodec with a 30k `vocab_size`; it's about 100 tokens/sec. I will try a larger vocab size and see if I can get better compression.
@HaiFengZeng I also got the same result with FunCodec: the ratio is 1, even after increasing the vocab size to 50k.
Oh, silly me, I forgot to use `tokenizer.encode()`. I just tested with my 5k dataset and the compression rate is 1.02x. I think it is mainly due to the size of my dataset; I will try increasing the dataset size.
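For anyone else checking their numbers, this is roughly how I'm measuring it now (a minimal sketch: the file paths are examples, and the `codes_to_chars` helper below is my own stand-in for whatever utility codec-bpe uses to map codes to unicode characters; in practice you must use the same mapping the tokenizer was trained with):

```python
# Rough sketch of measuring the compression rate of a trained codec-bpe
# tokenizer: flattened codec tokens vs. BPE tokens from tokenizer.encode().
import numpy as np
from tokenizers import Tokenizer

CODEBOOK_SIZE = 1024
UNICODE_OFFSET = 0x4E00  # assumed base codepoint; must match the training setup

def codes_to_chars(codes: np.ndarray) -> str:
    """Stand-in conversion: flatten (num_codebooks, num_frames) codes
    time-major and map each code to a unique unicode character."""
    num_codebooks, num_frames = codes.shape
    return "".join(
        chr(UNICODE_OFFSET + q * CODEBOOK_SIZE + int(codes[q, t]))
        for t in range(num_frames)
        for q in range(num_codebooks)
    )

tokenizer = Tokenizer.from_file("codec_bpe_tokenizer.json")  # example path
codes = np.load("example_codes.npy")                          # example file

num_flat_tokens = codes.size
num_bpe_tokens = len(tokenizer.encode(codes_to_chars(codes)).ids)

print(f"flattened codec tokens: {num_flat_tokens}")
print(f"codec-bpe tokens:       {num_bpe_tokens}")
print(f"compression rate:       {num_flat_tokens / num_bpe_tokens:.2f}x")
```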
@indiejoseph what dataset are you using? I can run some tests and see if I get the same results. 1.02x compression seems really low.
fsicoli/common_voice_19_0 zh-HK subset
Good work! I'm using 10M audio codes to train codec-bpe with `vocab_size=30k, num_codebook=4, codebook_size=1024`. After training I get a tokenizer, but I found the compression rate is nearly 1. For example, for a 4s audio at 25 Hz I get 400 audio codec tokens, and with codec-bpe I get nearly the same number of tokens (a bit over 350), which is barely above 1x, far from the claimed "this can yield savings of 2-5x in sequence length compared to directly modeling the flattened codebooks". I followed the steps in the README, so can you share your codec-bpe tokenizer based on EnCodec or some other codec?
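For reference, the way I'm counting the baseline here (just the arithmetic for the 4s example above):

```python
# Back-of-the-envelope numbers for the example above: 4 s of audio at a 25 Hz
# codec frame rate with 4 codebooks, compared with ~350 tokens after codec-bpe.
duration_s = 4
frame_rate_hz = 25
num_codebooks = 4

flattened_tokens = duration_s * frame_rate_hz * num_codebooks  # 4 * 25 * 4 = 400
codec_bpe_tokens = 350                                         # roughly what I observe

print(f"flattened codec tokens: {flattened_tokens}")                    # 400
print(f"compression rate: {flattened_tokens / codec_bpe_tokens:.2f}x")  # ~1.14x
```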