compression rate - Githubissues

HaiFengZeng commented 2 weeks ago

good work! I'm using 10M audio codecs to train codec-bpe with a vocab_size=30k,num_codebook=4,codebook_size=1024, after training I get a tokenizer and found the compression rate is nearly 1. For example, a 4s audio with 25hz, I get 400 audio codec tokens, and using codec-bpe I get about nearly the same token-size about more than 350 which is less then 1 ,this can yield savings of 2-5x in sequence length compared to directly modeling the flattened codebooks. I follow the steps in the readme, so can you share your codec-bpe tokenizer based on encodec or some other codec?

AbrahamSanders commented 2 weeks ago

Hey @HaiFengZeng, couple of questions: (1) Which codec are you using? (2) What dataset or type of data are you training on?

The compression rate achieved with codec-bpe is largely dependent on these two factors. If the codec you are using is already using a low bitrate then you will get smaller gains from further compression (e.g., 1.3-1.5x instead of 2x). Also, if your dataset contains more diverse audio such as thousands of different speakers and recording conditions, then you will need a much larger vocab_size to approach 2x compression.

I am planning to release some models with corresponding tokenizers within the next month or so. For now, I suggest increasing your vocab_size and also try restricting the number of codec 4-grams (acoustic units) that can be loaded into the same token using max_token_codebook_ngrams=1. This causes the tokenizer to avoid merging longer codes sequences, leading to a wider breadth of coverage for your coarse codebooks within your vocab_size budget.

Some of the compression rates I've observed in testing:

EnCodec 24khz on conversational speech data (the Fisher telephone corpus) with a vocab size of 80k and 2 codebooks yielded approximately 2x compression from 150 tok/sec to ~75 tok/sec
Kyutai Mimi codec on conversational speech data (also Fisher) with a vocab size of 30k and 4 codebooks yielded approximately 1.25x compression from 50 tok/sec to ~40 tok/sec
Kyutai Mimi codec on audiobook data (libri-light unlab-6k) with a vocab size of 80k and 4 codebooks yielded approximately 1.3x compression from 50 tok/sec to ~38 tok/sec

HaiFengZeng commented 2 weeks ago

Thks, I'm using funcodec, and 30k vocab_size, it's about 100tks/sec, I will try large vocab size and see if I can get better compression

indiejoseph commented 1 week ago

@HaiFengZeng I also got same result with FunCodec, the ratio is 1, even increased the vocab size to 50k.

indiejoseph commented 1 week ago

Oh silly me, I forgot use tokenizer.encode(), just tested with my 5k dataset, the compression rate is 1.02x, I think it is mainly due to the size of of my dataset, will try increase the dataset size.

AbrahamSanders commented 1 week ago

@indiejoseph what dataset are you using? I can run some tests and see if I get the same results. 1.02x compression seems really low.

indiejoseph commented 1 week ago

fsicoli/common_voice_19_0 zh-HK subset

AbrahamSanders / codec-bpe

compression rate #2