935963004 / LaBraM

ICLR 2024 spotlight

unused codebook #37

Open zeydabadi opened 3 weeks ago

zeydabadi commented 3 weeks ago

Hi,

Thank you for sharing your code. During the vqnsp training I noticed this message: "Unused code in codebook: 8191". Could you comment on what this indicates?

Thank you

935963004 commented 3 weeks ago

That is the number of codes in the codebook that are never used on the training data.
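For illustration only, here is a minimal sketch of how such a count can be computed from the code indices the quantizer selects. It is not taken from the LaBraM code; the function name `count_unused_codes` and the toy indices are hypothetical.

```python
import numpy as np

def count_unused_codes(indices, codebook_size):
    """Return how many codebook entries never appear in `indices`."""
    # Count how often each code id was selected; ids that never occur get 0.
    counts = np.bincount(np.asarray(indices).ravel(), minlength=codebook_size)
    return int((counts == 0).sum())

# Toy example (hypothetical data): 8 codes, only codes 0, 3 and 5 are selected.
indices = np.array([[0, 3, 3], [5, 0, 3]])
unused = count_unused_codes(indices, codebook_size=8)
print(f"Unused code in codebook: {unused}")
```

In this toy case codes 1, 2, 4, 6 and 7 are never selected, so the count is 5.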

zeydabadi commented 3 weeks ago

Thanks for your reply, but it's unclear to me what the implication of 8191 codes not being used would be. Can you please elaborate on that? Is it good or bad? Is there anything we can do about it?

935963004 commented 2 weeks ago

This is really bad, because most of the codes are never used. It might be attributable to limited data size. There are two suggested remedies: 1) increase the data size; 2) reduce the codebook size.
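As a rough illustration of why both remedies point the same way, the toy sketch below assigns random vectors to their nearest entry in a random, untrained codebook and counts the entries that are never chosen. It does not model VQ training dynamics or the LaBraM tokenizer; the function name, dimensions and sizes are all assumptions made for the example. It only shows the counting argument: the number of used codes can never exceed the number of tokens.

```python
import numpy as np

def unused_after_assignment(num_tokens, codebook_size, dim=16, seed=0):
    """Toy nearest-neighbour assignment; returns the number of unused codes."""
    rng = np.random.default_rng(seed)
    codebook = rng.normal(size=(codebook_size, dim))  # hypothetical, untrained codes
    tokens = rng.normal(size=(num_tokens, dim))       # hypothetical encoder outputs
    # Squared Euclidean distance between every token and every code.
    dists = (
        (tokens ** 2).sum(axis=1, keepdims=True)
        - 2.0 * tokens @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    nearest = dists.argmin(axis=1)                    # chosen code id per token
    return codebook_size - np.unique(nearest).size

# With a fixed amount of data, a larger codebook leaves more entries unused.
for k in (64, 512, 4096):
    print(k, unused_after_assignment(num_tokens=500, codebook_size=k))
```

With 500 tokens, a codebook of 4096 entries must leave at least 3596 of them unused, whatever the data look like.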

zeydabadi commented 2 weeks ago

Thank you very much for your insights. Could you please clarify if there is any linear or other relationship between the duration of the pre-training data (measured in hours) and the size of the codebook? In your paper, you noted that approximately 2500 hours of data were used for pre-training. If I were to use around 250 hours of data for pre-training, what codebook size would you recommend?