bojone / bytepiece

A purer Tokenizer with a higher compression rate
Apache License 2.0

The relationship between tokenizer compression rate and final model performance #15

Open nghuyong opened 6 months ago

nghuyong commented 6 months ago

The tokenizer evaluation section only reports the tokenizer's own metrics, such as compression rate.

However, a tokenizer with a higher compression rate does not necessarily yield a better model. Could you also report results at the level of the final model?

For example, the BLEU scores in the sentencepiece experiments:

https://github.com/google/sentencepiece/blob/master/doc/experiments.md#english-to-japanese
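(For reference, a minimal sketch of how compression rate could be measured as UTF-8 bytes per token over a sample corpus; the `encode` callable is an assumed text-to-token-ids interface, not a specific bytepiece API.)

```python
# Sketch: compression rate as UTF-8 bytes per token on a sample corpus.
# `encode` is an assumed interface (text -> list of token ids);
# substitute whichever tokenizer you want to compare.

def compression_rate(encode, texts):
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_bytes / total_tokens  # higher = more bytes packed per token

# Hypothetical usage:
# rate = compression_rate(my_tokenizer.encode, corpus_lines)
# print(f"bytes per token: {rate:.3f}")
```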

bojone commented 6 months ago

I don't have the compute to run this comparison experiment at the moment...

But going by the belief that compression is intelligence, a higher compression rate is equivalent to better performance (at least for LLMs).

nghuyong commented 6 months ago

For LLMs the relationship may genuinely not be positive. For example, the paper Getting the most out of your tokenizer for pre-training and domain adaptation makes a related point:

It is important to note that higher compression rates could also lead to deteriorated downstream performance, since shorter sequences give less effective FLOPs to a model to reason (Goyal et al., 2023). This is a consequence of the modern Transformer decoder architecture in which every token requires an additional forward pass to generate. Therefore even seemingly low-information tokens might still provide gains on downstream task. This is evidenced by Goyal et al. (2023), who propose Pause Tokens, special empty tokens added to the context to enable the model to 'pause' its reasoning and add FLOPs during inference.
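(A back-of-envelope sketch of the FLOPs argument above: in a decoder-only Transformer each generated token costs roughly one forward pass, so fewer tokens for the same text means fewer effective FLOPs spent on it. All numbers below are hypothetical.)

```python
# Back-of-envelope: each token costs roughly 2 * n_params FLOPs in a
# decoder-only Transformer (ignoring attention terms), so a higher-compression
# tokenizer leaves fewer forward passes for the same text.
# All numbers below are hypothetical.

n_params = 7e9                    # assumed 7B-parameter model
flops_per_token = 2 * n_params

text_bytes = 10_000               # same document under two tokenizers
tokens_a = text_bytes / 3.5       # tokenizer A: 3.5 bytes/token
tokens_b = text_bytes / 5.0       # tokenizer B: 5.0 bytes/token (higher compression)

print(f"A: {tokens_a:.0f} tokens, {tokens_a * flops_per_token:.2e} FLOPs")
print(f"B: {tokens_b:.0f} tokens, {tokens_b * flops_per_token:.2e} FLOPs")
# B sees the same text in ~30% fewer forward passes, which is the
# "less effective FLOPs to reason" concern from the quote above.
```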

Doesn't the belief that "compression is intelligence" refer to the model's ability to compress information, which is not equivalent to the tokenizer's compression rate?