IST-DASLab / gptq

Code for the ICLR 2023 paper "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers".
https://arxiv.org/abs/2210.17323
Apache License 2.0

Why are PPL so low on PTB? #4

Closed EliottZemour closed 1 year ago

EliottZemour commented 1 year ago

Hello,

Many thanks for your work, it's great to (finally) see results reported on openly available LLMs 😊 However, I was surprised by the perplexities reported on PTB for the OPT and BLOOM models: 10.33 and 13.63, respectively. The GPT-3 paper reports a PPL of 20.50 on this dataset, and I was wondering whether you have any explanation for this (nearly 2x) difference?

Thanks!

efrantar commented 1 year ago

Hi,

our perplexity calculation is based on this HuggingFace link and works as follows: we concatenate all dataset samples with basic separators, split the resulting string into non-overlapping segments of 2048 tokens (the maximum sequence length of both OPT and BLOOM), and evaluate the average causal-LM loss on those segments; the exponentiated result is the perplexity. We believe this is pretty standard.
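For reference, the computation corresponds roughly to the following sketch (assuming a HuggingFace causal LM; `model`, `tokenizer`, and `texts` are illustrative names, not the repository's actual variables):

```python
import torch

@torch.no_grad()
def eval_ppl(model, tokenizer, texts, seqlen=2048, sep="\n\n"):
    # Concatenate all samples with a separator and tokenize the result once.
    ids = tokenizer(sep.join(texts), return_tensors="pt").input_ids.to(model.device)
    n_segments = ids.shape[1] // seqlen  # non-overlapping 2048-token segments

    nlls = []
    for i in range(n_segments):
        segment = ids[:, i * seqlen:(i + 1) * seqlen]
        # Passing labels makes the model return the average causal-LM loss
        # over the segment (labels are shifted internally by one token).
        loss = model(segment, labels=segment).loss
        nlls.append(loss.float() * seqlen)

    # Perplexity is the exponentiated average negative log-likelihood.
    return torch.exp(torch.stack(nlls).sum() / (n_segments * seqlen)).item()
```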

For PTB, we use this version of the dataset and proceed exactly as explained above, without any extra preprocessing. One partial explanation for our lower PPL numbers may be that we insert separators between samples (following exactly the HuggingFace link mentioned above); for most datasets, like WikiText, the impact of this is very minor, but since PTB is split per-sentence without punctuation, inserting separators between sentences appears to help models significantly. After disabling this by changing "\n\n" to " " at:

https://github.com/IST-DASLab/gptq/blob/9232a476641b1848cf720dd15bd0e616cd48702d/datautils.py#L40

the PPL for FP16 OPT-175B increases from 10.33 to 13.04. Beyond that, we are not really sure what differences there might be or if OPT is just more accurate on this dataset (on a related note, the ZeroQuant paper reports 20.47 PPL on PTB for a 6B model). If you are aware of any PTB evaluation details or a reliable reference implementation, we would be happy to learn about them.
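Concretely, the separator swap corresponds to something like the following simplified sketch of our PTB loading (the dataset id and field name follow the HuggingFace hub version linked above, but treat the snippet as an illustration rather than the exact repository code):

```python
from datasets import load_dataset

def get_ptb_test_text(sep="\n\n"):
    # PTB is stored per-sentence without punctuation, so the separator choice
    # ("\n\n" vs. " ") noticeably changes the measured perplexity.
    test = load_dataset("ptb_text_only", "penn_treebank", split="test")
    return sep.join(test["sentence"])
```

The resulting string is then tokenized and split into 2048-token segments exactly as described above.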

Lastly, we would like to emphasize that the focus of our work is on the accuracy drop relative to the FP16 version (rather than absolute PPL values) and that we always evaluate all FP16 and quantized models using exactly the same code (in this repository) to ensure a fair comparison.