IST-DASLab / gptq

Code for the ICLR 2023 paper "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers".
https://arxiv.org/abs/2210.17323
Apache License 2.0

Why are PPL so low on PTB? #4

Closed EliottZemour closed 1 year ago

EliottZemour commented 1 year ago

Hello,

Many thanks for your work, it's great to (finally) see results reported on openly available LLMs 😊 However, I was surprised by the perplexities reported on PTB for the OPT and BLOOM models: 10.33 and 13.63, respectively. The GPT-3 paper reports a PPL of 20.50 on this dataset, and I was wondering whether you have any explanation for this (nearly 2x) difference?

Thanks!

efrantar commented 1 year ago

Hi,

our perplexity calculation is based on this HuggingFace link and works as follows: we concatenate all dataset samples with basic separators, split the resulting string into non-overlapping segments of 2048 tokens (the maximum sequence length of both OPT and BLOOM), and evaluate the average causal-LM loss on those segments; the exponentiated result is the perplexity. We believe this is pretty standard.
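For reference, the computation corresponds roughly to the following sketch (assuming a HuggingFace causal LM; `model`, `tokenizer`, and `texts` are illustrative names, not the repository's actual variables):

```python
import torch

@torch.no_grad()
def eval_ppl(model, tokenizer, texts, seqlen=2048, sep="\n\n"):
    # Concatenate all samples with a separator and tokenize the result once.
    ids = tokenizer(sep.join(texts), return_tensors="pt").input_ids.to(model.device)
    n_segments = ids.shape[1] // seqlen  # non-overlapping 2048-token segments

    nlls = []
    for i in range(n_segments):
        segment = ids[:, i * seqlen:(i + 1) * seqlen]
        # Passing labels makes the model return the average causal-LM loss
        # over the segment (labels are shifted internally by one token).
        loss = model(segment, labels=segment).loss
        nlls.append(loss.float() * seqlen)

    # Perplexity is the exponentiated average negative log-likelihood.
    return torch.exp(torch.stack(nlls).sum() / (n_segments * seqlen)).item()
```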

For PTB, we use this version of the dataset and proceed exactly as explained above, without any extra preprocessing. One partial explanation for our lower PPL numbers may be that we insert separators between samples (following exactly the HuggingFace link mentioned above); for most datasets, like WikiText, the impact of this is very minor, but since PTB is split per-sentence without punctuation, inserting separators between sentences appears to help models significantly. After disabling this by changing "\n\n" to " " at:

https://github.com/IST-DASLab/gptq/blob/9232a476641b1848cf720dd15bd0e616cd48702d/datautils.py#L40

the PPL for FP16 OPT-175B increases from 10.33 to 13.04. Beyond that, we are not really sure what differences there might be or if OPT is just more accurate on this dataset (on a related note, the ZeroQuant paper reports 20.47 PPL on PTB for a 6B model). If you are aware of any PTB evaluation details or a reliable reference implementation, we would be happy to learn about them.
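Concretely, the separator swap corresponds to something like the following simplified sketch of our PTB loading (the dataset id and field name follow the HuggingFace hub version linked above, but treat the snippet as an illustration rather than the exact repository code):

```python
from datasets import load_dataset

def get_ptb_test_text(sep="\n\n"):
    # PTB is stored per-sentence without punctuation, so the separator choice
    # ("\n\n" vs. " ") noticeably changes the measured perplexity.
    test = load_dataset("ptb_text_only", "penn_treebank", split="test")
    return sep.join(test["sentence"])
```

The resulting string is then tokenized and split into 2048-token segments exactly as described above.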

Lastly, we would like to emphasize that the focus of our work is on the accuracy drop relative to the FP16 version (rather than absolute PPL values) and that we always evaluate all FP16 and quantized models using exactly the same code (in this repository) to ensure a fair comparison.