The wikitext2 perplexity calculation method is based on the Hugging Face article on perplexity of fixed-length models.
It is calculated with a window size of `max_seq_length = 4096` tokens. At each step, the window shifts by `stride = 512` tokens, and its first `max_seq_length - stride` tokens are treated as context tokens. This means their logits are not taken into account, allowing this rolling perplexity to be calculated without overlap (each token is scored exactly once).
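For reference, here is a minimal sketch of this strided evaluation loop, assuming the standard `transformers`/`datasets` APIs and the `meta-llama/Llama-2-7b-hf` checkpoint; the actual `run_wikitext-2_benchmark.py` script may differ in its details:

```python
# Minimal sketch of the rolling (strided) perplexity described above.
# Assumptions: the model id and dataset split are illustrative;
# run_wikitext-2_benchmark.py may implement the details differently.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.to(device).eval()

# Tokenize the whole wikitext-2 test split as one long sequence.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
seq_len = encodings.input_ids.size(1)

max_seq_length = 4096
stride = 512

nll_sum = 0.0
n_tokens = 0
prev_end = 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_seq_length, seq_len)
    trg_len = end - prev_end  # only the newly covered tokens are scored
    window = encodings.input_ids[:, begin:end].to(device)
    targets = window.clone()
    # Mask the context tokens so their logits do not contribute to the loss.
    targets[:, :-trg_len] = -100

    with torch.no_grad():
        # loss is the mean NLL over the unmasked target tokens
        loss = model(window, labels=targets).loss

    # Weight by trg_len to accumulate a token-level average
    # (this ignores the one-token label shift, as an approximation).
    nll_sum += loss.float() * trg_len
    n_tokens += trg_len

    prev_end = end
    if end == seq_len:
        break

ppl = torch.exp(nll_sum / n_tokens)
print(f"wikitext-2 perplexity: {ppl.item():.2f}")
```

With `stride = 512` and `max_seq_length = 4096`, every token after the first window is predicted with at least 3584 tokens of context, and each token contributes to the loss exactly once.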
I benchmarked llama2-7B with this config. By running `python3 run_wikitext-2_benchmark.py -config` with `fp16` precision I got a perplexity of 5.02, and with `fp16` precision I got a perplexity of 5.15. This is close to the score reported here: https://github.com/ggerganov/llama.cpp/discussions/2352