The wikitext2 perplexity calculation method is based on the Hugging Face article on perplexity of fixed-length models.
It is calculated with a window size of `max_seq_length = 4096` tokens. At each step, the window shifts by `stride = 512` tokens, and its first `max_seq_length - stride` tokens are treated as context tokens. This means their logits are not taken into account, allowing this rolling perplexity to be calculated without overlap (each token is scored exactly once).
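For reference, here is a minimal sketch of this strided evaluation loop, assuming the standard `transformers`/`datasets` APIs and the `meta-llama/Llama-2-7b-hf` checkpoint; the actual `run_wikitext-2_benchmark.py` script may differ in its details:

```python
# Minimal sketch of the rolling (strided) perplexity described above.
# Assumptions: the model id and dataset split are illustrative;
# run_wikitext-2_benchmark.py may implement the details differently.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.to(device).eval()

# Tokenize the whole wikitext-2 test split as one long sequence.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
seq_len = encodings.input_ids.size(1)

max_seq_length = 4096
stride = 512

nll_sum = 0.0
n_tokens = 0
prev_end = 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_seq_length, seq_len)
    trg_len = end - prev_end  # only the newly covered tokens are scored
    window = encodings.input_ids[:, begin:end].to(device)
    targets = window.clone()
    # Mask the context tokens so their logits do not contribute to the loss.
    targets[:, :-trg_len] = -100

    with torch.no_grad():
        # loss is the mean NLL over the unmasked target tokens
        loss = model(window, labels=targets).loss

    # Weight by trg_len to accumulate a token-level average
    # (this ignores the one-token label shift, as an approximation).
    nll_sum += loss.float() * trg_len
    n_tokens += trg_len

    prev_end = end
    if end == seq_len:
        break

ppl = torch.exp(nll_sum / n_tokens)
print(f"wikitext-2 perplexity: {ppl.item():.2f}")
```

With `stride = 512` and `max_seq_length = 4096`, every token after the first window is predicted with at least 3584 tokens of context, and each token contributes to the loss exactly once.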
I benchmarked llama2-7B with this config. By running `python3 run_wikitext-2_benchmark.py -config` with `fp16` precision I got a perplexity of 5.02, and with `fp16` precision I got a perplexity of 5.15. This is close to the score reported here: https://github.com/ggerganov/llama.cpp/discussions/2352