HKUNLP / ChunkLlama

[ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models"
Apache License 2.0

Issue of CLEX results #5

Closed · guanzhchen closed this issue 7 months ago

guanzhchen commented 7 months ago

Hi, it's very surprising to see that LLaMA can be extrapolated to 32k without training. Awesome work!

I'm the first author of CLEX. In Table 1, I noticed that CLEX-16K's PPL is at a higher level than the other methods (lossless extrapolation anyway...). So I re-evaluated CLEX-7B-16K locally following your settings (evaluation on the PG19 validation set with a sliding window of 256), and it performs normally (see the table below).

| Model | 4k | 8k | 16k | 32k | 64k |
|---|---|---|---|---|---|
| CLEX-7B-16K (in your paper) | 16.74 | 15.08 | 14.28 | 14.70 | 15.10 |
| CLEX-7B-16K (our evaluation) | 8.84 | 7.66 | 7.43 | 7.57 | 8.73 |
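
For context, a sliding-window PPL evaluation of this kind can be sketched as follows. This is a minimal illustration, not the exact script used in either paper; the model id and the input text are placeholders:

```python
# Minimal sketch of sliding-window perplexity evaluation (stride 256).
# Not the exact evaluation script; the model id below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/CLEX-7B-16K"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def sliding_window_ppl(text: str, max_length: int = 16384, stride: int = 256) -> float:
    """Perplexity with an evaluation window of `max_length` tokens,
    sliding forward `stride` tokens at a time and scoring only the new tokens."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    seq_len = input_ids.size(1)
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end          # number of tokens scored in this window
        chunk = input_ids[:, begin:end]
        labels = chunk.clone()
        labels[:, :-trg_len] = -100       # earlier tokens serve as context only
        out = model(chunk, labels=labels)
        n_valid = (labels[:, 1:] != -100).sum().item()  # loss is over shifted labels
        nll_sum += out.loss.item() * n_valid
        n_scored += n_valid
        prev_end = end
        if end == seq_len:
            break
    return float(torch.exp(torch.tensor(nll_sum / n_scored)))
```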

After double-checking the models, we found there may have been an issue when uploading the CLEX-7B-16K checkpoint to Hugging Face. (Honestly, we don't know why; we may have uploaded a wrong version.) We have updated the checkpoint on Hugging Face, so feel free to re-evaluate it.
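
For anyone re-running the evaluation, note that a previously cached copy of the old checkpoint could be picked up silently. A hedged example of forcing a fresh download (the repo id is a placeholder):

```python
# Force a fresh download so the updated checkpoint is used
# instead of a stale cached copy. The repo id is a placeholder.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/CLEX-7B-16K",  # placeholder repo id
    force_download=True,    # re-fetch files even if an older version is cached
)
```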

Note that the updated checkpoint uses exactly the setting from our paper, without any additional tricks (e.g., training on PG19). The checkpoint issue does not affect CLEX-7B-Chat-16K (see the results below), which we uploaded months ago.

| Model | 4k | 8k | 16k | 32k | 64k |
|---|---|---|---|---|---|
| CLEX-7B-Chat-16K | 8.65 | 8.58 | 8.28 | 8.28 | 8.76 |

Sorry for the mistake and thank you for helping us figure it out! I would be grateful if you could consider re-evaluating and updating the results in the next version of your paper, and feel free to contact me if you have any concerns.

ChenxinAn-fdu commented 7 months ago

Thank you for bringing this to our attention! CLEX is indeed excellent work! We will update our paper with the new results in our next release or the camera-ready version.