HKUNLP / ChunkLlama

[ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models"
Apache License 2.0

Issue of CLEX results #5

Closed · guanzhchen closed this issue 7 months ago

guanzhchen commented 7 months ago

Hi, it's very surprising to see that LLaMA can be extrapolated to 32k without training. Awesome work!

I'm the first author of CLEX. In Table 1, I noticed that CLEX-16K's PPL is at a higher level than the other methods (lossless extrapolation anyway...). So I re-evaluated CLEX-7B-16K locally following your settings (evaluation on the PG19 validation set with a sliding window of 256), and it performs normally (see the table below).

| Model | 4k | 8k | 16k | 32k | 64k |
|---|---|---|---|---|---|
| CLEX-7B-16K (in your paper) | 16.74 | 15.08 | 14.28 | 14.70 | 15.10 |
| CLEX-7B-16K (our evaluation) | 8.84 | 7.66 | 7.43 | 7.57 | 8.73 |
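
For context, a sliding-window PPL evaluation of this kind can be sketched as follows. This is a minimal illustration, not the exact script used in either paper; the model id and the input text are placeholders:

```python
# Minimal sketch of sliding-window perplexity evaluation (stride 256).
# Not the exact evaluation script; the model id below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/CLEX-7B-16K"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def sliding_window_ppl(text: str, max_length: int = 16384, stride: int = 256) -> float:
    """Perplexity with an evaluation window of `max_length` tokens,
    sliding forward `stride` tokens at a time and scoring only the new tokens."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    seq_len = input_ids.size(1)
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end          # number of tokens scored in this window
        chunk = input_ids[:, begin:end]
        labels = chunk.clone()
        labels[:, :-trg_len] = -100       # earlier tokens serve as context only
        out = model(chunk, labels=labels)
        n_valid = (labels[:, 1:] != -100).sum().item()  # loss is over shifted labels
        nll_sum += out.loss.item() * n_valid
        n_scored += n_valid
        prev_end = end
        if end == seq_len:
            break
    return float(torch.exp(torch.tensor(nll_sum / n_scored)))
```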

After double-checking the models, we found there may have been an issue when uploading the CLEX-7B-16K checkpoint to Hugging Face. (Honestly, we don't know why; we may have uploaded a wrong version.) We have updated the checkpoint on Hugging Face, so feel free to re-evaluate it.
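
For anyone re-running the evaluation, note that a previously cached copy of the old checkpoint could be picked up silently. A hedged example of forcing a fresh download (the repo id is a placeholder):

```python
# Force a fresh download so the updated checkpoint is used
# instead of a stale cached copy. The repo id is a placeholder.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/CLEX-7B-16K",  # placeholder repo id
    force_download=True,    # re-fetch files even if an older version is cached
)
```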

Note that the updated checkpoint uses exactly the setting from our paper, without any additional tricks (e.g., training on PG19). The checkpoint issue does not affect CLEX-7B-Chat-16K (see the results below), which we uploaded months ago.

| Model | 4k | 8k | 16k | 32k | 64k |
|---|---|---|---|---|---|
| CLEX-7B-Chat-16K | 8.65 | 8.58 | 8.28 | 8.28 | 8.76 |

Sorry for the mistake and thank you for helping us figure it out! I would be grateful if you could consider re-evaluating and updating the results in the next version of your paper, and feel free to contact me if you have any concerns.

ChenxinAn-fdu commented 7 months ago

Thank you for bringing this to our attention! CLEX is indeed excellent work! We will update our paper with the new results in our next release or the camera-ready version.