Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.
https://lightning.ai
Apache License 2.0

Fix `litgpt evaluate` not using the local checkpoint #1357

awaelchli closed this pull request 3 weeks ago

awaelchli commented 3 weeks ago

Fixes #1349

I found that the `litgpt evaluate` command ignores the provided checkpoint directory and silently downloads the model from HF. Since download speeds in Studios are so fast, I didn't notice this. The only hint that it was happening was #1349, which I first interpreted as the command merely wanting to download a missing config file. Later, when I looked at the benchmark numbers and saw that LoRA, QLoRA, and full finetuning all returned the same eval results, I was led down this rabbit hole.

The fix is to correctly pass the pretrained checkpoint file to the HFLM class. Tied to this is the problem that the Hugging Face state-dict loader forces `weights_only=True`, which our checkpoints don't support because the incremental saver serializes them with pickle. So I also had to include a workaround for this.
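The workaround pattern can be sketched roughly like this. Note this is a hypothetical illustration, not the actual litgpt code: `force_full_unpickling` and the stand-in `fake_torch_load` are made-up names. The real fix would wrap `torch.load` for the duration of the Hugging Face loading call so that the forced `weights_only=True` is overridden back to `False`.

```python
import contextlib
import types
from functools import wraps


def fake_torch_load(path, weights_only=True):
    # Stand-in for torch.load so the sketch runs without torch installed;
    # the real function would deserialize the checkpoint from disk.
    return {"path": path, "weights_only": weights_only}


@contextlib.contextmanager
def force_full_unpickling(module, name="load"):
    """Temporarily patch ``module.<name>`` so every call runs with
    ``weights_only=False`` (hypothetical helper; pickled litgpt
    checkpoints cannot be loaded with ``weights_only=True``)."""
    original = getattr(module, name)

    @wraps(original)
    def patched(*args, **kwargs):
        kwargs["weights_only"] = False  # override the forced setting
        return original(*args, **kwargs)

    setattr(module, name, patched)
    try:
        yield
    finally:
        # Always restore the original loader, even on error.
        setattr(module, name, original)


# Usage: patch a stand-in "torch" namespace around the loading call.
fake_torch = types.SimpleNamespace(load=fake_torch_load)
with force_full_unpickling(fake_torch):
    result = fake_torch.load("checkpoints/model.pth")
# Inside the context, weights_only was forced back to False.
```

The context-manager form keeps the patch scoped to the single loading call, so the rest of the process still gets the safer default behavior.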