Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

Error loading converted litgpt checkpoints in `pytorch_model.bin` format using huggingface `AutoModelForCausalLM` #1086

Open · jwkirchenbauer opened this issue 7 months ago

jwkirchenbauer commented 7 months ago

Hi, we're using the litgpt framework to train models, and we'd then like to export them to Hugging Face format for continued tuning and evaluation.

The steps we're using after completing training are:

  1. scripts/convert_pretrained_checkpoint.py to "finalize" the model
  2. scripts/convert_lit_checkpoint.py to conform it to the huggingface saved model format
  3. Load using transformers.AutoModelForCausalLM.from_pretrained("/path/to/converted/checkpoint/dir")

The actual load in step 3 throws an error because transformers internally calls torch.load(checkpoint_file, weights_only=True) when it sees that no safetensors-format checkpoint is available: transformers/modeling_utils.py#L529-L535

This can be bypassed by setting weights_only=False, but that's not the desired solution. Rather, it would be great if there were a way to export a trained litgpt model directly to the model.safetensors format instead of the pytorch_model.bin file format. What do you think?

I couldn't find any mention of this hiccup within litgpt, or really anywhere else; the only safetensors-related code here is on the scripts/download.py side, for bringing HF safetensors-format models into litgpt.
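
For concreteness, a rough sketch of the bypass plus a manual re-save to safetensors (paths are placeholders, and weights_only=False should only be used on a checkpoint you produced yourself):

```python
import torch
from safetensors.torch import save_file

# Load the converted litgpt checkpoint with the unsafe unpickler; this is
# only acceptable because we created this file ourselves.
state_dict = torch.load(
    "checkpoint_dir/pytorch_model.bin",  # placeholder path
    map_location="cpu",
    weights_only=False,
)

# safetensors requires contiguous tensors, so normalize before saving.
state_dict = {k: v.contiguous() for k, v in state_dict.items()}
save_file(state_dict, "checkpoint_dir/model.safetensors")
```

With a model.safetensors file present, AutoModelForCausalLM.from_pretrained should pick it up and skip the torch.load fallback entirely.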

rasbt commented 7 months ago

I think exporting to .safetensors would be nice in the future. In the meantime, to address your issue, you could load it via state_dicts -- I recently wanted to try something similar and shared the approach in the tutorial here (scroll to the very bottom): https://github.com/Lightning-AI/litgpt/blob/main/tutorials/convert_lit_models.md#a-finetuning-and-conversion-tutorial
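
Roughly, the idea from the tutorial looks like this (the model name and checkpoint path are placeholders for whatever architecture you trained):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the converted weights as a plain state dict (placeholder path).
state_dict = torch.load("out/converted/model.pth", map_location="cpu")

# Instantiate the matching HF architecture and inject the weights.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-160m",  # placeholder: must match your trained config
    state_dict=state_dict,
)
```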

carmocca commented 7 months ago

Hi! weights_only=True shouldn't have anything to do with safetensors. Can you share the precise error that you get? There should only be weights and primitives in the state dict.
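
If you want to double-check that, something like this would flag any entry that isn't a tensor or a primitive (placeholder path; weights_only=False is only safe for a file you created yourself):

```python
import torch

# Inspect the checkpoint contents with the unsafe unpickler (own file only).
sd = torch.load("checkpoint_dir/pytorch_model.bin", map_location="cpu", weights_only=False)

# Anything outside tensors and basic primitives is suspect.
suspicious = {
    k: type(v).__name__
    for k, v in sd.items()
    if not (torch.is_tensor(v) or isinstance(v, (int, float, bool, str, type(None))))
}
print(suspicious or "only tensors and primitives")
```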

jwkirchenbauer commented 7 months ago

Thanks for the interim sol'n @rasbt, I'll try that out!

@carmocca So this is the stack trace from the failing torch.load operation inside the transformers loading logic linked above:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "[omitted]/python3.11/site-packages/torch/serialization.py", line 1013, in load
    raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution.Do it only if you get the file from a trusted source. WeightsUnpickler error: Unsupported operand 149

(It's not a "safetensors issue"; I was just noting that their control flow falls back to this loading variant if a model.safetensors file can't be found at the provided path.)

carmocca commented 7 months ago

So we need to find out what's causing the "Unsupported operand 149" to know whether litgpt is saving something it shouldn't. Would it be possible for you to share this checkpoint? You can omit the tensor data if that's proprietary or private.
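
One way to narrow it down without executing the pickle at all is to dump the opcodes with pickletools and look for anything beyond the usual torch rebuild machinery (a sketch; it assumes a zip-format torch checkpoint, and the path is a placeholder):

```python
import pickletools
import zipfile

# torch >= 1.6 checkpoints are zip archives; data.pkl inside records every
# object the unpickler would reconstruct, without running any of it.
with zipfile.ZipFile("checkpoint_dir/pytorch_model.bin") as zf:
    pkl_name = next(n for n in zf.namelist() if n.endswith("data.pkl"))
    pickletools.dis(zf.read(pkl_name))  # check GLOBAL opcodes for unexpected classes
```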

eljanmahammadli commented 7 months ago

@rasbt can you please also advise on #1095? It's essentially a similar problem, but your approach wouldn't work for me since I have a different config (e.g., n_layer, n_head, n_embd).

ch0pp3rVirus commented 5 months ago

Got the same problem. After converting a finetuned model (QLoRA) from litgpt to HF format, loading the HF-format model gives this error:

Weights only load failed. Re-running torch.load with weights_only set to False will likely succeed, but it can result in arbitrary code execution.Do it only if you get the file from a trusted source. WeightsUnpickler error: Unsupported operand 149

But this only happens with transformers versions higher than 4.36.0. When I use version 4.34.1, the converted HF-format model loads normally.

My finetuned model: Codellama-7b-hf-instruct

If needed, I could share the finetuned checkpoint.

skrbnv commented 2 weeks ago

Just had the same error but with a different repo. My checkpoint was saved using pickle but loaded with torch.load(..., weights_only=True). Maybe it'll be useful.