Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

On converting checkpoints to weights #830

Open carmocca opened 7 months ago

carmocca commented 7 months ago

Motivation

Whenever a script wants to load model weights, there are several variations that could be loaded, depending on which script produced them (a rough sketch of how the variants differ in structure follows the list):

  1. A lit model weights file, lit_model.pth. This is the output of scripts/convert_hf_checkpoint.py.
  2. A Fabric weights-only checkpoint. This is the output of finetune/*.py. It will include the lit model checkpoint under the model key. Example: https://github.com/Lightning-AI/lit-gpt/blob/main/finetune/lora.py#L310-L312
  3. A Fabric training checkpoint. This is the output of pretrained/*.py. It will include the lit model checkpoint under the model key plus extra training state (e.g. optimizer state).
  4. A Trainer checkpoint. This is the output of pretrained/openwebtext_trainer.py, the only script using the Trainer.
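
The variants mostly differ in how the weights are nested inside the saved file. A minimal sketch of telling them apart when loaded (the exact key names for (3) and (4) are assumptions based on the descriptions above):

```python
import torch

def describe_checkpoint(path: str) -> str:
    """Rough classification of the four variants; the exact key names are assumptions."""
    ckpt = torch.load(path, map_location="cpu")
    if "state_dict" in ckpt:                      # (4) Trainer checkpoint
        return "Trainer checkpoint"
    if "model" in ckpt and "optimizer" in ckpt:   # (3) Fabric training checkpoint
        return "Fabric training checkpoint"
    if "model" in ckpt:                           # (2) Fabric weights-only checkpoint
        return "Fabric weights-only checkpoint"
    return "plain lit_model.pth state dict"       # (1)
```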

Most of our scripts support loading (1) and (2): https://github.com/search?q=repo%3ALightning-AI%2Flit-gpt%20.get(%22model&type=code

https://github.com/Lightning-AI/lit-gpt/pull/803 added support for loading (3) after a conversion step.

Currently, (4) cannot be loaded anywhere other than the pretraining script itself.

Pitch

This issue suggests unifying these cases by having a single interface to "process" checkpoints and checkpoint_dirs. There are two ways to do it:

With a previous conversion step:

Roughly:

python convert_checkpoint.py out/foobar/ converted_checkpoint/
python generate/base.py --checkpoint_dir converted_checkpoint/

Cons:

  - It will create a duplicate version of the weights. This can be very annoying for large checkpoints in environments with limited disk size such as cloud instances.

With an in-memory conversion function:

Inside generate/base.py, we call

from lit_gpt.utils import get_weights_from

state_dict = get_weights_from(ckpt_dir)
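
A minimal sketch of what such a helper could look like, assuming the wrapping keys described above (the file name and the Trainer-checkpoint handling are assumptions, not a final design):

```python
from pathlib import Path

import torch

def get_weights_from(checkpoint_dir: Path) -> dict:
    """Return a plain lit model state dict no matter which script produced the file."""
    checkpoint = torch.load(checkpoint_dir / "lit_model.pth", map_location="cpu")
    if "state_dict" in checkpoint:  # Trainer checkpoint (4); keys are typically prefixed with "model."
        return {k.removeprefix("model."): v for k, v in checkpoint["state_dict"].items()}
    return checkpoint.get("model", checkpoint)  # Fabric checkpoints (2)/(3) or a plain state dict (1)
```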

Cons:

  - Some users might prefer to have a clean checkpoint_dir to read from.

What about configs?

If we implement #483, all weights should have a config file beside them so that the config can be carried over.

What about the tokenizer vocabulary?

This will need to be copied over manually, unless we choose to carry it over as we do with the configs.

Tutorials such as https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/finetune_lora.md#merging-lora-weights already point out the need for this cp step.
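
For reference, the manual step essentially amounts to copying the tokenizer files next to the converted weights; a sketch (the file names listed are the typical ones, not guaranteed):

```python
import shutil
from pathlib import Path

def copy_tokenizer(source_dir: Path, target_dir: Path) -> None:
    """Copy whichever tokenizer files exist so downstream scripts find them in checkpoint_dir."""
    for name in ("tokenizer.json", "tokenizer.model", "tokenizer_config.json"):
        if (source_dir / name).is_file():
            shutil.copy2(source_dir / name, target_dir / name)
```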

carmocca commented 7 months ago

cc @Andrei-Aksionov @awaelchli @lantiga @rasbt @JustinGoheen for UX suggestions and preferences

My personal preference is the in-memory conversion step.

Andrei-Aksionov commented 7 months ago

I like in-memory conversion.

Some users might prefer to have a clean checkpoint_dir to read from

In that case, we can have a conversion map (like we have right now) and use it both inside the in-memory conversion and in a conversion script. So basically have both options, but use the in-memory option by default.
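
Roughly, such a shared piece could look like this (just a sketch, all names are made up):

```python
from pathlib import Path

import torch

def extract_lit_weights(checkpoint: dict) -> dict:
    """Single source of truth for unwrapping a checkpoint into plain weights."""
    return checkpoint.get("model", checkpoint)

def get_weights_from(path: Path) -> dict:
    """In-memory option: used directly by the generate/finetune scripts."""
    return extract_lit_weights(torch.load(path, map_location="cpu"))

def convert_checkpoint(src: Path, dst: Path) -> None:
    """Script option: writes a clean weights-only file for users who prefer a clean checkpoint_dir."""
    torch.save(extract_lit_weights(torch.load(src, map_location="cpu")), dst)
```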

If we stick with the in-memory option and a user wants to save the whole model, in which format will it be saved: HF or lit? If HF, that would ease the process of uploading weights to the Hugging Face Hub, but then it might be weird for Lit-GPT to use the HF format by default. Or should such a question be raised in another issue, "On converting weights to checkpoints"?

carmocca commented 7 months ago

This issue doesn't intend to change the process of converting weights in the Lit-GPT format to the HF format (convert_lit_checkpoint.py) or the other way around (convert_hf_checkpoint.py).

It aims to standardize the supported inputs and outputs of our scripts.

awaelchli commented 7 months ago

The proposal here makes sense to me for the use cases where you download weights from HF etc. I'd like to explain my use case though, because I am sure this in-memory approach can't replace the conversion step I introduced in https://github.com/Lightning-AI/lit-gpt/blob/main/scripts/convert_pretrained_checkpoint.py, which triggered this discussion in the first place.

I'm currently pretraining an LLM, saving a bunch of metadata, optimizer states, etc. to the checkpoint so that I can fully resume training at any point. But as soon as I take the checkpoint and use it for evaluation, inference, finetuning, etc., I don't want to keep this unnecessary data around. Note: the optimizer states are a considerable fraction of the checkpoint! So for this use case, I don't see how an in-memory conversion function would be beneficial.
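
To make that concrete, here is a rough way to see how much of a training checkpoint is weights versus training state (a sketch; the "model" key, the path, and the exact ratio depend on the setup; Adam-style optimizers keep roughly two extra buffers per parameter):

```python
import torch

def tensor_bytes(obj) -> int:
    """Recursively sum the storage size of all tensors in a (nested) checkpoint."""
    if torch.is_tensor(obj):
        return obj.numel() * obj.element_size()
    if isinstance(obj, dict):
        return sum(tensor_bytes(v) for v in obj.values())
    if isinstance(obj, (list, tuple)):
        return sum(tensor_bytes(v) for v in obj)
    return 0

ckpt = torch.load("out/pretrain/checkpoint.pth", map_location="cpu")  # hypothetical path
weights = tensor_bytes(ckpt.get("model", {}))
print(f"weights: {weights / 1e9:.1f} GB, training state: {(tensor_bytes(ckpt) - weights) / 1e9:.1f} GB")
```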

Furthermore, think about this: if I left the checkpoint as it is without dropping the metadata and used it for finetuning, I would "silently" load the optimizer states from pretraining, which we most likely don't want! Finally, this con does not apply to my use case:

It will create a duplicate version of the weights. This can be very annoying for large checkpoints in environments with limited disk size such as cloud instances.

Because 1) during pretraining, the most expensive thing is compute; compared to that, storage cost is a joke. We save frequent and large checkpoints during training and don't care about space. Once training is finished, we back up a few milestone checkpoints and delete the remaining terabytes. And 2) why deploy a checkpoint with optimizer states inside if you use it for serving (on your cloud instance with limited disk space)?

So all I'm saying here is that we will still need something like https://github.com/Lightning-AI/lit-gpt/blob/main/scripts/convert_pretrained_checkpoint.py for pretraining, and it should not be replaced by the solution proposed here, in my opinion :)

murdadesmaeeli commented 6 months ago

@awaelchli, how about we implement @carmocca's solution for the finetuning workflow and do nothing (or adopt a different solution) for the pretraining workflow?

A significant portion of users use lit-gpt for finetuning, since pretraining is too costly for them, and simplifying the finetuning workflow could reduce the handful of issues the repo receives each month stemming from the confusion described at the top of this issue.

Sincerely, Mehrdad