mpatel31415 opened 3 months ago
https://github.com/Lightning-AI/lightning-thunder/issues/564 could be related
FYI: I wanted to verify that the checkpoint-saving code is correct in Eager mode, so I ran it on each rank and then compared the shapes of the parameters from state_dict with the original model (lit_model.pth) before it was wrapped with FSDP, and compared the values between ranks (to check that they were synchronized). Both the shapes and the values match.
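This is roughly the kind of check I ran (a sketch, not the exact code; the paths and the per-rank file name are placeholders, and it assumes torch.distributed is already initialized with a CUDA backend such as NCCL):

# Sketch of the consistency check described above (placeholder paths).
import torch
import torch.distributed as dist

rank = dist.get_rank()
original_sd = torch.load("checkpoints/meta-llama/Meta-Llama-3-8B/lit_model.pth", map_location="cpu")
saved_sd = torch.load(f"checkpoints/meta-llama/Meta-Llama-3-8B-tuned/lit_model_rank{rank}.pth", map_location="cpu")

# 1) Shapes should match the original (pre-FSDP-wrapping) model.
for name, tensor in saved_sd.items():
    assert tensor.shape == original_sd[name].shape, f"shape mismatch for {name}"

# 2) Values should be identical on every rank, i.e. the ranks are synchronized.
for name, tensor in saved_sd.items():
    local = tensor.to("cuda")
    reference = local.clone()
    dist.broadcast(reference, src=0)  # everyone receives rank 0's copy
    assert torch.equal(local, reference), f"ranks diverge for {name}"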
Small update after discussion with @carmocca about saving checkpoints from Thunder FSDP:
I tried to use the save and get_model_state_dict functions provided by Thunder and then convert the checkpoint into a torch-save checkpoint using dcp_to_torch_save, but I also get a shape error when later trying to use the output with litgpt chat.
Below is the code I used (it should be possible to copy it in place of the code provided in the original description):
import torch.distributed as torch_dist
from thunder.distributed.checkpoint import save, get_model_state_dict, StateDictOptions
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

# Gather the sharded (per-rank) state dict and write it as a distributed checkpoint.
options = StateDictOptions(full_state_dict=False, cpu_offload=False)
state_dict = get_model_state_dict(model, options, rank)
dcp_path = "/lightning-thunder/checkpoints/meta-llama/Meta-Llama-3-8B-tuned/distributed_ckp"
save(state_dict, dcp_path)
torch_dist.barrier()
if rank == 0:
    # Convert the distributed checkpoint into a single torch-save file.
    dcp_to_torch_save(dcp_path, "/lightning-thunder/checkpoints/meta-llama/Meta-Llama-3-8B-tuned/lit_model.pth")
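To narrow down where the shapes go wrong, the converted file can also be inspected directly. A minimal sketch, assuming litgpt is installed; the key names below are what litgpt's GPT model expects, and the converted checkpoint may use different ones, which would itself be informative:

# Sketch: inspect the converted checkpoint and compare shapes against the litgpt config.
import torch
from litgpt import Config

sd = torch.load(
    "/lightning-thunder/checkpoints/meta-llama/Meta-Llama-3-8B-tuned/lit_model.pth",
    map_location="cpu",
)
config = Config.from_name("Meta-Llama-3-8B")

wte = sd.get("transformer.wte.weight")
print("embedding:", None if wte is None else tuple(wte.shape),
      "expected:", (config.padded_vocab_size, config.n_embd))

# Print the first few entries to see the actual key names and shapes.
for name, tensor in list(sd.items())[:10]:
    print(name, tuple(tensor.shape))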
The only option that could make it work now is to train the model with Fabric FSDP, but I haven't tested it yet.
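For reference, a minimal, untested sketch of what that Fabric FSDP setup could look like (assumes the lightning package; build_litgpt_model() and the training loop are placeholders):

# Untested sketch of the Fabric FSDP option (build_litgpt_model() is a placeholder).
import lightning as L
from lightning.fabric.strategies import FSDPStrategy

fabric = L.Fabric(
    devices=8,
    strategy=FSDPStrategy(state_dict_type="full"),  # gather full weights on save
)
fabric.launch()

model = fabric.setup(build_litgpt_model())
# ... training loop ...

# Fabric gathers the sharded parameters when saving; note that the weights
# end up under the "model" key of the saved file.
fabric.save("checkpoints/meta-llama/Meta-Llama-3-8B-tuned/lit_model.pth", {"model": model})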
triage review:
Hi! Is there any update on this? From the Slack discussion, my understanding is that there were 3 options for me to make progress:
Please let me know which direction is the best to follow from your perspective.
🐛 Bug
I was able to train a Llama3-8b model with Thunder for a few steps and then save it. However, when I later try to use litgpt generate or litgpt chat with the saved checkpoint, I get an error about a size mismatch. When I run the training in Eager mode, everything works.
To Reproduce
Please extract this archive, Meta-Llama-3-8B-tuned.zip, and put all the files into a selected directory (let's call it CHECKPOINT_DIR). Here is the license.
These are the Llama-3-8B configuration files (no weights); they can also be downloaded by running:
litgpt download meta-llama/Meta-Llama-3-8B
Copy the benchmarking script from this repo, located at thunder/benchmarks/benchmark_litgpt.py, and add model saving at line 622 (a sketch of the kind of call I mean is shown below). To be sure that the version of the script is the same, I'm also attaching the full, modified file (it's Python code, but I can only attach txt files here): benchmark_litgpt.txt
Let's assume the script is located in the SCRIPT_DIR directory.
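The addition was along these lines. A minimal sketch, assuming a plain torch.save of the training model's state dict on rank 0; the exact code is in the attached benchmark_litgpt.txt, and the path below is a placeholder:

# Sketch of the model saving added around line 622 of benchmark_litgpt.py
# ("model" stands for the possibly FSDP-wrapped training model in the script).
import torch
import torch.distributed as torch_dist

checkpoint_path = "checkpoints/meta-llama/Meta-Llama-3-8B-tuned/lit_model.pth"
if torch_dist.is_initialized():
    torch_dist.barrier()
if not torch_dist.is_initialized() or torch_dist.get_rank() == 0:
    torch.save(model.state_dict(), checkpoint_path)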
For Eager
5E. Run the training for Eager (on dummy data, so the output won't make sense, but it makes the reproduction instructions easier to run).
You should see a new file, lit_model.pth, in the checkpoint directory.
6E. Try to chat with the saved model:
It should run but return garbage.
For Thunder
5T. You can remove the lit_model.pth (but it will be overwritten anyway) and then run:
6T. Try to chat with the saved model:
There is an error:
Complete output
Expected behavior
We should be able to run a model trained with Thunder using the litgpt instructions.
Environment
nvidia-smi output:
Version of packages: