foundation-model-stack / fms-fsdp

🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and the SDPA implementation of Flash Attention v2.
https://pytorch.org/docs/stable/fsdp.html
Apache License 2.0

[speculator training] Support for loading different HF checkpoints for speculator training #66

Open pavi2707 opened 3 months ago

pavi2707 commented 3 months ago

While training a speculator on the specu-train branch, I get an OOM error when trying to load a checkpoint in HuggingFace format. The model_type is "gpt_megatron". The script works fine for Llama checkpoints with model_type "llama".
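For reference, a minimal sketch of how the load could be reproduced in isolation with the `transformers` API to keep host memory low (the checkpoint path is hypothetical, and `trust_remote_code=True` is only needed if "gpt_megatron" is a custom architecture rather than one registered in `transformers`):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

ckpt_dir = "/path/to/checkpoint"  # hypothetical path to the HF checkpoint folder

# Inspect the config first; model_type here is reported as "gpt_megatron".
config = AutoConfig.from_pretrained(ckpt_dir, trust_remote_code=True)
print(config.model_type, config.hidden_size, config.num_hidden_layers)

# low_cpu_mem_usage avoids materializing a second full copy of the weights
# on the host while the state dict is being loaded.
model = AutoModelForCausalLM.from_pretrained(
    ckpt_dir,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
```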

Checkpoint folder structure:

[screenshot]

Observed error:

[screenshot]
nairbv commented 3 months ago

What are the sizes of the files, especially pytorch_model.bin? Do we have a safetensors version? How are we loading it?
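A quick way to gather that information, assuming the checkpoint sits in a local folder (the path below is hypothetical):

```python
from pathlib import Path

ckpt_dir = Path("/path/to/checkpoint")  # hypothetical local checkpoint folder

# Report the size of every file so we can see whether pytorch_model.bin
# alone is large enough to explain the OOM.
for f in sorted(ckpt_dir.iterdir()):
    print(f.name, f"{f.stat().st_size / 1e9:.2f} GB")

# Check whether a safetensors version exists; safetensors shards can be
# memory-mapped instead of fully deserialized via torch.load.
has_safetensors = any(f.suffix == ".safetensors" for f in ckpt_dir.iterdir())
print("safetensors available:", has_safetensors)
```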