Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Minor issues about model save #6

Open linziyi96 opened 1 year ago

linziyi96 commented 1 year ago
  1. We should probably save the master weights (i.e., in fp32) for a more precise resume.

https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/78745ffce7ea9f513b85341aa70131e732a857eb/accessory/util/misc.py#L355

  2. Maybe we can offer an option to save either a sharded or a consolidated checkpoint. Consolidating the optimizer state of a large model may blow up the memory of a single node (e.g., a 70B model needs >560GB of memory just for its Adam optimizer states).
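For reference, the 560GB figure follows from Adam keeping two fp32 state tensors (`exp_avg` and `exp_avg_sq`) per parameter. A minimal sketch of that arithmetic (the function name and defaults are illustrative, not part of the repo):

```python
def adam_state_bytes(n_params: int,
                     bytes_per_value: int = 4,   # fp32 states
                     states_per_param: int = 2   # exp_avg + exp_avg_sq
                     ) -> int:
    """Rough lower bound on memory needed to hold Adam optimizer
    states for a model with n_params parameters."""
    return n_params * bytes_per_value * states_per_param


# 70B parameters -> 70e9 * 4 bytes * 2 states = 560 GB
print(adam_state_bytes(70_000_000_000) / 1e9, "GB")  # 560.0 GB
```

This is only the optimizer state; consolidating the fp32 master weights on top of that adds another 4 bytes per parameter, which is why keeping checkpoints sharded per rank avoids the single-node memory spike.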